The most important monitoring report that you are not using

Even with a great monitoring system, it can be hard to keep the noise down. (Indeed, the more powerful the monitoring, the harder this can be, as more data is collected and tested automatically.) Keeping noise down is vital, because staff will start ignoring alerts if too many of them are meaningless.

There are of course best practices to help with this process, but one of the best ways to start attacking your alert noise is also one of the easiest – simply set up a report to highlight where the noise is coming from, and review it once a week.

Under the Reports tab, select “Add” and then “Report.” You will be shown a table of pre-built report templates from which you should choose “Alerts” and fill it out as shown below.

[Screenshot: Alerts report settings]

I suggest setting the report to cover the last week, for all hosts. (If you are responsible for only a set of hosts, by all means change the report to reflect just the ones you are alerted about.) Exclude alerts that occurred during periods of Scheduled DownTime (SDT), since those alerts would not have been sent out anyway. Check the Summarize Alert Counts box, and THEN choose to sort by alert count. (This sort order is not available until the Summarize Alert Counts box is checked.)

Run this report, and you’ll get output like the below:

[Screenshot: report output with alert counts summarized and sorted]

This makes it very easy to see that, in this case, we could eliminate 80% of last week’s alerts simply by changing the monitoring on the IPMI event logs of one development host: filtering out alerts, using SDT, or even disabling that monitoring, given that it’s just a development host.
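If you want to see the arithmetic behind a summary like this, here is a minimal Python sketch. It is not how the report is implemented; it only assumes you have a week of alerts as a list of (host, datasource) pairs. The host names, datasource names, and counts are made up for illustration, and `summarize_alerts` is a hypothetical helper, not part of any product API.

```python
from collections import Counter

# Hypothetical sample: one week of alerts as (host, datasource) pairs.
# Names and counts are invented purely for illustration.
alerts = (
    [("dev-host-1", "IPMI Event Log")] * 8
    + [("web-1", "CPU")] * 1
    + [("db-1", "Disk Latency")] * 1
)


def summarize_alerts(alerts):
    """Return (source, count, share) rows sorted by alert count, highest first."""
    counts = Counter(alerts)
    total = sum(counts.values())
    # most_common() yields sources ordered from most to fewest alerts.
    return [(source, count, count / total) for source, count in counts.most_common()]


for (host, datasource), count, share in summarize_alerts(alerts):
    print(f"{host:12} {datasource:16} {count:3d} {share:6.0%}")
```

In this made-up sample, a single source accounts for 8 of the 10 alerts, which is the kind of pattern the summarized report surfaces at a glance.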

We can then work through the top noise makers, tuning, disabling, or fixing issues, which will greatly reduce the amount of alert noise with the least work.

We can then have this report emailed to us every Monday, so we stay on top of the issues and keep our monitoring meaningful. That way, we’ll have improved the performance of our systems and cut out the alert noise, and if we do get an alert, we can be sure it’s meaningful and that people will react to it.