Continuing on the series of common Datacenter monitoring mistakes…
This is one of the most dangerous conditions. If you have too many noisy alerts that go off too frequently, people will tune them out, and when a real, service-impacting alert arrives, it will be tuned out too. I’ve seen critical production service outage alerts put into scheduled downtime for 8 hours because the admin assumed it was “another false alert”. How do you prevent this?
- start with sensible defaults, and sensible escalation policies. Distinguish between warnings (which admins should be aware of, but which do not require immediate action) and error or critical level alerts, which require pager notifications. (No need to wake people if NTP is out of synchronization, but if the primary database volume is experiencing 200 ms latency for read requests from its disk storage, and end user transaction time is now 8 seconds, then it is all hands on deck.)
- route the right set of alerts to the right group of people. There is no point in the DBA being alerted about network issues, or vice versa.
- make sure you tune your thresholds appropriately. Every alert should be real and meaningful. If any alerts are ‘false positives’ (such as alerts about QA systems being down), tune the monitoring. LogicMonitor alerts are easily tuned at the global, group or host level, or even at the level of an individual instance (such as a file system or an interface), and ActiveDiscovery filters make it simple to classify discovered instances into the appropriate group, with the appropriate alert levels. A common example is to set all discovered load balancing VIPs or storage system volumes with “stage” or “qa” in the name to have no error or critical alerts. This then applies to all such VIPs or volumes, created now or in the future, on all devices, which greatly simplifies alert management.
- ensure alerts are acknowledged, dealt with, and cleared. You don’t want to see hundreds of alerts on your monitoring system. For large enterprises, make sure you can see a filtered view of just the groups of systems you are responsible for, allowing focus. You should also periodically sort alerts by duration, and focus on cleaning out those that have been in alert for longer than a day.
- report on your top alerts, by host or by alert type. Investigate whether there are issues in the monitoring, the systems, or the operational processes that can be fixed to reduce the frequency of these alerts.
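The points above can be sketched in a few lines of Python. This is only an illustration, not LogicMonitor's API or implementation: the `Alert` fields, the `ROUTES` table, the team names and the 24 hour cutoff are all assumptions made up for the example. It caps alerts on instances with “stage” or “qa” in the name at warning level, pages only for error and critical alerts, routes by category, and produces the two housekeeping reports (long-running alerts and top alerting hosts).

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical routing table: alert category -> team to notify.
ROUTES = {"database": "dba", "network": "netops"}
# Name fragments that mark non-production instances (assumed naming scheme).
NON_PROD_MARKERS = ("stage", "qa")


@dataclass
class Alert:
    host: str
    instance: str          # e.g. a volume or VIP name
    category: str          # e.g. "database" or "network"
    severity: str          # "warning", "error" or "critical"
    duration_hours: float  # how long the alert has been open


def effective_severity(alert):
    """Cap non-production instances at warning so they never page."""
    name = alert.instance.lower()
    if any(marker in name for marker in NON_PROD_MARKERS):
        return "warning"
    return alert.severity


def route(alert):
    """Return (team, channel): page for error/critical, email otherwise."""
    team = ROUTES.get(alert.category, "ops")  # "ops" is a fallback team
    paging = effective_severity(alert) in ("error", "critical")
    return team, "page" if paging else "email"


def stale_alerts(alerts, max_hours=24):
    """Alerts open longer than max_hours, longest first."""
    stale = [a for a in alerts if a.duration_hours > max_hours]
    return sorted(stale, key=lambda a: a.duration_hours, reverse=True)


def top_hosts(alerts, n=3):
    """The n hosts with the most alerts, as (host, count) pairs."""
    return Counter(a.host for a in alerts).most_common(n)
```

Plain substring matching on “qa” is crude and would also catch unrelated names that happen to contain those letters, so a real filter would likely anchor on a naming convention instead.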
Steve Francis is an employee at LogicMonitor.