Why CPU load should not (usually) be a critical alert.

One question that often arises in monitoring is how to define alert levels and escalations, and which level – Critical, Error or Warning – to assign to various alerts.

Assuming you have Error and Critical alerts set to notify teams by pager or phone, with a shorter escalation time for Critical alerts, here are some simple guidelines:

Critical alerts should be reserved for events with an immediate, customer-impacting effect. For example: a production Virtual IP on a monitored load balancer goes down because it has no available backend servers to route traffic to. The site is down, so page everyone.

Error alerts should be for events that require immediate attention and that, if unresolved, increase the likelihood of a production-affecting event. To continue the load balancer example, an Error should be triggered if the Virtual IP has only one functioning backend server to route traffic to – there is now no redundancy, so a single failure can take the site offline.

Warnings, which we typically recommend be sent by email only, are for all other kinds of events. The loss of a single backend server from a Virtual IP when 20 other servers are still functioning does not warrant waking anyone in the night.
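
To make the load balancer example concrete, here is a minimal sketch in Python of how those three levels might map to the number of healthy backend servers behind a Virtual IP. The function name, level strings, and thresholds are illustrative assumptions, not any particular monitoring product's API.

```python
# Hypothetical sketch (not a real monitoring product's API): map the number of
# healthy backend servers behind a load balancer Virtual IP to an alert level.

def vip_alert_level(healthy: int, total: int) -> str:
    """Return an alert level for a production Virtual IP."""
    if healthy == 0:
        return "CRITICAL"  # nothing left to route traffic to: the site is down, page everyone
    if healthy == 1:
        return "ERROR"     # no redundancy left: one more failure takes the site offline
    if healthy < total:
        return "WARNING"   # a backend is down, but plenty remain: email only
    return "OK"            # all backends healthy: no alert

print(vip_alert_level(0, 20))   # CRITICAL
print(vip_alert_level(1, 20))   # ERROR
print(vip_alert_level(19, 20))  # WARNING
```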

When deciding what level to assign an alert, consider the primary function of the device. In the case of a NetApp storage array, for example, that function is to serve read and write IO requests, so the primary focus of NetApp monitoring should be the availability and performance (latency) of those read and write requests. If a volume is servicing requests with high latency – such as 70 ms per write request – that should be an Error level alert. (In some enterprises it may be appropriate to configure that as a Critical alert, but usually a Critical performance alert should be triggered only if end-application performance degrades unacceptably.)

However, if CPU load on the NetApp is 99% for a period, alarming as that sounds, I'd suggest treating it as a Warning level alert only. If latency is not impacted, why wake people at night? Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than, say, 5 hours.)
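
As a minimal sketch of that prioritisation, the logic might look something like the following. The thresholds, function names, and escalation window are assumptions taken from the example above, not defaults from NetApp, LogicMonitor, or any other product.

```python
# Hypothetical sketch of the reasoning above: severity is driven by the device's
# primary function (serving IO with acceptable latency), not by CPU load.
# Thresholds and names are illustrative assumptions, not product defaults.

WRITE_LATENCY_ERROR_MS = 70    # per-write latency that justifies an Error
CPU_WARNING_PERCENT = 99       # CPU load that only justifies a Warning
ESCALATE_AFTER_HOURS = 5       # optionally page if a Warning sits unacknowledged this long

def storage_alert_level(write_latency_ms: float, cpu_percent: float) -> str:
    if write_latency_ms >= WRITE_LATENCY_ERROR_MS:
        return "ERROR"    # the primary function (serving IO) is impaired: wake someone
    if cpu_percent >= CPU_WARNING_PERCENT:
        return "WARNING"  # sounds alarming, but latency is fine: email only
    return "OK"

def should_escalate(level: str, hours_unacknowledged: float) -> bool:
    """Page on a Warning only if it has been ignored for a long time."""
    return level == "WARNING" and hours_unacknowledged > ESCALATE_AFTER_HOURS

print(storage_alert_level(write_latency_ms=80, cpu_percent=40))  # ERROR
print(storage_alert_level(write_latency_ms=5, cpu_percent=99))   # WARNING
print(should_escalate("WARNING", hours_unacknowledged=6))        # True
```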

Alert overload is a bigger danger to most datacenters than most people realise. The thinking is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of your devices: set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high and CPU load is also in alert, that helps diagnose the issue far more quickly than hunting for unusual volume activity.)

Reserve Critical alerts for the performance and availability of the system as a whole.

With LogicMonitor hosted monitoring, the alert definitions for all data center devices have their thresholds predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.