Why CPU load should not (usually) be a critical alert.

One question that often arises in monitoring is how to define alert levels and escalations, and what level to set various alerts at – Critical, Error or Warning.

Assuming you have Error and Critical alerts set to notify teams by pager/phone, with a shorter escalation time for Critical alerts, here are some simple guidelines:

Critical alerts should be for events that have an immediate customer-impacting effect. For example, a production Virtual IP on a monitored load balancer goes down because it has no available services to route traffic to. The site is down, so page everyone.

Error alerts should be for events that require immediate attention and that, if unresolved, increase the likelihood that a production-affecting event will occur. To continue with the load balancer example, an Error should be triggered if the Virtual IP has only one functioning backend server to route traffic to – there is now no redundancy, so one failure can take the site offline.

Warnings, which we typically recommend be sent by email only, are for all other kinds of events. The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.
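The three load balancer cases above can be sketched as a small decision function. This is only an illustration of the guideline, not LogicMonitor's implementation: the `Severity` enum, the `vip_severity` name, and the exact cut-offs are assumptions made for the example.

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def vip_severity(healthy: int, total: int) -> Severity:
    """Map healthy backend count behind a Virtual IP to an alert level.

    Hypothetical helper: the cut-offs mirror the guidelines in the text,
    not any product's built-in thresholds.
    """
    if healthy <= 0:
        # Nothing left to route traffic to: the site is down.
        return Severity.CRITICAL
    if healthy == 1:
        # Still serving, but one more failure takes the site offline.
        return Severity.ERROR
    if healthy < total:
        # Some capacity lost, redundancy remains: email only.
        return Severity.WARNING
    return Severity.OK
```

With 21 servers behind the Virtual IP, `vip_severity(20, 21)` yields a Warning, `vip_severity(1, 21)` an Error, and `vip_severity(0, 21)` a Critical. Note that under these assumed cut-offs a pool with a single backend always sits at Error, since it never has redundancy.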

When deciding what level to assign an alert, consider the primary function of the device. A NetApp storage array, for example, exists to serve read and write IO requests, so the primary things to monitor on a NetApp are the availability and performance (latency) of those requests. If a volume is servicing requests with high latency, such as 70 ms per write request, that should be an Error level alert. (In some enterprises it may be appropriate to configure this as a Critical level alert, but usually a Critical performance alert should be triggered only if end-application performance degrades unacceptably.)

However, if CPU load on the NetApp is 99% for a period, then even though it sounds alarming, I’d suggest treating it as a Warning level alert only. If latency is not impacted, why wake people at night? Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if they remain uncorrected or unacknowledged for more than 5 hours, say.)
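As a rough sketch of that reasoning, the rule below alerts at Error level on the device's primary function (write latency) and only at Warning level on CPU load, with an optional time-based escalation for Warnings. The function names and default thresholds are hypothetical; the 70 ms, 99% and 5 hour values are simply the figures used as examples in the text, not recommended settings.

```python
from datetime import timedelta


def storage_alert_level(write_latency_ms: float,
                        cpu_percent: float,
                        latency_error_ms: float = 70.0,
                        cpu_warning_percent: float = 99.0) -> str:
    """Classify a storage array sample as 'error', 'warning' or 'ok'."""
    if write_latency_ms >= latency_error_ms:
        # The primary function (serving IO) is impaired.
        return "error"
    if cpu_percent >= cpu_warning_percent:
        # Sounds alarming, but IO is still being served normally.
        return "warning"
    return "ok"


def should_page(level: str,
                unacknowledged_for: timedelta,
                warning_escalation: timedelta = timedelta(hours=5)) -> bool:
    """Decide whether an alert pages someone or stays email-only."""
    if level in ("error", "critical"):
        return True
    if level == "warning":
        # Warnings page only after sitting unacknowledged too long.
        return unacknowledged_for > warning_escalation
    return False
```

Under these assumptions, a sample with 5 ms write latency and 99% CPU is a Warning that goes out by email, and it pages only if nobody has acknowledged it after more than 5 hours.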

Alert overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of devices: set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high and CPU load is also in alert, that helps diagnose the issue, rather than leaving you to hunt for unusual volume activity.)

Reserve Critical alerts for system performance and availability as a whole.

With LogicMonitor hosted monitoring, alert thresholds for all data center devices are predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.
