So maybe the consequences of an outage in your infrastructure won't be as calamitous as the BP oil spill in the Gulf, but the effect on your enterprise may feel like it. That's why we could all use BP as a case study in how not to treat your monitoring.
News reports detail allegations that some of the alarms on the failed rig were “inhibited”, because “they did not want people to wake up at 3 a.m. due to false alarms”. The rig’s operator responded, “Repeated false alarms increase risk and decrease rig safety”.
We’ve said it before and we’ll probably say it again: alert overload is dangerous, no matter what the occupation. Even if your field of responsibility is ensuring the performance and availability of network switches, you need to ensure all alerts are meaningful.
This requires two things:
- processes that are followed (reviewing alerts, reports on top alert sources, etc.)
- tools that allow you to tune and route your alerts. (One area I’d include in this is powerful alert calculations – it is often helpful to be able to express alerts such as “Warn if discards exceed 0.5% of traffic – but only if inbound traffic is greater than 1 Mbps”.)
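As a concrete illustration of the conditional alert above, here is a minimal sketch in Python. The function and metric names (`should_warn`, `discards_per_sec`, and so on) are hypothetical, not any particular monitoring product's syntax; the point is the shape of the logic: a threshold expressed as a ratio, gated by a minimum-traffic condition so quiet interfaces don't generate noise.

```python
DISCARD_RATIO_THRESHOLD = 0.005   # warn when discards exceed 0.5% of traffic
MIN_INBOUND_BPS = 1_000_000       # 1 Mbps; below this, discard rates are noise

def should_warn(discards_per_sec: float, traffic_per_sec: float,
                inbound_bps: float) -> bool:
    """Return True only when both conditions hold, suppressing
    alarms on near-idle interfaces (hypothetical metric names)."""
    if inbound_bps <= MIN_INBOUND_BPS:
        return False  # too little traffic to judge the discard rate
    if traffic_per_sec <= 0:
        return False  # avoid dividing by zero on a dead interface
    return discards_per_sec / traffic_per_sec > DISCARD_RATIO_THRESHOLD

# A busy link discarding 1% of packets warns; the same discard rate
# on a link pushing only 0.5 Mbps inbound does not.
print(should_warn(100, 10_000, 5_000_000))  # True
print(should_warn(100, 10_000, 500_000))    # False
```

The gating condition is what keeps the alert meaningful: without it, a mostly-idle port dropping a handful of packets would page someone at 3 a.m. for nothing.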
If your monitoring system can’t do this kind of alerting, and easily let you tune it at the group, host, or instance level, you’re either wasting time dealing with false alerts, or have “inhibited” the alarms on your own oil rig. And we’ve all seen where that leads…
Want to make your monitoring less lethal? Check out LogicMonitor, which automates your monitoring setup and delivers only the alerts that matter.