Top I.T./Datacenter Monitoring Mistakes, Part 1 in a series.

LogicMonitor Opinion post

Everyone knows they need monitoring to ensure site uptime and keep their business humming.  Yet many sites still suffer outages that are first reported by their customers.  Here at LogicMonitor, we have lots of experience with monitoring systems of all kinds.  These are some of the most common mistakes we have seen, and how to address them, so that you know about issues before they affect your business:

Relying on people and human processes to ensure things are monitored.
People are funny, lovable, amazing creatures, but they are not very reliable.  A situation we have seen many times is that during a crisis (say, you were lucky enough to get Slashdotted), some change is made to some data center equipment (such as adding a new volume to your NetApp so that it can serve as high speed storage for your web tier).  But in the heat of the moment, the new volume is not put into your NetApp monitoring.  (“I’ll get to that later” are famous last words.)
After the crisis is over, everyone is too busy breathing sighs of relief to worry about adding that new volume into monitoring.  So when it fills up in 6 months, or starts having latency issues due to a high rate of IO operations, no one is alerted, and customers are the first to call in and complain.  The CTO is the next to call.  Uh oh.

One of LogicMonitor’s design goals has always been to remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.  We do this in a few different ways:

  • LogicMonitor’s ActiveDiscovery (TM) process continuously scans all monitored devices for changes, automatically adding new volumes, interfaces, load balancer VIPs, database instances, etc. into monitoring, and informing you via email in real time (or in batched notifications, as you prefer).  However, to avoid Alert Overload, you’ll need to ensure your monitoring supports filtering and classification of discovered objects.
  • LogicMonitor can scan your subnets, or even your Amazon EC2 account, and add new machines or instances to monitoring automatically.
  • Just as important, graphs and dashboards should not have to be manually updated.  Say you have a dashboard graph, commonly used to view the health of your service, that shows the sum of the sessions on 10 web servers.  What happens when you add 4 more servers?  A good monitoring system will automatically add them to your dashboard graph.  A poor system will require you to remember to update every place where such information is aggregated, virtually ensuring your “business overview” will be anything but.
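
To make the ideas in the list above concrete, here is a minimal Python sketch of the general technique, not LogicMonitor's actual implementation. The Monitor class, the reconcile method, the total_sessions function, and all device, volume and host names are hypothetical, invented purely for illustration. It shows two things: comparing what a device reports against what is already monitored (with exclusion filters to limit alert noise), and computing a dashboard total from whichever hosts currently match a group pattern rather than from a hand-maintained list.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Monitor:
    """Hypothetical in-memory stand-in for a monitoring system's state."""
    exclude_patterns: list = field(default_factory=list)
    monitored: dict = field(default_factory=dict)  # device name -> set of objects

    def reconcile(self, device, discovered):
        """Start monitoring newly discovered objects; return the ones added."""
        # Drop anything matching an exclusion filter (e.g. snapshot volumes).
        wanted = {
            name for name in discovered
            if not any(fnmatchcase(name, pat) for pat in self.exclude_patterns)
        }
        known = self.monitored.setdefault(device, set())
        added = wanted - known
        known |= added
        return added


def total_sessions(sessions_by_host, group_pattern):
    """Sum sessions over every host that matches the group pattern right now."""
    return sum(
        count for host, count in sessions_by_host.items()
        if fnmatchcase(host, group_pattern)
    )


monitor = Monitor(exclude_patterns=["*snapshot*"])
monitor.reconcile("netapp1", {"vol0", "vol1"})

# Someone adds a volume during a crisis; the next scan picks it up.
added = monitor.reconcile(
    "netapp1", {"vol0", "vol1", "vol_web_tier", "vol1_snapshot"}
)
print(added)  # {'vol_web_tier'}

# The dashboard total follows the "web*" group, so new servers are included
# without anyone editing the graph definition.
print(total_sessions({"web01": 10, "web02": 20, "web03": 5, "db01": 99}, "web*"))  # 35
```

The design point is that neither function takes a hand-written list of things to watch: the set of monitored objects and the members of the aggregate are both derived from what exists at scan time.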

In short, never depend on people to manually update monitoring to cover adds, moves and changes, because you know it won’t happen.