Network Monitoring: Not all automation is good

LogicMonitor is, as far as I know, the most automated network monitoring system out there. But there is one area where we don’t provide much automation, and we are often asked about it: automated scripts in response to alerts. There are a few reasons why not, all of which flow from our experience running critical production datacenters:

  • There are many cases where you don’t want automated recovery: you want a human to pinpoint the cause of failure and ensure the recovery is done safely. For example, after a master database crash, many DBAs don’t want to restart the database without first determining the cause, whether transactions need to be backed out, whether slaves are still valid replicas, and so on.
  • If a system is important enough to need automated recovery, the right way to do that is to have standby systems, clustered or otherwise available. e.g. multiple web servers behind a load balancer; master-master databases; switches with rapid spanning tree; routers with a rapidly converging IGP (OSPF, EIGRP).
  • If a service or process does need to be automatically restarted on a host, the monitoring system is almost certainly not the right way to do it. Use daemontools or init on Linux, or configure the service to restart in the Services control panel on Windows. Using the monitoring system to remediate this will necessarily be more fragile than OS-level tools.
  • If there are processes that need to be killed and restarted in response to the state of monitored metrics (say, a process that leaks memory and grows too large; I’m looking at you, mongrel), then use a tool designed for that, such as monit.

In all these cases, use your monitoring to tell you whether your recovery mechanisms are working, not to be the recovery mechanisms. Monitor the memory usage of your mongrel processes, and alert only if memory consumption stays higher than you expect for longer than it should if monit were doing its job.
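To make the "higher than you expect, for longer than it should be" rule concrete, here is a minimal Python sketch. It is not how LogicMonitor evaluates alerts; the class name, the `limit_mb` and `grace_seconds` parameters, and the sample format are all assumptions chosen for illustration. The idea is that a brief spike is monit's problem, and only a breach that outlasts monit's expected reaction time becomes an alert.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class SustainedThreshold:
    # Hypothetical values: the memory level monit is supposed to enforce,
    # and how long monit should need to notice and restart the process.
    limit_mb: float
    grace_seconds: float

    def should_alert(self, samples: Iterable[Tuple[float, float]]) -> bool:
        """Return True if memory stayed above limit_mb for at least
        grace_seconds at any point in the samples.

        samples: (timestamp_seconds, memory_mb) pairs in time order.
        """
        breach_start: Optional[float] = None
        for timestamp, memory_mb in samples:
            if memory_mb > self.limit_mb:
                if breach_start is None:
                    breach_start = timestamp  # first sample over the limit
                if timestamp - breach_start >= self.grace_seconds:
                    return True  # recovery tool had its chance and failed
            else:
                breach_start = None  # memory dropped, e.g. after a restart
        return False


rule = SustainedThreshold(limit_mb=300, grace_seconds=120)
spike_then_restart = [(0, 350), (60, 100), (120, 350), (180, 360)]
stuck_high = [(0, 350), (60, 360), (120, 370)]
print(rule.should_alert(spike_then_restart), rule.should_alert(stuck_high))
```

With these assumed numbers, a process that spikes and is restarted within two minutes stays quiet, while one that sits above the limit for two minutes or more raises an alert, which is the signal that the recovery mechanism itself is broken.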

Of course, LogicMonitor can trigger automated script actions in response to alerts: you can set up an agent inside your datacenter to pull all the alerts and send them to a script, which can do … whatever you can script. And there are cases where that’s appropriate. But you should have a good think about your architecture and design before you leap to that as a first resort.
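If you do go down that road, one cautious shape for the script side is sketched below in Python. This is not LogicMonitor's API: the alert fields (`host`, `datapoint`, `severity`), the `dispatch` function, and the handler names are all hypothetical, and how the alerts are actually pulled is left out. The sketch assumes an explicit allowlist of remediations, with anything unlisted or critical handed to a human.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical alert shape: {"host": ..., "datapoint": ..., "severity": ...}
Alert = Dict[str, str]


def dispatch(
    alerts: Iterable[Alert],
    handlers: Dict[str, Callable[[Alert], None]],
    escalate: Callable[[Alert], None],
) -> List[str]:
    """Run a scripted handler only for allowlisted, non-critical alerts.

    Everything else is passed to `escalate`, which stands in for
    whatever pages a human in your environment.
    """
    outcomes = []
    for alert in alerts:
        handler = handlers.get(alert.get("datapoint", ""))
        if handler is None or alert.get("severity") == "critical":
            escalate(alert)  # unknown or serious: a person should look
            outcomes.append("escalated")
        else:
            handler(alert)  # known, low-risk remediation
            outcomes.append("handled")
    return outcomes


cleaned, paged = [], []
alerts = [
    {"host": "web1", "datapoint": "tmp_disk_full", "severity": "warn"},
    {"host": "db1", "datapoint": "mysql_down", "severity": "critical"},
]
handlers = {"tmp_disk_full": cleaned.append, "mysql_down": cleaned.append}
print(dispatch(alerts, handlers, paged.append))
```

Routing critical alerts to a person even when a handler exists is a design choice that mirrors the database example above: the more important the system, the less you want a script guessing at the cause.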