Why you need monitoring even when you’ve had no outages

As sometimes happens, a tech guy at a company tried LogicMonitor, loved it, and then met some resistance from management. In this case, the pushback was “We haven’t had any outages. Why do we need monitoring?”

They already have a monitoring system – the users of their systems (whether that’s customers or internal staff). If they are OK with their users being the first to report outages, then perhaps they don’t need monitoring. If that’s true, perhaps the application doesn’t really matter, and they shouldn’t even be running it and consuming server resources. But if they are running it, then some users are going to be depending on it. In the event of an outage, these users will either be would-be customers who cannot buy because the service is down, costing the company revenue, or internal staff blocked from their work, at a huge overhead cost. (Twenty employees blocked for an hour adds up to a lot of wasted payroll and lost productivity.)

There are only a few ways to have avoided outages without having good monitoring:

  • having someone check stuff frequently, so they are not hitting growth bottlenecks, and are aware of failures ASAP
  • having a lot of redundancy
  • being lucky and gambling

In the first case, they have monitoring – they just aren’t calling it that, as they are doing it manually. If they are having people do the checking, it doesn’t scale. If they are growing at all, then soon their next server or network device or application will not cost them $5000, it will cost them $100,000 per year, as they are going to have to scale headcount to maintain the level of service and manual checking. (Not to mention that it is almost certainly being done far more superficially than an automated system that would be checking hundreds of datapoints per system.)

If they have redundancy, they need to be checking more stuff, even more frequently. A redundant router (or power supply, or server, or RAID system) can take over and mask the failure of another router (or power supply, server, or disk), but that means it can become non-trivial to know that you’ve had a failure, and currently have NO redundancy. In the event of another failure, you then have a hard outage. Even worse, the operational procedures probably assume you have redundancy, and are not set up or staffed to address immediately service-impacting issues, so when a failure in a redundant system does cause an outage, it is not handled in a timely manner.
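The masked-failure problem above can be sketched in a few lines of Python. This is only an illustration, not any product’s API: `RedundantGroup` and `evaluate` are hypothetical names, and the per-member health values are assumed to come from whatever device checks you already run. The point is that a group can be serving traffic and still deserve an alert.

```python
from dataclasses import dataclass


@dataclass
class RedundantGroup:
    """Hypothetical model of a redundant set (routers, power supplies, disks)."""
    name: str
    member_up: dict  # member name -> True if that member is healthy


def evaluate(group: RedundantGroup) -> str:
    """Classify the group by checking every member, not just the service."""
    healthy = [m for m, up in group.member_up.items() if up]
    if len(healthy) == len(group.member_up):
        return "ok"
    if healthy:
        # The service still works, but redundancy is reduced or gone
        return "degraded"
    return "outage"


# Assumed example data: one of two routers has failed silently
routers = RedundantGroup("edge-routers", {"router-a": True, "router-b": False})
print(evaluate(routers))  # degraded
```

A check that only asks “is the service up?” would report nothing here; checking each member is what surfaces the “degraded” state before the second failure.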

If they are relying on being lucky – well, given that the failure rate of all data center devices is significantly above zero (more than 8% per year for some hard drives older than 3 years, per Google), and that all components will over time regress to the mean, their chance of failure is now increasing. (Unlike roulette, where 5 prior results of black have no influence on the next, 5 prior years of no failure do influence the chance of a near-term failure.) That’s not a gamble I’d put my business’s money on.
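To put a rough number on that gamble, here is a back-of-the-envelope sketch. It assumes failures are independent and that every device has the same constant annual failure rate, which is a simplification (the paragraph above argues the real rate rises with age). The function name and the fleet size of 20 drives are assumptions made for illustration; only the 8% figure comes from the text.

```python
def chance_of_any_failure(annual_failure_rate: float, device_count: int) -> float:
    """Probability that at least one of device_count devices fails in a year.

    Assumes independent failures and one constant rate for every device.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** device_count


# 8% is the per-drive figure quoted above; 20 drives is an assumed fleet size
print(f"{chance_of_any_failure(0.08, 20):.0%}")  # 81%
```

Even with a modest fleet, “no failures this year” is the unlikely outcome under these assumptions.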

Outages are the easy and obvious case. Unless their workload is practically unchanging, they will be hitting performance issues. Without performance monitoring, they will be guessing as to where the limitations are. (We have seen cases of people about to spend tens of thousands to upgrade CPU resources, when in fact the limitation was storage IO capacity.)

And then there are the hard issues where there are partial failures. A direct example from my prior life: a load balancer that is up, and serving traffic mostly correctly, but dropping a lot of UDP traffic as it has hit an internal resource limit. So while web pages were being served, geolocation data (carried over UDP) was not working, leading to some incorrect or missing content on those web pages. (And a lot of missed revenue.) With monitoring, alerts triggered about the consumed resources, and the configuration was able to be adjusted to the workload. Without monitoring, it could have taken hours just to identify where to start looking. Monitoring can mean the difference between a 10-minute impact and a days-long event.
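The kind of alert that caught the load balancer issue can be sketched as a simple threshold check on a consumed resource. Everything here is hypothetical: the function name, the 80% and 95% thresholds, and the sample numbers are assumptions for illustration, not the actual device’s limits or any vendor’s defaults.

```python
def check_utilization(used: int, limit: int,
                      warn_at: float = 0.80, crit_at: float = 0.95) -> str:
    """Return an alert level based on how much of a fixed resource is consumed."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= crit_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


# Assumed sample: 9,700 of 10,000 internal table entries in use
print(check_utilization(used=9_700, limit=10_000))  # critical
```

The device in the story was “up” by any ping or page-load test; it is the resource-level datapoint that points straight at the cause.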

And on the softer side – if they want to keep good employees in the networking and sysadmin fields, they need to empower them and give them tools. The people responsible for their systems should have assurance that things are working, even in the night, without having to check. And if there is an issue, they should have the information to resolve it quickly. Even at 2:00 am. And they should not be woken by customers, users or their bosses complaining about issues that are not in their control. That’s a good way to lose good staff.

So. Is it worth investing in a good monitoring system for your servers, infrastructure and applications? Only if you want to keep your employees, users and revenue.