Most organizations know that they need monitoring to ensure site uptime and keep their business running. Yet many sites still suffer outages first reported by their customers, thanks to small mistakes made with their monitoring systems. Monitoring mistakes are easy to make and easy to overlook, but the consequences can be severe. Here are some of the most common monitoring mistakes and how to address them.
Mistake #1: Relying on Individuals and Human-Driven Processes
A situation we have seen many times flows a bit like this:
- You are in the midst of a crisis – perhaps you were lucky enough to get Slashdotted?
- A change is made to your data center equipment – a new volume is added to your NetApp so that it can serve as high-speed storage for your web tier.
- Moving quickly, you forget to add the new volume to your NetApp monitoring.
Post-crisis, everyone is too busy breathing sighs of relief to worry about that new volume. It slowly but surely fills up, or starts exhibiting latency due to high I/O load. No one is alerted, and customers are the first to notice, call in, and complain. Quite possibly, the CTO is the next to call.
Remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.
When evaluating a monitoring solution, look for the following:
- It continually examines monitored devices for modifications, automatically adding new volumes, interfaces, Docker containers, Kubernetes pods, load balancer VIPs, databases, and any other changes into monitoring. It then informs you in real time via instant message, or with batched notifications, whichever you prefer.
- It provides filtering and classification of discovered changes to avoid alert overload.
- It scans your subnets, or even your hyperscale cloud account, and automatically adds new machines or instances to monitoring.
- It automatically builds and updates graphs and intelligent dashboards. A dashboard graph that sums the sessions across ten web servers to show the health of your service should update automatically when you add four more servers. Automating this collection and representation keeps your business overview continuous.
Do not depend on manual monitoring updates to cover adds, moves, and changes.
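The subnet-scanning behavior described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `discover_new_hosts` and the injected `is_reachable` probe are hypothetical names, and a real discovery engine would probe via ping, SNMP, or cloud APIs rather than a supplied function.

```python
import ipaddress
from typing import Callable, List, Set

def discover_new_hosts(subnet: str,
                       monitored: Set[str],
                       is_reachable: Callable[[str], bool]) -> List[str]:
    """Scan a subnet and return hosts that respond but are not yet monitored."""
    new_hosts = []
    for ip in ipaddress.ip_network(subnet).hosts():
        addr = str(ip)
        if addr not in monitored and is_reachable(addr):
            new_hosts.append(addr)
    return new_hosts

# Illustration: pretend .2 and .5 respond to probes; .2 is already monitored.
reachable = {"192.0.2.2", "192.0.2.5"}
found = discover_new_hosts("192.0.2.0/29", {"192.0.2.2"}, lambda a: a in reachable)
print(found)  # ['192.0.2.5']
```

A discovery pass like this, run on a schedule, is what removes the human from the "remember to add it to monitoring" loop.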
Mistake #2: Considering an Issue Resolved When Monitoring Cannot Detect Recurrence
Outages occur, even when you follow good monitoring practices. An issue is not truly resolved, though, until you have ensured that monitoring can detect the root cause, or has been modified to provide early warning of recurrence.
For example, a Java application that suffered a service-affecting outage because a surge of users overloaded the system probably exhibited an increase in the number of busy threads beforehand. Modify your JMX monitoring to watch for this increase. If you create an alert threshold on this metric, or use a monitoring platform that supports dynamic thresholds, you will receive advance warning next time. Early warning at least provides a window in which to avoid the outage: time to add another system to share the load or activate load-shedding mechanisms. Configuring alerts in response to downtime lets you be proactive next time. The next time you experience an outage, the root cause should never point to a repeated, preventable event.
This is a very important principle. Recovery of service is only the first step; it does not mean the issue should be closed or dismissed. You need to be satisfied with the warnings your monitoring solution gave before the issue, and content with the alert types and escalations that triggered during it. The issue may be one with no way to warn in advance – catastrophic device failure does occur – but this evaluation should be undertaken for every service-impacting event.
Mistake #3: Alert Overload
Alert overload and the fatigue it causes are among the most damaging conditions a team can face. Too many alerts, triggered too frequently, result in people tuning out all alerts. You can end up in a situation where a critical production outage alert is placed into scheduled downtime for eight hours because the admin assumed it was another false alarm.
You must prevent this by:
- Adopting sensible escalation policies that distinguish between warnings and error or critical alert levels. There may be no need to wake people if NTP is out of sync, but if the primary database volume is seeing 200ms latency and transaction time is 18 seconds for an end-user, that is critical. You need to be on it, no matter the time.
- Routing the right alerts to the right people. Don’t alert the DBA about network issues and do not tell the networking group about a hung transaction.
- Tuning your thresholds. Every alert must be real and meaningful. Tune the monitoring to get rid of false positives or alerts triggered on test systems.
- Investigating alerts triggered when everything seems okay. If you find there was no outward issue, adjust thresholds or disable the alert.
- Ensuring alerts are acknowledged, resolved, and cleared. Hundreds of unacknowledged alerts make it difficult to pick out an immediate issue. Use alert filtering to view only the groups of systems for which you are responsible.
- Analyzing your top alerts by host or by alert type. Investigate to see whether remedying issues in monitoring, systems, or operational processes can reduce the frequency of these alerts.
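The first two points above, sensible severity levels and routing alerts to the right team, can be sketched as a simple routing table. The channel names and the `route_alert` function are hypothetical; real platforms express this as escalation chains and notification policies rather than a dictionary.

```python
# Hypothetical routing table: (severity, team) decides the delivery channel.
# Warnings go to email; only critical alerts page an on-call engineer.
ROUTES = {
    ("critical", "database"): "page-dba-oncall",
    ("critical", "network"):  "page-network-oncall",
    ("warning",  "database"): "email-dba-team",
    ("warning",  "network"):  "email-network-team",
}

def route_alert(severity: str, team: str) -> str:
    """Return the delivery channel for an alert; anything unmatched
    falls through to a ticket queue instead of waking someone up."""
    return ROUTES.get((severity, team), "ticket-queue")

print(route_alert("critical", "database"))  # page-dba-oncall
print(route_alert("warning", "network"))    # email-network-team
print(route_alert("info", "storage"))       # ticket-queue
```

The key property is that the default path is the quiet one: an alert only pages a human when a deliberate routing decision says it should.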
Mistake #4: Monitoring System Sprawl
You only need one monitoring system. Do not deploy a monitoring system for Windows servers, another for Linux, another for MySQL, and another for storage. Even if each system is highly functional and capable, running several of them fragments visibility and slows troubleshooting. Your teams need one place to monitor as many different technologies as possible, ensuring they are ‘singing from the same hymn sheet’ and not pointing fingers at each other. It is often tempting to use the tools available from each technology vendor, but this means your teams will be logging into different platforms and working from a skewed view of the situation.
A central location for your team’s contact details is also vital. You do not want up-to-date information in the escalation methods of two systems but stale details in the other two. You do not want maintenance correctly scheduled in one monitoring system but not in the one tracking other components of the same service. You will experience incorrectly routed alerts, ultimately resulting in alert overload. A system that notifies people about issues they cannot acknowledge leads to ‘Oh… I turned my cell phone notification off.’
A variant of this problem is when your SysAdmins and DBAs automate things by writing cron jobs or stored procedures to check and alert on issues. The first part is great – the checking. However, alerting should happen through your monitoring system. Just have the monitoring system run the script and check the output, call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on. Locally run, one-off alerting hacks will not incorporate these monitoring system features. Furthermore, this approach creates monitoring silos, where some team members have data and alerts, and some do not.
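Having the monitoring system run the script, as suggested above, usually means writing the check in the plugin style many monitoring tools understand: print a one-line status and signal severity through the exit code. This is a generic sketch; the metric, thresholds, and `check_replication_lag` name are all invented for illustration.

```python
#!/usr/bin/env python3
"""Plugin-style check: the script only checks and reports. The monitoring
system interprets the exit code and owns alerting, thresholds,
acknowledgement, and escalation - not the script itself."""
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_replication_lag(lag_seconds: float,
                          warn: float = 30.0,
                          crit: float = 120.0) -> int:
    """Map a measured lag to a conventional OK/WARNING/CRITICAL exit code."""
    if lag_seconds >= crit:
        print(f"CRITICAL - replication lag {lag_seconds:.0f}s (>= {crit:.0f}s)")
        return CRITICAL
    if lag_seconds >= warn:
        print(f"WARNING - replication lag {lag_seconds:.0f}s (>= {warn:.0f}s)")
        return WARNING
    print(f"OK - replication lag {lag_seconds:.0f}s")
    return OK

if __name__ == "__main__":
    # In production the lag would be measured, not passed on the command line.
    lag = float(sys.argv[1]) if len(sys.argv) > 1 else 5.0
    sys.exit(check_replication_lag(lag))
```

Because the script does nothing but check, the thresholds, escalation rules, and acknowledgement workflow all stay in the monitoring system, where everyone can see and adjust them.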
Mistake #5: Not Monitoring Your Monitoring System
Your monitoring solution can fail. Ignoring this fact only leaves you exposed. Companies invest significant capital to set up monitoring, and accept the recurring cost in staff time, but then fail to monitor the system itself. Who will know when it suffers a hard drive or memory failure, an OS or application crash, a network outage at your ISP, or a power failure? Don’t let your monitoring system leave you blind to the health of the rest of your infrastructure. Monitoring the monitoring system means the complete system, including its ability to send alerts. If the outgoing mail and SMS delivery connection is down, your monitoring system might detect an outage, but only staff watching the console will see it. A system that cannot send alerts is not helping.
False security is worse than having no monitoring system at all. If you do not have a monitoring system, you know you need to execute manual health checks. If you have an unmonitored system that is down, you’re not executing health checks and you’re unwittingly exposing the business to an undetected outage. If your teams develop a lack of faith in the reliability of your monitoring tool, they may well start to question the validity of alerts it produces.
Minimize your risk by configuring a check of your monitoring system from a location outside its reach. Or select a monitoring solution that is not only hosted in a separate location, but also checks the health of its own monitoring infrastructure from multiple locations.
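One common shape for the external check described above is a dead man's switch: the monitoring system writes a heartbeat somewhere an independent watcher can see, and the watcher alarms when the heartbeat goes stale. This is a generic sketch under that assumption; the function name and the 300-second window are illustrative choices, not a product feature.

```python
import time

def monitoring_is_alive(last_heartbeat: float,
                        now: float,
                        max_silence: float = 300.0) -> bool:
    """Dead man's switch: an external watcher treats the monitoring
    system as down if it has not heartbeated within max_silence seconds."""
    return (now - last_heartbeat) <= max_silence

now = time.time()
print(monitoring_is_alive(now - 60, now))   # True: heartbeat one minute ago
print(monitoring_is_alive(now - 900, now))  # False: silent for 15 minutes
```

The watcher must live outside the monitoring system's failure domain, on a different host, network, and power source, or it fails along with the thing it is supposed to watch.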
The best way to address all of these mistakes is to find a comprehensive monitoring platform that does the work for you. LogicMonitor is a cloud-based monitoring platform that enables organizations to see what’s coming before it happens. With advanced AIOps features, powered by LM Intelligence, LogicMonitor helps teams proactively identify and resolve IT infrastructure issues before they can adversely affect business-critical systems and end-user performance. If you are interested in learning more about LogicMonitor’s capabilities, see our platform demo or request a free trial.