Enterprise organizations know that they need monitoring to ensure site uptime and keep their business running. Yet many sites still suffer outages that are first reported by their customers, often because of small mistakes in how monitoring is set up. These mistakes are easy to make and even easier to overlook, but their consequences can be severe. Here are some of the most common monitoring mistakes and how LogicMonitor addresses them.
This guide covers:
- The five most common monitoring mistakes
- Best monitoring practices to prevent outages and alert storms
- Proactive strategies for monitoring
The first two mistakes from the guide are covered below.
Mistake #1: Relying on Individuals and Human-Driven Processes
A situation we have seen many times flows a bit like this:
- It is the midst of a crisis – were you lucky enough to get Slashdotted?
- A change is made to your data center equipment – a new volume is added to your NetApp so that it can serve as high-speed storage for your web tier.
- Moving quickly, you forget to add the new volume to your NetApp monitoring.
Post-crisis, everyone is too busy breathing sighs of relief to worry about that new volume. It slowly but surely fills up, or starts exhibiting latency under a high rate of I/O operations. No one is alerted, and customers are the first to notice, call in, and complain. Quite possibly, the CTO is the next to call.
Remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.
When looking at solution features, consider the following:
- It examines monitored devices continually for modifications, automatically adding new volumes, interfaces, Docker containers, Kubernetes pods, load balancer VIPs, databases, and any other changes into monitoring. It then informs you of what it added, either by instant message in real time or in batched notifications, whichever you prefer.
- It provides filtering and classification of discovered changes to avoid alert overload.
- It scans your subnets, or even your hyperscale cloud account, and automatically adds new machines or instances to monitoring.
- It automatically creates graphs and intelligent dashboards, and keeps them current. A dashboard graph that sums the sessions on ten web servers to show the health of your service should update automatically when you add four more servers. Automating this collection and presentation keeps your business overview continuous.
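The dashboard point in the last item can be illustrated with a minimal sketch. It assumes a hypothetical in-memory inventory in which each server carries a role; the `Server` class, the `total_sessions` function, and the session counts are invented for illustration and are not a LogicMonitor API. The idea is that the graph is defined by membership ("every server with the web role") rather than by a hard-coded list of ten hosts.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical inventory record, for illustration only.
    name: str
    role: str
    sessions: int


def total_sessions(inventory, role="web"):
    """Sum sessions over every server that currently has the given role."""
    return sum(s.sessions for s in inventory if s.role == role)


# Ten web servers, each reporting 100 sessions (made-up numbers).
inventory = [Server(f"web{i:02d}", "web", 100) for i in range(1, 11)]
print(total_sessions(inventory))  # 1000

# Four more servers are added; the same query now covers all fourteen.
inventory += [Server(f"web{i:02d}", "web", 100) for i in range(11, 15)]
print(total_sessions(inventory))  # 1400
```

Because the sum is computed from current membership, adding servers changes the result without anyone editing the dashboard definition.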
Do not depend on manual monitoring updates to cover adds, moves, and changes.
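As a rough sketch of what automated discovery replaces, the reconciliation step amounts to comparing what a device reports against what is already monitored. The function names and volume names below are hypothetical, and the discovered list stands in for whatever a real discovery scan would return.

```python
def find_unmonitored(discovered, monitored):
    """Return resources present on the device but absent from monitoring."""
    return sorted(set(discovered) - set(monitored))


def reconcile(discovered, monitored):
    """Add newly discovered resources to monitoring and report what was added."""
    added = find_unmonitored(discovered, monitored)
    monitored.update(added)
    return added


# Hypothetical state: two volumes are monitored, a third was added mid-crisis.
monitored = {"vol0", "vol_db"}
discovered = ["vol0", "vol_db", "vol_web_fast"]

added = reconcile(discovered, monitored)
print(added)  # ['vol_web_fast']
```

Run on a schedule, a check like this closes the gap described in the scenario above: the forgotten volume is picked up on the next pass, and the returned list is what would feed a notification.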
Mistake #2: Considering an Issue Resolved When Monitoring Cannot Detect Recurrence
Outages occur, even when you follow good monitoring practices. An issue is not resolved, though, until monitoring can detect the root cause or has been modified to provide early warning.
For example, a Java application that suffers a service-affecting outage because a large number of users overloaded the system probably exhibited an increase in the number of busy threads. Modify your JMX monitoring to watch for this increase. If you create an alert threshold on this metric, or use a monitoring platform that supports dynamic thresholds, you can receive advance warning next time. Early warning at least provides a window in which to avoid the outage: time to add another system to share the load or to activate load-shedding mechanisms. Configuring alerts in response to downtime allows you to be proactive next time. When you next experience an outage, the root cause should never be a repeat of a preventable event.
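The difference between a fixed threshold and a dynamic one on the busy-thread metric can be sketched as follows. This is an illustration only: the sample values are invented, the function names are hypothetical, and "mean plus k standard deviations" is one simple way to build a dynamic threshold, not a description of how any particular platform computes them.

```python
from statistics import mean, stdev


def static_alert(busy_threads, max_threads, warn_ratio=0.8):
    """Alert when busy threads reach a fixed fraction of the pool size."""
    return busy_threads >= warn_ratio * max_threads


def dynamic_alert(history, current, k=3.0):
    """Alert when the current value exceeds mean + k * stdev of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    return current > mean(history) + k * stdev(history)


# Invented busy-thread samples from a normal period; pool size assumed to be 200.
history = [40, 42, 38, 41, 39, 43, 40, 42]

print(static_alert(75, 200))       # False: 75 is well below 80% of 200
print(dynamic_alert(history, 75))  # True: 75 is far above the usual ~40
print(dynamic_alert(history, 44))  # False: within normal variation
```

In this sketch the fixed threshold stays silent at 75 busy threads while the baseline-relative check fires, which is the kind of earlier warning the paragraph above describes.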
This is a very important principle. Recovery of service is only the first step; it does not mean the issue should be closed or dismissed. You need to be satisfied with the warnings your monitoring solution gave before the issue, and content with the alert types and escalations that triggered during the issue. Some issues offer no way to warn in advance (catastrophic device failure does occur), but this evaluation should be undertaken for every service-impacting event.