Most organizations know they need monitoring to ensure site uptime and keep their business running. Yet many sites still suffer outages first reported by their customers, thanks to small mistakes made with their monitoring systems. Monitoring mistakes are easy to make and easy to overlook, but the consequences can be severe. Here are some of the most common monitoring mistakes and how to address them.
A situation we have seen many times flows a bit like this: in the middle of a crisis, a new volume is provisioned to relieve the pressure, but it is never added to monitoring. Post-crisis, everyone is too busy breathing sighs of relief to worry about that new volume. It slowly but surely fills up, or starts exhibiting latency due to high I/O. No one is alerted, and customers are the first to notice, call in, and complain. Quite possibly, the CTO is the next to call.
Remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.
When looking at solution features, consider the following:
Do not depend on manual monitoring updates to cover adds, moves, and changes.
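To make the new-volume scenario concrete, here is a minimal auto-discovery sketch in Python. It is an illustration of the principle, not any product's implementation: the `/proc/mounts` parsing is Linux-specific, and the 80% warning level is an assumed example threshold.

```python
import shutil

def discover_mounts(mounts_file="/proc/mounts"):
    """Discover block-device mount points automatically (Linux),
    so newly added volumes are picked up without human configuration."""
    mounts = []
    try:
        with open(mounts_file) as f:
            for line in f:
                parts = line.split()
                # /proc/mounts format: device mountpoint fstype options dump pass
                if len(parts) >= 3 and parts[0].startswith("/dev/"):
                    mounts.append(parts[1])
    except OSError:
        pass  # non-Linux or unreadable; fall through to the fallback
    return mounts or ["/"]

def check_usage(mount, warn_pct=80.0):
    """Return (percent used, alert?) for one mount point."""
    usage = shutil.disk_usage(mount)
    pct = usage.used / usage.total * 100
    return pct, pct >= warn_pct
```

Run `discover_mounts()` on a schedule rather than once: the whole point is that a volume added next month is monitored automatically.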
Outages occur even when you follow good monitoring practices. An issue is not truly resolved, though, until you have ensured that monitoring detects the root cause, or have modified it to provide early warning.
For example, a Java application that suffered a service-affecting outage because a large number of users overloaded the system probably exhibited an increase in the number of busy threads beforehand. Modify your JMX monitoring to watch for this increase. If you create an alert threshold on this metric, or use a monitoring platform that supports dynamic thresholds, you can receive advance warning next time. Early warning at least provides a window in which to avoid the outage: time to add another system to share the load, or to activate load-shedding mechanisms. Configuring alerts in response to downtime lets you be proactive, so the root cause of a future outage should never point to a repeated, preventable event.
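A dynamic threshold of the kind described above can be sketched as a rolling baseline plus a deviation test. This is a simplified stand-in for what a monitoring platform does internally; the window size, warm-up count, and three-sigma multiplier are illustrative assumptions.

```python
import statistics
from collections import deque

class DynamicThreshold:
    """Flag samples (e.g. JMX busy-thread counts) that deviate
    sharply upward from a rolling baseline."""

    def __init__(self, window=60, sigmas=3.0):
        self.samples = deque(maxlen=window)  # rolling baseline window
        self.sigmas = sigmas

    def observe(self, value):
        """Record one sample; return True if it should alert."""
        alert = False
        if len(self.samples) >= 10:  # need a baseline before alerting
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            # guard against zero stdev on perfectly flat baselines
            if value > mean + self.sigmas * max(stdev, 1e-9):
                alert = True
        self.samples.append(value)
        return alert
```

Feeding it a steady stream of ~20 busy threads produces no alerts; a sudden jump to 200 trips the threshold without anyone hand-tuning a static limit.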
This is a very important principle. Recovery of service is only the first step; it does not mean the issue should be closed or dismissed. You need to be satisfied with the warnings your monitoring solution gave before the issue, and content with the alert types and escalations that triggered during it. The issue may be one with no way to warn in advance – catastrophic device failure does occur – but this evaluation should be undertaken for every service-impacting event.
Alert overload and the fatigue it causes are among the most detrimental conditions a monitoring team can face. Too many alerts, triggered too frequently, result in people tuning out all alerts. You may end up in a situation where an alert for a critical production service outage is put into scheduled downtime for eight hours because the admin assumed it was another false alarm.
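One common mitigation is deduplicating repeats of the same alert so a flapping check cannot flood the on-call rotation. A minimal sketch, assuming a 15-minute cooldown (the key scheme and window length are illustrative, and a real platform would also track escalation and acknowledgement state):

```python
import time

class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds=900):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> timestamp of last notification

    def should_send(self, alert_key, now=None):
        """Return True if this alert should notify, False to suppress."""
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # duplicate within the cooldown: suppress it
        self.last_sent[alert_key] = now
        return True
```

Throttling like this treats the symptom; the thresholds generating the noise still need to be fixed at the source.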
You must prevent this by tuning thresholds to eliminate recurring false alerts and by ensuring notifications are routed only to people who can act on them.
You only need one monitoring system. Do not deploy one system for Windows servers, another for Linux, another for MySQL, and another for storage. Even if each system is highly functional and capable, running multiple systems fragments your view of the datacenter and guarantees suboptimal performance. Your teams need one place to monitor as many different technologies as possible, ensuring everyone is 'singing from the same hymn sheet' rather than pointing fingers at each other. It is often tempting to use the tools supplied by each technology vendor, but this means your teams will be logging into different platforms and reviewing a skewed view of the situation.
A central location to store your team's contact details is also vital. You do not want up-to-date contact information in the escalation paths of two systems but not the other two. You do not want maintenance correctly scheduled in one monitoring system but not in the one tracking other components of the same service. The result is incorrectly routed alerts and, ultimately, more alert overload. A system that notifies people about issues they cannot acknowledge leads to 'Oh… I turned my cell phone notifications off.'
A variant of this problem is when your SysAdmins and DBAs automate things by writing cron jobs or stored procedures to check and alert on issues. The first part is great – the checking. However, alerting should happen through your monitoring system. Just have the monitoring system run the script and check the output, call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on. Locally run, one-off alerting hacks will not incorporate these monitoring system features. Furthermore, this approach creates monitoring silos, where some team members have data and alerts, and some do not.
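One way to keep the DBA's check while routing the alerting through the monitoring system is to structure it as a plugin the monitoring system executes, following the widely supported Nagios-style exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A sketch, with hypothetical names; in a real check, the lag value would be queried from the database rather than passed in:

```python
# Nagios-style plugin exit codes, a de facto standard that many
# monitoring systems understand when they run an external check.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_replication_lag(lag_seconds, warn=30, crit=120):
    """Classify a metric into a status code and message.

    The monitoring system interprets the exit code and owns
    threshold adjustment UI aside, alerting, escalation, and
    acknowledgement -- the script only measures and classifies.
    """
    if lag_seconds >= crit:
        return CRITICAL, f"CRITICAL - replication lag {lag_seconds}s"
    if lag_seconds >= warn:
        return WARNING, f"WARNING - replication lag {lag_seconds}s"
    return OK, f"OK - replication lag {lag_seconds}s"

# A real plugin's entry point would end with:
#   print(message); sys.exit(code)
```

The cron job or stored procedure keeps doing the measuring it is good at, while escalation, acknowledgement, and threshold management stay in one place for the whole team.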
Your monitoring solution can fail, and ignoring this fact leaves you exposed. Companies invest significant capital to set up monitoring and accept the recurring cost in staff time, but then fail to monitor the monitoring system itself. Who will know when it suffers a hard drive or memory failure, an OS or application crash, a network outage at its ISP, or a power failure? Don't let your monitoring system leave you blind to the health of the rest of your infrastructure. Monitoring the monitoring system means the complete system, including the ability to send alerts: if the outgoing email and SMS delivery connection is down, your monitoring system might detect an outage, but only staff watching the console will see it. A system that cannot send alerts is not helping.
False security is worse than having no monitoring system at all. If you have no monitoring system, you know you need to execute manual health checks. If you have an unmonitored monitoring system that is down, you are not executing health checks, and you are unwittingly exposing the business to an undetected outage. And once your teams lose faith in the reliability of your monitoring tool, they may well start to question the validity of the alerts it produces.
Minimize your risk by configuring a check of your monitoring system from a location outside its own reach. Or select a monitoring solution that is not only hosted in a separate location, but also checks the health of its own monitoring infrastructure from multiple locations.
The best way to address all of these mistakes is to find a comprehensive monitoring platform that does the work for you. LogicMonitor is a cloud-based monitoring platform that enables organizations to see what’s coming before it happens. With advanced AIOps features, powered by LM Intelligence, LogicMonitor helps teams proactively identify and resolve IT infrastructure issues before they can adversely affect business-critical systems and end-user performance. If you are interested in learning more about LogicMonitor’s capabilities, see our platform demo or request a free trial.
Stuart Carrison is an employee at LogicMonitor.