Continuing on from Part 1
No issue should be considered resolved if monitoring will not detect its recurrence.
Even with good monitoring practices in place, outages will occur. Best practice dictates that an issue not be considered resolved until monitoring is in place to detect the root cause, or to provide earlier warning. For example, if a Java application suffers a service-affecting outage because a large number of users overload the system, the earliest warning of an impending issue may be an increase in the number of busy threads, which can be tracked via JMX monitoring. An alert threshold should be placed on this metric to give advance warning before the next event, which could allow time to add another system to share the load, activate load-shedding mechanisms, and so on. (LogicMonitor automatically includes alerts for JMX-enabled applications such as Tomcat and Resin when the active threads approach the configured maximum, but such alerts should be present for all applications, on all monitoring systems.)
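As a rough illustration of such a threshold, here is a minimal Python sketch. It assumes the busy and maximum thread counts have already been collected from the application (for example, over JMX); the collection step is not shown. The names `ThreadPoolSample` and `evaluate_busy_threads`, and the 80% and 95% thresholds, are hypothetical choices for this example and are not taken from LogicMonitor or any other monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThreadPoolSample:
    """One reading of a thread pool (hypothetical structure for this sketch)."""
    busy_threads: int
    max_threads: int


def evaluate_busy_threads(
    sample: ThreadPoolSample,
    warn_ratio: float = 0.80,      # assumed threshold, tune per application
    critical_ratio: float = 0.95,  # assumed threshold, tune per application
) -> Optional[str]:
    """Return "warning" or "critical" when busy threads approach the
    configured maximum, or None when there is nothing to alert on."""
    if sample.max_threads <= 0:
        raise ValueError("max_threads must be positive")

    ratio = sample.busy_threads / sample.max_threads
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return None


if __name__ == "__main__":
    # Example readings: a quiet pool, a pool filling up, a nearly exhausted pool.
    for busy in (50, 170, 195):
        level = evaluate_busy_threads(ThreadPoolSample(busy, 200))
        print(busy, level)
```

Expressing the threshold as a fraction of the configured maximum, rather than as an absolute thread count, means the same rule keeps working if the pool size is later changed.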
This is a very important principle: just because things are working again does not mean the issue should be closed. Close it only when you are satisfied with the warning your monitoring gave before the issue started, and with the alerts and alert escalations that occurred during it. It is possible that the issue is one that cannot be warned of in advance (for example, a sudden system panic), but this evaluation should be undertaken for every service-impacting event.
Steve Francis is an employee at LogicMonitor.