We’ve noted before that one of the problems we frequently run into with enterprises’ monitoring practices is that too many alerts are triggering. Technically, this problem can be solved with best practices (e.g. scheduling downtime on active alerts until they can be addressed, tuning thresholds, and reporting on the most frequent alerts and addressing them first), but sometimes a split in team responsibilities makes it hard to solve. If the team running the monitoring does not know what an appropriate threshold is, or whether an alert is actionable, what then?
We’ve talked about this before, using a security incident at a nuclear weapons facility as an example of poor practices. In contrast, the U.S. Navy nuclear program can be held up as an example of good practices. The Harvard Business Review published an article about how the Department of Defense adopted practices from the U.S. Navy’s nuclear-propulsion program to improve cybersecurity – but the same principles also apply to IT monitoring practices.
Distilled down, here are the best practices from the HBR article that also apply to monitoring:
- Take charge: If you want to operate a high-reliability organization, ensure that message comes from the top and is clear. Your IT assets are probably sitting in some of the most expensive real estate that the enterprise operates (datacenters). The company chose to invest in the procurement process, hardware, backup and DR for these systems, as well as the purchase or development of the software. It wouldn’t be running the datacenters if they did not add value to the organization. Not all systems will be of equal value to the company, but all should be monitored. And if alerts aren’t configured correctly, alerts from less valuable systems can drown out those from critical systems – increasing costly time to resolution.
- Make everyone accountable: Create expectations at all levels that monitoring must be complete (i.e. all service-affecting issues should trigger alerts) and actionable (i.e. no repeated false alerts). Set up dashboards and reports by department showing the number and duration of alerts – and make managers accountable for them.
- Institute training: Establish systematic training on the why and how of monitoring hygiene to help everyone understand its importance – as well as understand how to do it.
- Have consequences: Issues that do not trigger appropriate alerts will occur once in a while as new software is deployed. However, an incident should never be regarded as closed until it has been confirmed that monitoring detected it appropriately and notified the correct people. If that did not occur, ensure it will happen next time before closing the incident. Failure to do so will lead to future undetected outages, which should have consequences for the people responsible – not just the organization.
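To make the accountability point above more concrete, here is a minimal sketch of the kind of per-department report that could feed such a dashboard: the number and total duration of alerts by department, plus the most frequent alerts to address first. The `Alert` record, its field names, and the sample departments are all hypothetical; a real monitoring tool will export its alert history in its own format, so treat this as an illustration rather than a recipe.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical alert record; real tools will expose different fields."""
    department: str
    name: str
    opened: datetime
    closed: datetime


def summarize_by_department(alerts):
    """Return {department: {"count": int, "duration": timedelta}}."""
    summary = defaultdict(lambda: {"count": 0, "duration": timedelta()})
    for alert in alerts:
        entry = summary[alert.department]
        entry["count"] += 1
        entry["duration"] += alert.closed - alert.opened
    return dict(summary)


def most_frequent_alerts(alerts, top=3):
    """Return the `top` most common (department, alert name) pairs with counts."""
    return Counter((a.department, a.name) for a in alerts).most_common(top)


# Made-up sample data, purely for illustration.
t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    Alert("storage", "disk_full", t0, t0 + timedelta(minutes=30)),
    Alert("storage", "disk_full", t0 + timedelta(hours=2),
          t0 + timedelta(hours=2, minutes=45)),
    Alert("network", "link_down", t0 + timedelta(hours=1),
          t0 + timedelta(hours=1, minutes=10)),
]

report = summarize_by_department(alerts)
noisiest = most_frequent_alerts(alerts, top=1)

for department, stats in sorted(report.items()):
    print(department, stats["count"], stats["duration"])
print(noisiest)
```

A report like this only shows where the noise is coming from; deciding which of those alerts are actionable still has to come from the teams that own the systems.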
If your enterprise is running or using IT infrastructure that is providing value to your organization, you should make the decision to have a monitoring system in place. That’s half the battle – the next step is to ensure that you put the organizational practices in place to make sure the monitoring delivers all the value it should.