Sometimes, the hardest things are not technical. They’re a result of the politics of working in large organizations, where different groups have different responsibilities. Sometimes this fragmentation of responsibilities allows 82-year-old grandmothers to walk up to the walls of supposedly ultra-secure nuclear weapons facilities. And the lessons we learn from this case can apply just as much to your I.T. monitoring.
The New Yorker printed a story about some pacifists who broke into a nuclear weapons facility: simply cutting wires, walking over open ground, cutting more fences, until they were able to reach the nuclear weapons storage facility and take to it with sledgehammers.
An excerpt from the article reads:
On the night of the Y-12 break-in, a camera that would have enabled security personnel to spot the intruders was out of commission. … about a fifth of the cameras on the fences surrounding the Protected Area were not working that night. One camera did capture someone climbing through a fence. But the security officer who might have seen the image was talking to another officer, not looking at his screen. Cameras and motion detectors at the site had been broken for months. The security equipment was maintained by Babcock & Wilcox, a private contractor that managed Y-12, while the officers who relied on the equipment worked for Wackenhut. Poor communication between the two companies contributed to long delays whenever something needed to be fixed. And it wasn’t always clear who was responsible for getting it fixed. The Plowshares activists did set off an alarm. But security officers ignored it, because hundreds of false alarms occurred at Y-12 every month. Officers stationed inside the uranium-storage facility heard the hammering on the wall. But they assumed that the sounds were being made by workmen doing maintenance.
The salient points here, as they relate to I.T. monitoring, are:
- hundreds of false alarms, causing real alarms to be ignored
- no alerting on significant events, relying instead on people watching (the camera that captured someone climbing through a fence only displayed the image; it did not raise an alert. That’s akin to a graph showing a CPU spiking to 100% without an alert: if you aren’t watching, you miss it)
- no expectation that service-impacting work would be communicated in advance (“Hammering on a wall? Someone must be doing work again.”). In the monitoring world, this wall should have been placed in Scheduled Downtime. (In the real world, the work should have been communicated.)
- a split between those using the monitoring and those maintaining it, resulting in a corporate culture that accepted a high rate of false alerts.
Some of these issues are technically easy to deal with, simply by applying best practices in alert management:
- Scheduling downtime before planned work
- Sending the right alerts to the right people at the right time (send camera-down alerts directly to the maintenance team, for example)
- Implementing a process to regularly review the top sources of alerts, then adjust thresholds, fix underlying problems, or disable alerts, so that false alerts are eliminated and every alert that fires is meaningful.
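The first two practices above can be sketched in a few lines. This is a minimal illustration, not any particular product’s API; the hosts, alert categories, downtime windows, and team names are all hypothetical:

```python
from datetime import datetime

# Sketch of two alert-management practices: suppress alerts that fall
# inside a scheduled downtime window, and route each alert directly to
# the team that can actually fix it. All names here are hypothetical.

DOWNTIMES = [  # (host, start, end) windows registered before planned work
    ("camera-ne-wall", datetime(2024, 1, 10, 22, 0), datetime(2024, 1, 11, 2, 0)),
]

ROUTES = {  # alert category -> team that owns the fix
    "camera_down": "maintenance-team",
    "db_cache": "dba-team",
}

def route_alert(host, category, when, default="noc"):
    """Return the team to notify, or None if the alert is suppressed."""
    for dt_host, start, end in DOWNTIMES:
        if host == dt_host and start <= when <= end:
            return None  # planned work in progress: no alert
    return ROUTES.get(category, default)
```

An alert from a host in scheduled downtime is silently dropped; everything else goes straight to the owning team, falling back to the NOC for unrecognized categories. The point is that routing and suppression are a small amount of logic; the hard part is agreeing on who owns each category.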
However, while these fixes are technically easy, the harder problem is often the split in responsibilities. The monitoring group can point out repeated alerts, but if correcting the issue, or even determining whether it is a false alert, requires help from other teams (DBAs to increase a cache size to prevent cache thrashing, for example), and those teams are not invested in the monitoring, then conditions can continue unchanged.
How do you influence people to be invested in the monitoring? The nuclear weapons facility unified the contractors, so that the people responding to the monitoring and the people maintaining the monitoring and sensors worked for the same company. In the I.T. world, one common approach is to make the people who can control the accuracy of the monitoring feel the pain of the monitoring. Have the people who can adjust the DB cache receive the alerts about it. Similarly, if the monitoring failed to detect issues in advance due to insufficient instrumentation in the application, have the outage calls go to the developers directly. (A few 2 a.m. calls tend to motivate people to correct issues, but this assumes that they take the calls…)
Another approach is to surface alert statistics (alerts triggered, duration of alerts, alert severities, etc.) to management. Even if monitoring is owned by a separate team that cannot change database configurations, when management can see that 84% of alerts are DB related, appropriate actions can be directed to the right teams.
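The kind of summary management needs can be produced with very little code. As a sketch (the alert log format and categories here are hypothetical, and the sample data is made up to illustrate the idea):

```python
from collections import Counter

# Hypothetical sketch: summarize a period of alerts by category so that
# management can see where alert volume is concentrated.

def alert_summary(alerts):
    """alerts: iterable of (category, duration_minutes) tuples.

    Returns {category: (count, percent_of_total, total_minutes)}.
    """
    counts = Counter()
    minutes = Counter()
    for category, duration in alerts:
        counts[category] += 1
        minutes[category] += duration
    total = sum(counts.values())
    return {
        cat: (counts[cat], round(100 * counts[cat] / total, 1), minutes[cat])
        for cat in counts
    }

# Made-up sample data: five DB alerts and one network alert.
alerts = [("db", 12), ("db", 45), ("db", 7), ("db", 30), ("db", 5), ("network", 9)]
summary = alert_summary(alerts)
```

With this sample data, the DB category accounts for five of six alerts, and that concentration is exactly the kind of number that, put in front of management, gets the right team directed to act.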
Both of these approaches are really manifestations of a cultural issue: the company culture has to include buy-in that monitoring is important. From that it naturally follows that all alerts are important, either as real alerts, or as false alerts that should be quickly rectified. We’ve seen successful adoption of a ‘monitoring-first’ culture both top-down, when mandated by management, and bottom-up, when teams have thoroughly adopted monitoring and other teams see the resulting benefits in development throughput and availability.
If you have seen your company adopt a monitoring-first culture, or it has one today, let us know your thoughts on how that happened.