Sometimes, the hardest problems are not technical. They're the result of the politics of working in large organizations, where different groups have different responsibilities. Sometimes this fragmentation of responsibility allows an 82-year-old grandmother to walk up to the walls of a supposedly ultra-secure nuclear weapons facility. The lessons from this case apply just as much to your I.T. monitoring.
The New Yorker printed a story about some pacifists who broke into a nuclear weapons facility: they cut wires, walked over open ground, cut more fences, and eventually reached the nuclear weapons storage facility, which they took to with sledgehammers.
An excerpt from the article reads:
On the night of the Y-12 break-in, a camera that would have enabled security personnel to spot the intruders was out of commission. … about a fifth of the cameras on the fences surrounding the Protected Area were not working that night. One camera did capture someone climbing through a fence. But the security officer who might have seen the image was talking to another officer, not looking at his screen. Cameras and motion detectors at the site had been broken for months. The security equipment was maintained by Babcock & Wilcox, a private contractor that managed Y-12, while the officers who relied on the equipment worked for Wackenhut. Poor communication between the two companies contributed to long delays whenever something needed to be fixed. And it wasn’t always clear who was responsible for getting it fixed. The Plowshares activists did set off an alarm. But security officers ignored it, because hundreds of false alarms occurred at Y-12 every month. Officers stationed inside the uranium-storage facility heard the hammering on the wall. But they assumed that the sounds were being made by workmen doing maintenance.
The salient points here, as they relate to I.T. monitoring, are:

- Monitoring infrastructure (cameras, motion detectors) was left broken for months, leaving known blind spots in coverage.
- The people who maintained the monitoring and the people who relied on it worked for different organizations, with poor communication and no clear ownership of fixes.
- Hundreds of false alarms a month trained the security officers to ignore alerts – including the real one.
- Even a direct signal (hammering on the wall) was explained away as routine, because alerts had lost their credibility.
So some of these issues are technically easy to deal with, simply by using best practices in alert management:

- Fix (or consciously disable and track) broken monitors promptly, so coverage gaps are known and short-lived.
- Tune or suppress alerts that repeatedly fire falsely, so that every alert that does fire means something.
- Escalate alerts that go unacknowledged, rather than letting them scroll past an inattentive screen.
- Route each alert to someone who can actually act on it.
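As a rough illustration of the de-duplication and escalation practices above, here is a minimal sketch in Python. The class name, thresholds, and return values are all illustrative assumptions, not any particular product's API: repeats of the same alert within a window are suppressed rather than re-paged, and an alert that keeps recurring is escalated instead of being allowed to become background noise.

```python
from dataclasses import dataclass, field


@dataclass
class AlertManager:
    """Illustrative sketch: suppress duplicate alerts, escalate repeat offenders.

    All names and thresholds here are assumptions for the example,
    not a real monitoring product's API.
    """
    dedup_window: float = 300.0   # seconds: repeats inside this window are duplicates
    escalate_after: int = 3       # this many occurrences means "stop paging, escalate"
    _last_seen: dict = field(default_factory=dict)
    _counts: dict = field(default_factory=dict)

    def handle(self, key: str, now: float) -> str:
        """Return what to do with one occurrence of alert `key` at time `now`."""
        last = self._last_seen.get(key)
        self._last_seen[key] = now
        self._counts[key] = self._counts.get(key, 0) + 1

        if self._counts[key] >= self.escalate_after:
            return "escalate"   # recurring alert: route to the team that can fix the cause
        if last is not None and now - last < self.dedup_window:
            return "suppress"   # duplicate within the window: do not re-page anyone
        return "notify"         # first (or sufficiently spaced) occurrence: page normally
```

The point is not the specific thresholds but the behavior: a flapping monitor generates one page and then an escalation, instead of hundreds of ignorable alarms a month.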
However, while these are technically easy, the harder problem is often the split in responsibilities. The monitoring group can point out repeated alerts – but if correcting the issue, or even determining whether it is a false alert, requires help from other teams (DBAs to increase cache size to prevent cache thrashing, for example), and that other group is not invested in the monitoring – then conditions can continue unchanged.
How do you influence people to be invested in the monitoring? The nuclear weapons facility unified the contractors, so that the people responding to the monitoring and those maintaining the monitoring and sensors were the same company. In the I.T. world, one common approach is to make the people who can control the accuracy of the monitoring feel the pain of the monitoring. Have the people who can adjust the DB cache receive the alerts about it. Similarly, if the monitoring failed to detect issues in advance due to insufficient instrumentation in the application – have the outage calls go to the developers directly. (A few 2 a.m. calls tend to motivate people to correct issues – but this assumes that they take the calls…)
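The "feel the pain" approach boils down to a routing rule: each alert category goes to the team that can fix its cause, not to a central screen nobody owns. A small sketch, with hypothetical category names and addresses invented for the example:

```python
# Hypothetical routing table: each alert category maps to the team
# that can actually act on it (addresses are made up for illustration).
ROUTES = {
    "db.cache": "dba-oncall@example.com",      # cache-sizing alerts page the DBAs
    "app.latency": "dev-oncall@example.com",   # application alerts page the developers
}


def route(alert_category: str, default: str = "noc@example.com") -> str:
    """Find the owning team for an alert via longest-prefix match on its category.

    Longest-prefix matching means "db.cache.hit_ratio" still reaches the DBAs
    without every sub-metric needing its own routing entry.
    """
    parts = alert_category.split(".")
    for i in range(len(parts), 0, -1):
        key = ".".join(parts[:i])
        if key in ROUTES:
            return ROUTES[key]
    return default  # unowned alerts fall back to the central operations team
```

The fallback route matters: anything landing there regularly is a signal that an alert category has no invested owner – exactly the Y-12 failure mode.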
Another approach is to surface alert statistics (alerts triggered, alert durations, alert severities, and so on) to management. If monitoring is a separate team that cannot control database configurations, but management is aware that the system reports that 84% of alerts are DB related, then appropriate actions can be directed to the right teams.
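Producing that kind of breakdown is straightforward. A minimal sketch, assuming alerts are available as (category, duration) records – the function name and record shape are inventions for this example:

```python
from collections import Counter


def alert_breakdown(alerts):
    """Summarize alerts by category as percentages, for a management report.

    `alerts` is a list of (category, duration_seconds) tuples - an assumed
    shape for illustration; a real system would pull this from its alert log.
    """
    counts = Counter(category for category, _duration in alerts)
    total = sum(counts.values())
    # Percentage of all alerts per category, largest categories first.
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}
```

Fed a month where 42 of 50 alerts were database related, this reports the "84% of alerts are DB related" figure mentioned above, which gives management a concrete basis for directing the DBA team to engage.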
Both of these approaches are really a manifestation of a cultural issue: the company culture has to include buy-in that monitoring is important. It naturally follows that all alerts are important – either as real alerts, or as false alerts that should be quickly rectified. We've seen successful adoption of a 'monitoring-first' culture both top-down, when mandated by management, and bottom-up, when one team thoroughly adopts monitoring and other teams see the resulting gains in development throughput and availability.
If you have seen your company adopt a monitoring-first culture, or it currently has one – let us know your thoughts on how that happened.
Steve is the founder of LogicMonitor.