Eleanor Roosevelt is reputed to have said "Learn from the mistakes of others. You can’t live long enough to make them all yourself." In that spirit, we're sharing a mistake we made so that you may learn.
This past weekend we had a service-impacting issue that affected a subset of customers on the East Coast for about 90 minutes. This happened despite the fact that, as you might imagine, we monitor our servers very thoroughly: error-level alerts (which are routed to people's pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we always have on-call staff responsible for reacting to alerts, who are always reachable.
All these conditions were true this weekend, and yet we still had an issue in which no one was alerted for over an hour after the first alerts were triggered. How was this possible?
It was (as most failures are) the result of multiple conditions interacting.
Firstly, the primary on-call engineer for the week (stage 1 in the escalation chain) was going to be unreachable for 24 hours. He knew this but, instead of moving himself to stage 2, simply informed his stage 2 counterpart that he'd be out of touch and relied on the escalations of LogicMonitor's alerting system. This meant that alerts would be delayed by 5 minutes, waiting for the 5-minute resend interval to escalate to the next stage, but he judged this an acceptable risk - our services rarely have issues, and an extra 5 minutes did not seem too long a delay in the unlikely event of an alert.
Secondly, the error that occurred was like a 'brown-out' condition. An error was triggered and an alert sent to stage 1 - but before the 5-minute escalation interval passed and the alert could be sent to the stage 2 engineer, the condition cleared, and service recovered for a few minutes. Then the condition recurred, an alert was sent, and it cleared again. This pattern - a few minutes of service impact that cleared before the alerts could be escalated to stage 2 - repeated over and over. Only when the alert happened to persist long enough to escalate to stage 2 was an alert finally sent to someone who could respond. (Which he did within minutes: he restarted the problematic service and restored normal operation.)
Thirdly, this occurred on a Sunday. Had it occurred mid-week, it would have been noticed regardless, via our dashboards or our HipChat integration, which posts alerts of error or critical severity to our Tech Ops chat room. (More on that later.)
How to prevent this in the future? This is where you get to learn from our mistakes. There are two ways to ensure that short alert/alert-clear cycles don't prevent alerts from reaching the right people at the right time. The first is a simple process fix: ensure your stage 1 alert recipients are actually going to be available. (Kind of a 'duh' process in retrospect...) The second: for service-affecting alerts, ensure that the Alert Clear Interval is set long enough that the alert will not clear before the escalation interval.
If we'd had the Alert Clear Interval set to at least 6 minutes, then even if the raw data was no longer triggering the alert, the alert would still have been in effect at the 5-minute escalation point, and the stage 2 engineer would have received the very first alert. (If he was not in front of a computer, he might well not have taken any action, as he'd have received an alert clear a minute later - but he would then have received a second alert the next time the issue occurred, a short time later, and would definitely have investigated and resolved the issue.)
Setting your Alert Clear Interval to at least as long as your escalation interval ensures that if your stage 1 drops the ball (or is even intentionally planning on dropping the ball), short 'flapping' alerts will still be brought to the attention of stage 2. Of course, if stage 1 acknowledges the alerts, schedules downtime, or otherwise deals with them, then stage 2 can remain unaware, as the alerts won't be escalated to them at all.
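The interaction between the two intervals can be sketched with a toy simulation. (This is hypothetical Python for illustration, not LogicMonitor code; the function name, the minute-by-minute sampling, and the two-stage model are our own assumptions.)

```python
# Toy simulation of a flapping ("brown-out") alert against a staged
# escalation chain. Intervals are in minutes; the condition is sampled
# once per minute.

def stages_notified(events, escalation_interval, clear_interval):
    """Return the set of escalation stages notified, given a timeline of
    (minute, condition_triggered) samples."""
    notified = set()
    alert_start = None   # minute the active alert began, if any
    clear_start = None   # minute the raw condition last went quiet
    for t, triggered in events:
        if triggered:
            clear_start = None
            if alert_start is None:
                alert_start = t
                notified.add(1)          # stage 1 is paged immediately
        elif alert_start is not None:
            if clear_start is None:
                clear_start = t
            if t - clear_start >= clear_interval:
                alert_start = clear_start = None   # alert clears
        if alert_start is not None and t - alert_start >= escalation_interval:
            notified.add(2)              # alert escalates to stage 2
    return notified

# 20 minutes of brown-out: 3 minutes triggered, 2 minutes quiet, repeating.
flapping = [(t, t % 5 < 3) for t in range(20)]

# Clear interval shorter than the 5-minute escalation: stage 2 never hears.
print(stages_notified(flapping, escalation_interval=5, clear_interval=1))  # {1}

# Clear interval of 6 minutes: the alert survives the gaps and escalates.
print(stages_notified(flapping, escalation_interval=5, clear_interval=6))  # {1, 2}
```

With the short clear interval, each quiet gap resets the alert before the 5-minute escalation ever fires, so only stage 1 is paged; with a 6-minute clear interval, the alert stays in effect through the gaps and reaches stage 2.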
(Another option: always ensure there are multiple people contacted at each stage. But that introduces the problem of diffusion of responsibility, so we don't recommend it in general.)
We'll be updating our Alert Response Best Practices document to reflect this, but figured we'd get the word out sooner via the blog.