Fortune Telling: The Art of Preventative Maintenance

Being on-call can suck. Getting paged is no fun at 3PM and even less fun at 3AM. It doesn’t do anything beneficial for your beauty rest, or your partner’s. The only thing worse than being on-call is handing off a known problem, a.k.a. a “time-bomb,” to the next person in line.

The LogicMonitor Operations team has an ongoing internal competition to see who can hand off on-call with the fewest active alerts. This encourages each on-call engineer to pay attention both to alerts that need immediate action and to warnings that indicate things are heading in the alerting direction. The more issues that are dealt with ahead of time, the more sleep and the less stress for everyone.

Working hard to eliminate issues is great, but what do you do when your finance team needs advance time to approve server purchases? Can you predict how long it will be until you need more capacity? Can you predict how long you have before that warning wakes you up in the middle of the night as a full-fledged problem? Enter forecasting. I’m not talking about palm reading and tarot cards. Rather, I’m referring to a sophisticated mathematical model that approximates future occurrences based on past data.

A few weeks ago it was my turn to be on-call. While hunting wild warnings in pursuit of a new record and a good night’s sleep, I noticed that a couple of machines were flapping in and out of alert mode for SSD utilization:
Screenshot 2015-11-17 15.53.50

This seemed normal, but since this application is known to be I/O-bound, I investigated further by looking at the trend over a longer time period:

Screenshot 2015-11-17 15.54.14

Over the longer time period, SSD utilization looked like it was steadily increasing. So I ran a forecast to see just how much time this system had before things broke:

Screenshot 2015-11-17 15.55.09
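
For anyone curious what a forecast like this boils down to, here is a minimal sketch. It is not the model behind the graph above; it assumes the simplest possible stand-in, a straight line fitted by least squares to past utilization samples and extended until it crosses an alert threshold. The function name `days_until_threshold`, the 90% threshold, and the sample numbers are all made up for illustration.

```python
from typing import Optional, Sequence


def days_until_threshold(
    days: Sequence[float],
    utilization: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Fit a least-squares line to (day, utilization) samples and estimate
    how many days after the latest sample the line reaches `threshold`.

    Returns None when the fitted trend is flat or decreasing, and 0.0 when
    the fitted line has already reached the threshold.
    """
    n = len(days)
    if n != len(utilization) or n < 2:
        raise ValueError("need at least two paired samples")

    mean_x = sum(days) / n
    mean_y = sum(utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, utilization))

    # Ordinary least-squares slope and intercept.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None  # no upward trend, so the line never reaches the threshold

    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - max(days))


# Hypothetical data: one utilization reading (percent) per day for a week.
sample_days = [0, 1, 2, 3, 4, 5, 6]
sample_util = [70.0, 70.5, 71.0, 71.5, 72.0, 72.5, 73.0]

# Growth of 0.5 points per day from 70% puts the 90% mark about 34 days out.
print(days_until_threshold(sample_days, sample_util, threshold=90.0))
```

A straight line is a crude assumption. Real utilization tends to be noisy and seasonal, so treat a number like this as a rough planning aid rather than a deadline.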

The forecast indicated that we didn’t have to worry about this breaking on Christmas Day, because by then Tech Ops would already have been paged. On the upside, I was now able to schedule maintenance on this machine, which I knew would be completed before Christmas. I may not have broken the record for the fewest active alerts, but at least I knew that whoever was on-call that week would sleep a little easier. As will I.

This is just a glimpse into our initial successes in testing forecasting – stay tuned for a much larger rollout coming soon.