As most readers know, Amazon rebooted virtually all EC2 instances in December. Amazon emailed customers to notify them, but not everyone read the emails, so in many cases Amazon performed the reboots on its own schedule while the customers were unaware.
For some SaaS companies, this resulted in many hours of downtime. For others, the impact was brief. What was the difference?
Time to Notification
One of the main differentiators was where monitoring was running. Some companies ran their own monitoring within EC2 – so when their instances were rebooted, so was their monitoring. If the monitoring did not recover automatically, they had several hours of outage before they even knew they had an issue. (With their monitoring down, they had nothing to alert them that the rest of their infrastructure was down too.)
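One way to guard against monitoring that goes down along with the instances it watches is a heartbeat check run from somewhere outside EC2. The sketch below is illustrative only: the function name, the monitor names, and the five-minute threshold are all assumptions, not something taken from any of the companies affected.

```python
from datetime import datetime, timedelta


def silent_monitors(last_heartbeat, now, max_silence=timedelta(minutes=5)):
    """Return the monitors that have not checked in within max_silence.

    last_heartbeat maps a (hypothetical) monitor name to the time it last
    reported in. The check is meant to run outside the infrastructure that
    the monitors themselves live in.
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > max_silence
    )


# Hypothetical data: one monitor checked in a minute ago, one went quiet.
now = datetime(2020, 1, 1, 12, 0)
heartbeats = {
    "monitor-a": now - timedelta(minutes=1),
    "monitor-b": now - timedelta(hours=3),
}
print(silent_monitors(heartbeats, now))  # ['monitor-b']
```

Anything this check reports is a reason to alert a human, because a silent monitor means the rest of the infrastructure may be down with nobody watching.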
Time to Remediation
Once they knew their servers were down, companies with monitoring hosted in EC2 knew they had an issue, but were blind to the status of their other systems until they had recovered the monitoring itself. In some cases we heard about, this added several hours to the recovery time. Those using monitoring as a service had immediate visibility into which systems and services had recovered and which still needed attention – so they could quickly focus on recovering their databases, or Redis systems, or whatever else was needed.
The cloud or your datacenter?
You want your monitoring to be available immediately, as soon as the power or network recovers, so you know what to focus on to restore service.
Organize to speed remediation
Practice
Practice, as the saying goes, makes perfect. Run DR drills. Spin up a cloud-based replica, and rudely shut it down. Make sure you know the order of your system’s dependencies. See how long it takes you to recover. Get used to using your monitoring to guide you in what to address next.
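Knowing the order of your dependencies is easier if it is written down somewhere a program can read it. As a rough sketch, the standard-library graphlib module (Python 3.9+) can turn a dependency map into a recovery order. The service names and the map below are hypothetical; substitute your own.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "web": {"api"},
    "api": {"database", "redis"},
    "database": set(),
    "redis": set(),
}


def recovery_order(depends_on):
    """Return services ordered so each one comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(DEPENDS_ON)
print(order)  # database and redis come first (in either order), then api, then web
```

If the map contains a circular dependency, static_order() raises graphlib.CycleError, which is itself worth finding out during a drill rather than during an outage.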
Now imagine how long it would take to recover without monitoring visibility.
Key takeaway: Move your monitoring away from a premises-based system, onto Monitoring as a Service, before your next datacenter- or cloud-impacting event.