How to minimize the impacts of the next Amazon reboot… or of your own datacenter failure

As everyone knows by now, Amazon rebooted virtually all EC2 instances in December. Amazon emailed customers to notify them in advance, but not everyone read those emails – so the reboots happened on Amazon's schedule, with many customers unaware.

For some SaaS companies, this resulted in many hours of downtime. For others, the impact was brief. What made the difference?

Time to Notification

One of the main differentiators was where monitoring was running. Some companies ran their own monitoring within EC2 – so when their instances were rebooted, so was their monitoring. If the monitoring did not recover automatically, they had several hours of outage before they even knew they had an issue. (With their monitoring down, nothing alerted them that the rest of their infrastructure was down, too.)

Time to Remediation

Once people knew their servers were down, those with monitoring hosted in EC2 were still blind to the status of their other systems until they recovered the monitoring itself. In some cases we heard about, this added several hours to the recovery time. Those using monitoring as a service had immediate visibility into which systems and services had recovered, which had not, and which needed attention – so they could quickly focus on recovering their databases, their Redis systems, or whatever else was needed.
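
Even a simple sweep can approximate that kind of post-outage visibility, as long as it runs from a host outside the affected infrastructure. Here is a minimal sketch in Python – the hostnames, ports, and service names are placeholders, not real endpoints:

```python
#!/usr/bin/env python3
"""Post-outage sweep, run from a host *outside* the affected
infrastructure. Hostnames, ports, and service names are placeholders."""
import socket

# (service, host, port) - all hypothetical endpoints.
SERVICES = [
    ("web",      "www.example.com",            443),
    ("api",      "api.example.com",            443),
    ("database", "db.internal.example.com",    5432),
    ("redis",    "cache.internal.example.com", 6379),
]

def is_up(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

results = {name: is_up(host, port) for name, host, port in SERVICES}
print("Recovered:      ", [n for n, ok in results.items() if ok])
print("Needs attention:", [n for n, ok in results.items() if not ok])
```

A real monitoring service does far more than this (service-level checks, history, alert routing), but the principle is the same: the checker must not share fate with the systems it checks.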

The cloud or your datacenter?

Of course, exactly the same issues apply to outages of your own datacenter. Even the best-run datacenters can lose power or network connectivity. With A/B power and redundant networks it should never happen – but it does. Monitoring as a service is just as important whether you have one datacenter (roughly analogous to a single EC2 region) or several. You want your monitoring outside all of your datacenters – otherwise, by the law of toast falling jam side down, the one hosting the monitoring will no doubt be the one that fails.

And you want your monitoring available immediately, as soon as the power or network recovers, so you know what to focus on to restore service.

Organize to speed remediation

Your infrastructure has dependencies. There is no point in trying to bring up your database if its data is stored on a NetApp that is in a cluster failure state. So have a view in your monitoring that groups things functionally, and lets you assess whether the prerequisite systems are up.
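
To make that concrete, here is a minimal sketch of encoding those dependencies and deriving a recovery order with a topological sort. The service graph is entirely hypothetical – substitute your own infrastructure's dependencies:

```python
"""Recovery-order sketch using a topological sort (Python 3.9+).
The service graph below is hypothetical - substitute your own."""
from graphlib import TopologicalSorter

# Map each service to the services it depends on; e.g. the database
# can't come up until the NetApp storage cluster is healthy.
DEPENDS_ON = {
    "storage":  [],                    # NetApp cluster
    "cache":    [],                    # e.g. Redis
    "database": ["storage"],
    "app":      ["database", "cache"],
    "web":      ["app"],
}

# static_order() yields every service after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print("Bring services up in this order:", order)
# e.g. ['storage', 'cache', 'database', 'app', 'web']
```

A functional-group view in your monitoring gives you the same information visually: confirm the prerequisites first, then work up the stack.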

Practice

Practice, as the saying goes, makes perfect. Run DR drills. Spin up a cloud-based replica and rudely shut it down. Make sure you know the order of your system's dependencies. See how long it takes you to recover. Get used to letting your monitoring guide what to address next.
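
If your replica runs in EC2, even the rude shutdown can be scripted as part of the drill. A sketch using boto3, the AWS SDK for Python – the instance IDs and region here are placeholders:

```python
"""DR drill sketch: rudely stop a cloud replica, then time the recovery.
Instance IDs and region are placeholders, not real resources."""
import time
import boto3  # AWS SDK for Python

REPLICA_IDS = ["i-0123456789abcdef0"]          # hypothetical instance IDs
ec2 = boto3.client("ec2", region_name="us-east-1")

# Force-stop: no graceful drain - much like an unplanned reboot.
ec2.stop_instances(InstanceIds=REPLICA_IDS, Force=True)
drill_start = time.monotonic()

# Now follow the runbook: restart the instances, then bring services
# back in dependency order, letting your monitoring guide each step.
ec2.start_instances(InstanceIds=REPLICA_IDS)
ec2.get_waiter("instance_running").wait(InstanceIds=REPLICA_IDS)

elapsed = time.monotonic() - drill_start
print(f"Instances running after {elapsed:.0f}s; "
      "keep the clock running until full service recovery.")
```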

Now imagine how long it would take to recover without monitoring visibility.

Key takeaway: Move your monitoring off premises-based systems and onto Monitoring as a Service before your next datacenter- or cloud-impacting event.