When Lightning Strikes Your Cloud, Good Monitoring Means Great Disaster Recovery

Kablooee!  That was the sound I (and many others) heard coming from one of Amazon Web Services' (aka, the “cloud”) availability zones in Northern Virginia on June 30th (https://venturebeat.com/2012/06/29/amazon-outage-netflix-instagram-pinterest/, https://gigaom.com/cloud/some-of-amazon-web-services-are-down-again/).  The “sound” was a weather-driven power failure at one of Amazon’s data centers.  And what happens when a data center loses power (and, for unspecified reasons, the UPSs and generators don’t kick in)?  Crickets.  Computers turn off.  Lights stop blinking.  The “sounds of silence” (but not the way Simon and Garfunkel sing about it).

By this point, you either had your monitoring outside your data center and were notified about the outage, or you only became aware belatedly and regretted the decision not to put monitoring outside.  But what happens after power has been restored?  Well, that’s when good monitoring comes into play yet again…

As much hype as there has been surrounding “clouds” and “cloud computing” (and for good reason – they are changing the face of infrastructure), a “cloud” is still a bunch of computers sitting in some data center – somewhere – requiring power, cooling, etc.

One of the nice things about going with a cloud service for your infrastructure is that you are largely removed from needing to monitor hardware – this is all (presumably) done for you.  No worrying about fan speeds, system board temperatures, power supplies, RAID status, and the like.  However, this doesn’t alleviate the need for good, thorough monitoring of your application “stack” – everything else that makes your applications go: databases, JVM statistics, Apache status, system CPU, disk I/O performance, system memory, application response time, load balancer health, and so on.  This is the real guts of your organization – and these are the things you need to know are working after a reboot.  And whether you are in the cloud or not, at some point all your systems are going to be rebooted.  I guarantee it, so plan for it.
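To make that concrete, here is a minimal sketch of what checking a few of those stack components might look like – hostnames, ports, and thresholds are all placeholders, and this is not how any particular monitoring product implements its checks:

```python
import shutil
import socket

def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def disk_check(path="/", min_free_fraction=0.05):
    """Return True if the filesystem holding `path` still has free space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

def run_checks(checks):
    """Run each named check; return the names of the checks that failed."""
    return [name for name, check in checks.items() if not check()]

if __name__ == "__main__":
    # Hypothetical stack -- these hostnames are placeholders, not real systems.
    checks = {
        "apache": lambda: tcp_check("web01.example.com", 80),
        "mysql":  lambda: tcp_check("db01.example.com", 3306),
        "disk":   lambda: disk_check("/"),
    }
    print("FAILED:", run_checks(checks))
```

A real monitoring system goes much deeper than “is the port open” – it pulls Apache scoreboard stats, JVM heap usage, query latency, and so on – but the shape is the same: a named set of checks, evaluated continuously, with failures surfaced as alerts.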

So what happens when your environment does reboot?  Cloud or not, when power is restored you need to make sure all the components of your software stack are back up – across all of your systems.  Hopefully your disaster recovery plan does not revolve around a single “hero” sysadmin who merely needs to be pulled away from an IRC chat, an MW3 campaign, or the bar (of the three, the last is the most worrisome).  Any available admin should be able to identify, via your monitoring system, which components of the stack came back up and are functioning, and which did not.  Your monitoring dashboard, listing all machines and services, is your eyes and ears – without it you are blind and deaf (so to speak).  When all alerts have cleared from monitoring, you can be confident that service has been completely restored.  Good monitoring is by far the greatest safeguard for making sure all systems are functioning again after a reboot – and in the shortest amount of time.
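The “which hosts and services came back up” question boils down to sweeping an inventory of every machine and every service expected on it.  A minimal sketch (the inventory, hostnames, and ports here are hypothetical, and a real dashboard tracks far richer state than reachable/unreachable):

```python
import socket

# Hypothetical inventory: each host and the services expected on it.
INVENTORY = {
    "web01.example.com": {"apache": 80, "ssh": 22},
    "db01.example.com":  {"mysql": 3306, "ssh": 22},
}

def port_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def stack_status(inventory, probe=port_open):
    """Probe every expected service; return {(host, service): is_up}."""
    return {(host, svc): probe(host, port)
            for host, services in inventory.items()
            for svc, port in services.items()}

def all_clear(status):
    """True only when every expected service across every host is up."""
    return all(status.values())
```

This is the “all alerts have cleared” property in miniature: recovery is not done when one box pings, it is done when every (host, service) pair in the inventory reports healthy.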

The take-home: deploy good monitoring.  Make sure all aspects of your stack are monitored.  All of them.  When all of your machines are rebooted (at 3AM), how do you know every part of your stack is back up and functioning?  Good monitoring.  Good monitoring = LogicMonitor.  Check us out.  We eat our own dog food (see the next article on the “Leap Second” bug for an account of this), and we are a SaaS service, meaning that if all your systems do reboot, your monitoring system is not among them.  We can help you recover faster from any outage, guaranteed.