Disaster Recovery Journal
By Steve Francis
In late October of 1980, the precursor to the modern internet (ARPAnet) failed for an agonizing four hours. Considered to be the first major network outage, the failure was scrutinized by engineers in hopes of preventing future shutdowns. A lot has changed since then. The modern internet has nearly four billion users participating in more than $2 trillion of ecommerce. Yet, we still see frequent outages. In January of 2017, United had to ground all domestic flights because of an outage, leading to customer fury and bad PR. United is not alone: across industries, organizations like Starbucks, Facebook, and Microsoft have also suffered downtime in the last 18 months.
Outages are costly. Organizations lose revenue not just from the immediate loss of business, but also from degraded consumer trust. Consumers are often unforgiving: according to BCG, 28 percent of customers never return to a company's website if it doesn't perform well. The longer an outage lasts, the more unhappy customers it creates, and 33 percent of organizations estimate that every hour of downtime leads to $1M in lost revenue.
So, what causes these failures and why should organizations care?
Understanding and Avoiding Outages
To better avoid outages – and lost revenue – it’s vital to understand how they work. A 2016 study from the Ponemon Institute identified four key factors that contribute to outages:
- UPS failure
- Malicious actors
- Accidental insider actions
- Environmental issues
Crucially, two of the top factors provide advance warning of failure: UPS batteries degrade slowly and rack temperatures rise steadily. Both conditions can trigger alerts well before anything fails, which means improved monitoring can prevent these outages.
In the other cases, outages can occur suddenly and without warning. Here it’s vital to detect the failure quickly and to know which systems are affected. Once the failure is identified, organizations should have processes in place to rapidly mitigate the issue, reducing downtime and lost revenue.
In both cases, it is clear that proper monitoring is one of the most important things organizations can do to reduce outage susceptibility.
6 Fundamentals of Proper Monitoring
First, organizations should monitor their entire IT infrastructure from a single pane of glass. By monitoring all aspects of the devices and systems you control (hardware and software, on-premises and in the cloud) you’ll have the full picture of system health at all times. Maintaining, learning, and operating separate tools is too difficult; a single tool is much easier. Further, when applications span many systems, hardware, and cloud services, it is easier to isolate the cause of a problem when you can correlate information across systems.
Second, spot trends. With a full suite of monitoring, organizations can track vital system trends, from rack temperature to UPS battery issues and power draw to increasing database cache misses, so they can proactively identify future failures and fix them before outages occur.
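As a rough illustration of trend spotting, the sketch below fits a straight line to recent readings of a rising metric, such as rack temperature, and estimates how long until it crosses an alert threshold. The function name, the sample values, and the choice of a simple least-squares fit are assumptions made for illustration; they are not taken from any particular monitoring product.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    hours: Sequence[float],
    values: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Estimate hours from the last sample until a rising metric hits threshold.

    Fits a least-squares line through (hours, values). Returns 0.0 if the
    fitted line is already at or above the threshold, and None if the metric
    is flat or falling (no crossing is projected).
    """
    n = len(hours)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")

    mean_t = sum(hours) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(hours, values)) / var_t
    intercept = mean_v - slope * mean_t

    # Fitted value at the most recent sample time.
    current = slope * hours[-1] + intercept
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical rack inlet temperatures (degrees C), one reading per hour.
eta = hours_until_threshold([0, 1, 2, 3], [24.0, 25.0, 26.0, 27.0], threshold=32.0)
```

An alert raised when the projected time drops below a few hours gives staff a window to act. A falling metric such as UPS battery capacity could be handled the same way by negating both the readings and the threshold.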
Third, expect outages. Even the best systems will suffer them. Rather than trying to prevent every outage and then reacting slowly when an unexpected failure hits, focus on outage detection and rapid remediation. Design systems to expect failures and to tolerate, recover from, or accommodate them. (Netflix runs code that deliberately kills a small number of production systems to ensure developers design applications that expect outages and can keep functioning.)
Fourth, make sure your monitoring solution supports out-of-band communications. Otherwise, if your datacenter goes down, your monitoring solution won’t be able to tell you!
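One common way to get this kind of independence is a heartbeat check that runs outside the datacenter: each site reports in periodically, and silence is treated as the alarm. The sketch below is a minimal version under that assumption. The class name, site names, and timeout are hypothetical, and a real deployment would also need an alert path that does not depend on the failed site.

```python
import time
from typing import Dict, List, Optional


class HeartbeatWatcher:
    """Tracks the last heartbeat per site and reports sites that have gone quiet.

    Intended to run somewhere that does not share power or network with the
    sites it watches.
    """

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record(self, site: str, now: Optional[float] = None) -> None:
        # Store the time of the latest heartbeat received from this site.
        self._last_seen[site] = time.monotonic() if now is None else now

    def silent_sites(self, now: Optional[float] = None) -> List[str]:
        # A site is silent if its last heartbeat is older than the timeout.
        current = time.monotonic() if now is None else now
        return sorted(
            site
            for site, seen in self._last_seen.items()
            if current - seen > self.timeout_seconds
        )


# Explicit timestamps are passed here only to keep the example deterministic.
watcher = HeartbeatWatcher(timeout_seconds=60)
watcher.record("dc-east", now=0)
watcher.record("dc-west", now=50)
quiet = watcher.silent_sites(now=100)
```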
Fifth, automate monitoring setup as you roll out new devices. Using an “infrastructure as code” methodology, you can vastly reduce the manual setup required, improving efficiency and making sure every device is monitored from the second you roll it out.
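To make the idea concrete, the sketch below derives a monitoring configuration directly from a device inventory, so that adding a device to the inventory is enough to get it monitored. The device kinds, check names, and the fallback check are all illustrative assumptions rather than the schema of any real tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Device:
    name: str
    kind: str


# Hypothetical mapping from device kind to the checks it should receive.
DEFAULT_CHECKS: Dict[str, List[str]] = {
    "ups": ["battery_health", "load_percent"],
    "rack": ["inlet_temperature", "power_draw"],
    "database": ["cache_miss_rate", "replication_lag"],
}

# Unknown device kinds still get a basic reachability check.
FALLBACK_CHECKS: List[str] = ["ping"]


def build_monitoring_config(devices: Iterable[Device]) -> Dict[str, List[str]]:
    """Return a mapping of device name to the checks that should run on it."""
    return {
        device.name: list(DEFAULT_CHECKS.get(device.kind, FALLBACK_CHECKS))
        for device in devices
    }


inventory = [Device("ups-01", "ups"), Device("printer-7", "printer")]
config = build_monitoring_config(inventory)
```

Because the configuration is generated rather than hand-edited, it can be kept in version control and regenerated on every rollout.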
Finally, integrate your monitoring with key IT systems. To illustrate, let’s examine two primary use cases:
1. Alerts. Integrating with programs such as Slack means your alerts use the communication system your IT staff is already using, enabling them to see alerts faster.
2. Trouble tickets. Integrating with ITSM solutions means trouble tickets can be created automatically, which shortens the time it takes to respond.
These are not the only integration use cases, just two examples of how connecting your monitoring solution to your other IT tools speeds up response: alerts and tickets land in tools your staff already has workflows around.
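The two integrations above might look roughly like the following sketch. Slack incoming webhooks accept a JSON body with a "text" field; everything else here is an assumption for illustration, including the Alert fields, the ticket payload shape, and the priority mapping, since each ITSM product defines its own API.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Alert:
    device: str
    severity: str
    message: str


def slack_payload(alert: Alert) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"[{alert.severity.upper()}] {alert.device}: {alert.message}"}


def ticket_payload(alert: Alert) -> Dict[str, str]:
    """Build a ticket body. Field names are hypothetical, not a real ITSM schema."""
    return {
        "title": f"{alert.device}: {alert.message}",
        "priority": "P1" if alert.severity == "critical" else "P3",
        "source": "monitoring",
    }


def post_json(url: str, payload: Dict[str, str]) -> int:
    """POST a JSON payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


alert = Alert(device="ups-01", severity="critical", message="battery at 12%")
chat_body = slack_payload(alert)
ticket_body = ticket_payload(alert)
# Sending is left to the caller, who supplies their own webhook or ticket URL:
# post_json(webhook_url, chat_body)
```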
Preparing for Outages
Unfortunately, outages are here to stay. Increasingly complex and dynamic systems have led to more frequent and more expensive failures in recent years – so it’s vital to have a contingency plan for when your system goes down. Implementing proper monitoring is the best way to preempt many outages and mitigate the impact of the rest.