No one likes to talk about outages. They’re horrible to experience as an employee and they take a heavy toll on customer confidence and future revenue. But they do happen. Even publicly traded tech powerhouses, such as eBay and Microsoft, which have more technical resources than you’ll ever have, fall prey to outages. And when they do, they are closed for business, much to the chagrin of their shareholders and executive teams.
It’s not so much a question of whether an outage will occur in your company but when. The secret to surviving outages is to get better at handling them and to learn from the mistakes of others. Nobody is perfect all the time (my current employer, LogicMonitor, included), but I hope that by talking about these mistakes, we can all begin the hard work required to avoid them in the future.
4 Massive Mistakes Companies Make Handling Outages:
- Not having a tried-and-true outage response plan
Does this sound familiar?
An outage occurs. A barrage of emails is fired to the Tech Ops team from Customer Support. Executives begin demanding updates every five minutes. Tech team members all run to their separate monitoring tools to see what data they can dredge up, often only seeing a part of the problem. Mass confusion ensues as groups point their fingers at each other and Sys Admins are unsure whether to respond to the text from their boss demanding an update or to continue to troubleshoot and apply a possible fix. Marketing (“We’re getting trashed on social media! We need to send a mass email and do a blog post telling people what is happening!”) and Legal (“Don’t admit liability!”) jump in to help craft a public-facing response. Cats begin mating with dogs and the world explodes.
OK, that last part may not happen. But if the rest sounds familiar, your company might be making Mistake #1.
A well-formed process for handling outages must define who is accountable for resolving issues, who is in the escalation path and who is responsible for communicating about issues. It also includes a post-mortem process for analyzing the root cause of the outage and addressing any gaps, which can range from building redundancy into systems to changing monitoring settings so that issues can be caught and resolved before an outage recurs.
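For teams that like to keep their plan next to their tooling, the three roles above (who is accountable, who is in the escalation path, who communicates) can be written down as data rather than left as tribal knowledge. What follows is only a minimal sketch in Python: the role names, the 15- and 45-minute thresholds, and the names OutagePlan, EscalationStep and contacts_to_page are all hypothetical placeholders, not part of any real product or of LogicMonitor.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the outage was declared
    contact: str        # hypothetical role name, not a real person


@dataclass(frozen=True)
class OutagePlan:
    incident_owner: str   # accountable for resolving the issue
    communicator: str     # responsible for customer-facing updates
    escalation_path: Tuple[EscalationStep, ...]

    def contacts_to_page(self, minutes_elapsed: int) -> List[str]:
        """Return everyone who should have been paged by this point."""
        return [
            step.contact
            for step in self.escalation_path
            if step.after_minutes <= minutes_elapsed
        ]


# Illustrative values only; every name and threshold here is an assumption.
PLAN = OutagePlan(
    incident_owner="on-call sysadmin",
    communicator="support lead",
    escalation_path=(
        EscalationStep(0, "on-call sysadmin"),
        EscalationStep(15, "tech ops manager"),
        EscalationStep(45, "vp engineering"),
    ),
)

if __name__ == "__main__":
    # Twenty minutes in, the first two steps of the path have been reached.
    print(PLAN.contacts_to_page(20))
```

The value is less in the code than in the exercise: if you cannot fill in these fields without an argument, the plan is not yet defined.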
- Lack of communication about the outage with impacted customers
In the heat of trying to get your company back online, it’s easy to “go dark.” Unfortunately, not communicating with customers often causes a host of negative consequences, including a flood of support calls, longer hold times, and poor customer experience, and it can produce a perception that your company is unresponsive, untrustworthy or not in control.
The fault often lies in poor or missing lines of communication between customer-facing groups and your Tech Ops team. Not having systems (blogs, forums, mass email, RSS feeds, etc.) with which to notify customers of issues can be a big problem. Or companies don’t communicate about the outage based on the mistaken belief that customers might not notice the issue (customers will notice) and that damage will somehow be minimized (lack of communication only makes it worse).
Tip: Ensure you have a defined communication process in place with clearly assigned responsibilities for during and after the outage. Make sure everyone involved is familiar with it. Don’t just store it on your company’s web site, because that may not be accessible during the outage.
- Playing the blame game
Blaming a partner or vendor is a tactic companies sometimes employ in responding to outages. It rarely proves successful, in part because customers see it as abdicating responsibility for a decision the company ultimately made. (Who chose to depend on that vendor or partner? You did.) By not accepting responsibility, the company is also not taking steps to prevent recurrence of the problem, which is unlikely to be a crowd pleaser. Taking broader responsibility and instituting a review of the vendors involved, setting up redundancy or reviewing processes that might have contributed to the issue are all better options.
- Not knowing they are having an outage in the first place
The worst way to hear about an outage is to have your customers tell you (or possibly to have your boss tell you). Keeping your monitoring infrastructure in the same datacenter it monitors is an excellent way to have outages that you don’t get an alert about, because the monitoring is offline too. That holds even if your datacenter is Amazon, which is what happened to Loggly during an extended outage a few years ago.
The best way is to get an alert from a unified SaaS-based platform like LogicMonitor that tells you if your whole datacenter is down. Your monitoring platform should provide a complete view of websites (including performing synthetic transaction checks), applications, databases, network, servers, virtualization and the Cloud (wherever your IT infrastructure is housed), so that you can proactively fix issues before customer experience is impacted.
Below is an example of what an outage looks like in LogicMonitor graphs using SiteMonitor™, which is free with every LogicMonitor account. (Note: eBay is not currently a LogicMonitor customer.)
Improving management of outage incidents can produce better outcomes for your company’s employees, customers and shareholders. It won’t be easy. But it will be worth it. And it all starts with avoiding some basic mistakes that others have made before you.
Want more on this topic? Read the E-Book
The Top 10 Mistakes Companies Make Handling Outages and How to Avoid Them All