If your infrastructure has to be up at all times (or as much as possible), how do you best achieve that? With an Active/Active configuration, where all parts of the infrastructure are used all the time, or with an N+1 configuration, where idle resources wait to take over in the event of a failure?
The short answer is it doesn’t matter unless you have good monitoring in place.
The risk with Active/Active is that load does not scale linearly. If you have two systems running at 40% load, that does not mean that one will be able to handle the load of both and run at 80%. More likely you will hit an inflection point where an unanticipated bottleneck appears, be it CPU, memory bandwidth, disk IO, or some system that is providing external API resources. It can even be the power system. If servers have redundant power supplies, and each PSU is attached to a separate Power Distribution Unit (PDU), the critical load for each PDU is now 40% of its rating. If one circuit fails, all load switches to the other PDU, and if that PDU is now asked to carry more than 80% of its rating, overload circuits will trip, leading to a total outage. There is some speculation that a cascading failure of this type was behind the recent Amazon EC2 outage.
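The PDU arithmetic above can be sketched in a few lines of Python. This is only an illustration, not a capacity-planning tool: the function names are hypothetical, the PDUs are assumed to have identical ratings, and the failed PDU's load is assumed to split evenly across the survivors. That is a best case, since real load often does not redistribute this cleanly.

```python
from typing import List

# From the text: a PDU asked to carry more than 80% of its rating trips.
TRIP_LIMIT_PCT = 80


def load_after_failure(loads_pct: List[float], failed: int) -> List[float]:
    """Return each surviving PDU's load (percent of rating) after one fails.

    Assumes identical ratings and an even split of the failed PDU's load
    across the survivors (a simplifying, best-case assumption).
    """
    survivors = [load for i, load in enumerate(loads_pct) if i != failed]
    if not survivors:
        raise ValueError("no surviving PDU to carry the load")
    extra = loads_pct[failed] / len(survivors)
    return [load + extra for load in survivors]


def survives_any_single_failure(
    loads_pct: List[float], limit_pct: float = TRIP_LIMIT_PCT
) -> bool:
    """True if no single PDU failure pushes a survivor above the trip limit."""
    return all(
        max(load_after_failure(loads_pct, i)) <= limit_pct
        for i in range(len(loads_pct))
    )
```

With two PDUs at 40% each, the survivor lands exactly on the 80% limit. At 45% and 40%, a failure of either one pushes the other to 85%, and the breaker trips.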
The risk with N+1 is that, by definition, you have a system sitting idle – so how do you know it’s ready for use? Oftentimes, just when you need it, it’s not ready.
Of course, the only 100% certain way of knowing your infrastructure can deal with failures is to have operational procedures in place that test everything: actually fail over.
But in between the regularly scheduled failover events, you need to monitor everything. Monitor the PDUs, and if you are running Active/Active, set the thresholds to 40%. Monitor your standby nodes, and monitor the synchronization of the configuration of the standby nodes. (If you use NetScalers, F5 BIG-IPs, or other load balancers, you do not want to experience a hardware failure on your primary node, only to fail over to a secondary node that is unaware of the configuration of any of your production VIPs.) Monitor all systems for RAID status, monitor secondary router paths for BGP peerings, and monitor backup switches for changes in link status, temperature sensors, and memory pool usage and fragmentation.
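As a rough sketch of two of these checks, the Python below flags PDUs above the Active/Active threshold and lists VIPs that exist on a primary load balancer but not on its standby. The function names, PDU names, and VIP strings are all hypothetical, and collecting the actual readings (via SNMP, a vendor API, or a monitoring product) is deliberately left out.

```python
from typing import Dict, Iterable, List

# From the text: in Active/Active, alert once a PDU passes 40% of its rating.
ACTIVE_ACTIVE_PDU_THRESHOLD_PCT = 40


def pdu_alerts(
    pdu_loads_pct: Dict[str, float],
    threshold_pct: float = ACTIVE_ACTIVE_PDU_THRESHOLD_PCT,
) -> List[str]:
    """Return one alert message per PDU whose load exceeds the threshold."""
    return [
        f"{name}: {load}% exceeds {threshold_pct}% threshold"
        for name, load in sorted(pdu_loads_pct.items())
        if load > threshold_pct
    ]


def vips_missing_on_standby(
    primary_vips: Iterable[str], standby_vips: Iterable[str]
) -> List[str]:
    """Return VIPs configured on the primary that the standby does not know."""
    return sorted(set(primary_vips) - set(standby_vips))
```

For example, `pdu_alerts({"pdu-a": 38, "pdu-b": 43})` reports only `pdu-b`, and a non-empty result from `vips_missing_on_standby` means a failover right now would drop production traffic for those VIPs.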
Notice that virtually all the outage announcements companies issue promise to improve their monitoring to prevent such issues.
At LogicMonitor, we suggest you implement thorough monitoring first, to avoid as many of these issues as you can in the first place. LogicMonitor makes that simple.