When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance. It’s not just an inverse relationship, however. It’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals to reduce your complexity (which will end up increasing your availability.)
The rule to adopt is If you don’t understand something well enough that it seems simple to you (or your staff), even in it’s failure modes, you are better off without it.
Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.
With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures, but it’s easy to know when there is a failure.
With 2 of everything, connected the right way, you can keep running with one failure, but you may not be aware of the failure.
So is it a good idea to add more connections, and plan to be able to tolerate multiple failures? Not usually. For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch. In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good. If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer fail over, and everything is still running correctly. It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?
This would allow the site to survive two simultaneous unrelated failures – one switch and the one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the 4 possible states. There are now 4 possible traffic paths instead of 2 – so more testing needed, more maintenance needed on any change, etc. The benefit seems outweighed by the complexity.
The same concept of “if it seems complex, it doesn’t belong”, can be applied to software, too. Load balancing, whether via an appliance such as Citrix Netscalers, or software such as ha_proxy, is simple enough to most people nowadays. The same is not generally true of clustered file systems, or DRDB. If you truly need these technologies, you better have a thorough understanding of them, and invest the time to create all the failure modes you can, and train your staff so that it is not complex for them to deal with any of the failures.
If you have a consultant come in and set up BGP routing, but no one on your NOC or on call staff knows how to do anything with BGP, you just greatly reduced your site’s operational availability.
The “Complexity Filter” can be applied to monitoring systems, as well. If your monitoring system stops, and you don’t have immediate staff available to troubleshoot the restart of the service processes; or the majority of your staff cannot easily interpret the monitoring, or create new checks, or use it to see trends over time – your monitoring is not contributing to your operational uptime. It is instead a resource sink, and is likely to bite you when you least expect it. Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.
If it seems complex – it will break. Learn it until it’s not complex, or do without it.