The dangers of dead code in infrastructure and monitoring

It is relatively well understood in development that dead code (code no longer in use, whether due to refactoring or to changes in features or algorithms) should be removed from the code base. Left in place, it introduces a risk of bugs and makes it much harder for new developers to come up to speed, since they have to understand the dead code and work out whether it is in fact still in use. It is less well understood that the same principle applies to the rest of the IT infrastructure.

A trivial example: removing an unused virtual IP from a load balancer after a site is retired. Superficially, there is zero risk or cost in leaving this configuration in place. Indeed, it is arguably less risky, and less work, to leave it: as any admin knows, most outages occur when change is introduced, so change is often avoided. But what happens when a new person has to be involved in the load balancer’s operation? Or a year from now, when the details of which sites are active and which are not have grown fuzzy? Unfortunately, there is unlikely to be a big sign indicating which sites are dead. If at that point there is a move to new address space, considerable effort will go into migrating the unused virtual IP: address allocations, DNS changes, reconfiguration of the VIP, updates to tracking databases, and so on. All of this is coupled with uncertainty as to whether the changes were successful, since it is hard to verify that a site with no traffic is working correctly.

Couldn’t you simply determine later whether the site is still needed? Not easily. Perhaps the site is only accessed at quarter end, or at year end.
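
One pragmatic hedge is to watch per-VIP traffic counters over a window longer than any business cycle before declaring anything dead. Below is a minimal sketch of the idea, assuming an HAProxy-style CSV stats endpoint; the URL, state file, and threshold are placeholders, and the same approach works with any balancer that exposes a cumulative session counter per VIP.

```python
# Sketch: flag load-balancer frontends (VIPs) that have served zero sessions
# over a window longer than any business cycle, so that a site used only at
# quarter end or year end is not declared dead prematurely.
#
# Assumptions, not from the post: the balancer is HAProxy with its CSV stats
# endpoint enabled, and STATS_URL points at it.

import csv
import io
import json
import time
import urllib.request

STATS_URL = "http://lb.example.com:8404/stats;csv"  # placeholder endpoint
STATE_FILE = "vip_session_counts.json"              # snapshots between runs
QUIET_DAYS = 400  # longer than a year, to cover year-end-only sites

def fetch_session_totals():
    """Return {frontend_name: cumulative_session_count} from the stats CSV."""
    with urllib.request.urlopen(STATS_URL) as resp:
        text = resp.read().decode("utf-8").lstrip("# ")
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("svname") == "FRONTEND":
            totals[row["pxname"]] = int(row["stot"])
    return totals

def main():
    now = time.time()
    try:
        with open(STATE_FILE) as f:
            history = json.load(f)  # {name: {"stot": n, "last_change": ts}}
    except FileNotFoundError:
        history = {}

    # Any movement in the counter (including a reset on reload) counts as
    # activity, which errs on the side of keeping the VIP.
    for name, stot in fetch_session_totals().items():
        prev = history.get(name)
        if prev is None or stot != prev["stot"]:
            history[name] = {"stot": stot, "last_change": now}

    with open(STATE_FILE, "w") as f:
        json.dump(history, f)

    cutoff = now - QUIET_DAYS * 86400
    for name, record in sorted(history.items()):
        if record["last_change"] < cutoff:
            print(f"no sessions in {QUIET_DAYS}+ days, review for removal: {name}")

if __name__ == "__main__":
    main()
```

Even then, the output is a list of candidates for a human to review, not a deletion script.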

Monitoring system sprawl can be another form of dead-code creep. Your snazzy new storage system comes with its own monitoring system, so the storage department relies on that instead of consolidating into the enterprise monitoring system. The same goes for the new database, the new wireless access points, and so on. Soon you have 10, 20, even over 100 different monitoring systems, each with a siloed view of the data. This, of course, makes tracing performance from user, to application, to guest VM, to hypervisor, to switch, to storage, and back, while correlating all the elements and additional workloads, non-trivial to say the least. (It usually involves finger pointing and, after several days and the involvement of senior management, a war room to finally determine the root cause.)
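
Consolidation does not have to mean ripping everything out at once. A thin adapter per point system, translating its metrics into one shared namespace and forwarding them to the central collector, is often enough to start. The sketch below is illustrative only: the two adapters are hypothetical stand-ins for vendor APIs, and the forwarding target is Graphite’s plaintext protocol, chosen here simply as a familiar example of a central collector.

```python
# Sketch: thin adapters that translate each point monitoring system's metrics
# into one shared namespace and forward them to a central collector. The two
# adapters are hypothetical placeholders; the forwarder speaks Graphite's
# plaintext protocol ("metric value timestamp").

import socket
import time
from typing import Iterator, NamedTuple

class Metric(NamedTuple):
    path: str       # dotted name in the shared namespace
    value: float
    timestamp: float

class StorageArrayAdapter:
    """Placeholder for the storage vendor's point monitoring system."""
    def poll(self) -> Iterator[Metric]:
        now = time.time()
        # In reality: call the vendor's API or CLI here.
        yield Metric("storage.array1.read_iops", 1200.0, now)
        yield Metric("storage.array1.write_latency_ms", 3.4, now)

class DatabaseAdapter:
    """Placeholder for the database's built-in monitoring."""
    def poll(self) -> Iterator[Metric]:
        now = time.time()
        yield Metric("db.orders.active_connections", 87.0, now)

def forward(metrics, host="graphite.example.com", port=2003):
    """Ship metrics to the central collector in Graphite plaintext format."""
    with socket.create_connection((host, port)) as sock:
        for m in metrics:
            sock.sendall(f"{m.path} {m.value} {int(m.timestamp)}\n".encode())

if __name__ == "__main__":
    for adapter in (StorageArrayAdapter(), DatabaseAdapter()):
        forward(adapter.poll())
```

Each adapter is a few dozen lines, and once its metrics land in the shared system, the point system it wraps becomes a candidate for retirement.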

Development is usually goaled on delivering features – or causing change. IT operations are usually goaled on keeping things up and available – or minimizing change.

If enterprise IT is to become more agile and deliver services that align with the expectations of its consumers (who are now all used to App Store instant provisioning and frequent app updates), then this will have to change. IT will have to become more like development: embrace change, and realize that the risk of cleaning up dead code now (stale DNS entries, network configurations and access lists, load balancer configurations, virtual machines, Puppet manifests, and so on) is less than the risk of maintaining it into the future. Similarly, realize that the effort of integrating point monitoring into an enterprise system will quickly pay off, most likely at the next trouble incident. Or before that, when the next new employee no longer has to learn multiple systems, or when the next server refresh has to migrate only 3 monitoring systems, not 15.
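
As a concrete example of one item on that cleanup list, here is a rough first pass at auditing DNS entries: resolve each name from a zone export and probe a few common ports, flagging entries nothing answers for. Everything here (the input file, the port list, the timeout) is an assumption for illustration, and the output is a review list for a human, not an automatic delete, not least because of the quarter-end problem above.

```python
# Sketch: a first pass at auditing DNS entries from a zone export. Each name
# is resolved and a few common ports probed; entries nothing answers for are
# flagged for human review, not deleted. The input file, port list, and
# timeout are assumptions for illustration.

import socket

PROBE_PORTS = (22, 80, 443)
TIMEOUT_SECONDS = 3

def host_answers(address: str) -> bool:
    """True if anything accepts a TCP connection on one of the probed ports."""
    for port in PROBE_PORTS:
        try:
            with socket.create_connection((address, port), TIMEOUT_SECONDS):
                return True
        except OSError:
            continue
    return False

def audit(names):
    for name in names:
        try:
            address = socket.gethostbyname(name)
        except socket.gaierror:
            print(f"{name}: does not resolve, dangling entry?")
            continue
        if not host_answers(address):
            print(f"{name} -> {address}: no response, review for removal")

if __name__ == "__main__":
    # Hypothetical zone export, one name per line.
    with open("zone_names.txt") as f:
        audit(line.strip() for line in f if line.strip())
```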

As virtualization, cloud, and hybrid cloud accelerate the pace of change in the datacenter, the importance of IT removing dead code and dead systems grows every day.

How is your department dealing with this?