Opinion

Monitoring as an acceptance test for configuration management tools


by Steve Francis, Founder and Chief Product Officer at LogicMonitor, Apr 12, 2013

As Devops Borat says:  “To make error is human. To propagate error to all server in automatic way is #devops.”

The safety catch to this is good monitoring. We demonstrated this to ourselves this morning. We did a software release on some of our servers last night. This particular release involved quite a few changes to various components, including several configuration files. We have an excellent tech ops team, and they automate all our server configurations with puppet, so that everything is scalable, repeatable, and manageable across a growing fleet of infrastructure.

So last night we upgraded a portion of our servers: the new puppet configuration that matched the new software deployment was run, configuration files were modified, the software was upgraded, and everything was happy.

Until this morning, when we got an alert about one of the upgraded servers: it was no longer submitting requests to the sitemonitor service, which checks websites for performance, availability, and reachability from various places on the Internet, external to the customer’s infrastructure. The cause was quickly identified from a log entry at the time the alert was triggered: the server, now running the new code, was suddenly trying to talk to the sitemonitor service using a configuration that only worked with the old code.

Why? Well, not all servers were upgraded to the current release last night, so we have some servers running the older code and some running the new code, each with different configuration file requirements controlled by puppet. Normally, puppet runs as a cron job and simply applies the standard production manifest. But we didn’t have a good way to apply two different puppet configurations to two different classes of servers, depending on the version of production code they were running. So, as an interim solution, the engineer who did the release stopped the periodic puppet cron job on the machines being upgraded, ran puppet once to apply the new-version manifest, upgraded the software, and left puppet disabled, intending to re-enable it once all servers were updated and the changes could be folded into the production puppet manifest that all servers run.
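
(For illustration only; this is not a description of what we were running.) One way to apply different puppet configurations to different classes of servers is puppet’s external node classifier hook: a small script that puppet calls with a node’s name and that returns YAML assigning that node an environment and classes. Here is a minimal sketch in Python, where the hostnames, environment names, and class name are entirely hypothetical:

    #!/usr/bin/env python
    # Hypothetical external node classifier (ENC) sketch. Puppet invokes this
    # script with the node's certname as its only argument and expects YAML
    # on stdout describing that node's environment and classes.
    import sys
    import yaml

    # Assumption: hosts already cut over to the new release are tracked here;
    # in practice this would come from a deployment database or tool.
    UPGRADED_HOSTS = {"collector03.example.com", "collector07.example.com"}

    def classify(node):
        env = "release_new" if node in UPGRADED_HOSTS else "production"
        return {
            "environment": env,                          # selects which manifests apply
            "classes": {"sitemonitor::collector": {}},   # hypothetical class name
        }

    if __name__ == "__main__":
        print(yaml.safe_dump(classify(sys.argv[1]), default_flow_style=False))

With the upgraded hosts placed in their own environment, both classes of servers could keep running puppet from cron, and nobody would have to remember to turn it back on.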

However, among the many things we monitor on our infrastructure is whether puppet has run correctly on the servers. So, naturally, an alert was triggered that puppet had not been running on the upgraded servers; another member of the tech ops team corrected the apparent problem by re-enabling the puppet crontab; puppet reverted the changed files to the prior release’s configuration (the version all the machines that had not yet been upgraded were still using); and the upgraded servers could no longer communicate correctly with the sitemonitor service.
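
A check like that does not have to be elaborate. As a rough sketch (not how our monitoring actually implements it), one could read puppet’s own run summary file and alert if the last run is stale or had failures; the file path and the one-hour threshold below are assumptions based on puppet’s default layout at the time:

    #!/usr/bin/env python
    # Minimal sketch of a "did puppet run, and did it succeed?" check.
    # Assumes the default (pre-Puppet 4) state file location and PyYAML.
    import sys
    import time
    import yaml

    SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
    MAX_AGE = 60 * 60  # hypothetical threshold: alert if no run within the last hour

    def check():
        with open(SUMMARY) as f:
            summary = yaml.safe_load(f)
        age = time.time() - summary["time"]["last_run"]
        failed = summary.get("resources", {}).get("failed", 0)
        if age > MAX_AGE:
            return 2, "puppet has not run for %d minutes" % (age // 60)
        if failed:
            return 2, "last puppet run had %d failed resources" % failed
        return 0, "last puppet run %d minutes ago, no failures" % (age // 60)

    if __name__ == "__main__":
        status, message = check()
        print(message)
        sys.exit(status)  # a non-zero exit is what the monitoring system turns into an alert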

But, as I mentioned, this configuration reversion also triggered an alert, allowing us to undo the incorrect puppet run and restore the briefly impacted service for the small set of customers affected.

Lessons?

Monitoring lets us avoid another Devops Borat aphorism: “In startup we are practice Outage Driven Infrastructure.”