The safety catch to this is good monitoring. We demonstrated this to ourselves this morning. We did a software release on some of our servers last night. This particular release involved quite a few changes to various components, including changes to several configuration files. As we have an excellent tech ops team, all our server configuration is automated with puppet, so that everything is scalable, repeatable, and manageable over a growing fleet of infrastructure.
So last night we upgraded a portion of our servers: the new puppet configuration matching the new software deployment was run, configuration files were modified, the software was upgraded, and everything was happy.
Until this morning, when we got an alert about one of the upgraded servers: it was no longer submitting requests to the sitemonitor service, which checks websites for performance, availability, and reachability from various places on the Internet, external to the customer's infrastructure. The cause was quickly identified, apparent in a log file at the time the alert was triggered: the server, running the new code, was suddenly trying to talk to the sitemonitor service using a configuration that only worked with the old code.
Why? Well, not all servers were upgraded to the current release last night. So we had some servers running the old code and some running the new, each with different configuration file requirements controlled by puppet. Normally puppet runs as a cron job, applying the standard production manifest. But we didn't have a good way to apply two different puppet configurations to two different classes of servers, depending on the version of the production code they were running. So, as an interim solution, the engineer who did the release stopped the periodic puppet cron job on the machines being upgraded, ran puppet once to apply the new-version manifest, upgraded the software, and left puppet disabled, intending to re-enable it once all servers were updated and the changes could be applied to the production puppet manifest that all servers run.
However, among the many things we monitor on our infrastructure is whether puppet has run correctly on each server. So, naturally, an alert was triggered that puppet had not been running on the upgraded servers. Another member of the tech ops team corrected the "problem" by re-enabling the puppet crontab; puppet reverted the changed files to the prior release's configuration (which all the other, not-yet-upgraded machines were using); and the servers could no longer communicate correctly with the sitemonitor service.
But as I mentioned, this too triggered an alert, allowing us to revert the incorrect puppet run and restore service for the small set of customers briefly affected.
Some lessons learned:

- Clearly our puppet infrastructure needs to deal better with servers expecting different configurations depending on the code version they are going to run. The right version can't always be determined at runtime by the puppet process querying the running code, as the configuration changes often need to be in place before the new code can be installed. Our team has some ideas to fix this (database lookups to report the desired code level, updated as part of the pre- and/or post-release processes).
- The engineer who did the release should have put the puppet-success monitoring into scheduled downtime on the hosts where he disabled the puppet runs. This would have prevented the puppet alert that prompted the second engineer to 'correct' the fact that puppet was not running.
- We also need better internal communication, always a challenge given we have tech ops team members on both the east and west coasts.
- Comprehensive monitoring is essential: it told us when our configuration management tool propagated the error, and which servers had been affected.
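The "desired code level" idea in the first lesson above can be sketched roughly as follows. This is a hypothetical illustration only: the `DESIRED_VERSION` and `CONFIG_TEMPLATE` tables, hostnames, version numbers, and file names are all invented. In practice the host-to-version table would live in a database updated by the release process, and puppet would consult it via an external lookup when deciding which configuration to apply.

```python
# Hypothetical sketch: the release process records, per host, the code
# level that host *should* run; puppet then selects configuration by
# that desired level rather than by querying the running code. All
# names and values below are illustrative.

DESIRED_VERSION = {        # stand-in for a database table, updated as
    "web01": "2.4",        # part of the pre-/post-release process;
    "web02": "2.3",        # web01 was upgraded, web02 was not
}

CONFIG_TEMPLATE = {        # which config file each code level needs
    "2.3": "sitemonitor.conf.old",
    "2.4": "sitemonitor.conf.new",
}

def config_for(host: str) -> str:
    """Pick the config matching the host's *desired* code level, so the
    right files can be staged before the new code is installed."""
    return CONFIG_TEMPLATE[DESIRED_VERSION[host]]
```

Because the table records intent rather than current state, the correct configuration can be pushed ahead of the software upgrade itself, which is exactly the case a runtime query of the running code cannot handle.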
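The scheduled-downtime step in the second lesson can also be automated as part of the release script, so that deliberately disabling puppet never pages anyone. As a hedged illustration, assuming a Nagios-style monitoring system (not necessarily our actual setup), a helper could emit the standard `SCHEDULE_SVC_DOWNTIME` external command; the host, service, author names, and command-file path here are all invented:

```python
import time

# Hypothetical sketch, assuming a Nagios-style monitor: before the
# release script disables puppet on a host, it schedules downtime for
# that host's puppet-run check. Names and paths are illustrative.

def downtime_command(host, service, hours, author, comment, now=None):
    """Format a Nagios SCHEDULE_SVC_DOWNTIME external command for a
    fixed downtime window starting now and lasting `hours` hours."""
    start = int(now if now is not None else time.time())
    end = start + int(hours * 3600)
    # fields after the times: fixed=1, trigger_id=0, duration=0
    # (duration only applies to flexible, non-fixed downtime)
    return "[%d] SCHEDULE_SVC_DOWNTIME;%s;%s;%d;%d;1;0;0;%s;%s" % (
        start, host, service, start, end, author, comment)

# The release script would append the line to the monitor's command
# file (path is an assumption), e.g.:
#   with open("/var/lib/nagios3/rw/nagios.cmd", "a") as f:
#       f.write(downtime_command("web01", "puppet-run", 4,
#                                "release-script", "code upgrade") + "\n")
```

Pairing this with the puppet-disable step makes the silence intentional and time-boxed: if the release drags on past the window, the puppet alert fires again, which is the behavior we actually want.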
Monitoring lets us avoid another Devops Borat aphorism: “In startup we are practice Outage Driven Infrastructure.”