I’m on call.
My phone rings sometime in the dead of night. I pick it up. Before looking at the screen, I say a quick prayer, silently hoping that instead of an alert, it’s actually a text from a friend (who, might I add, will promptly be unfriended in the morning).
It’s an alert…
At LogicMonitor, we use Amazon EC2 Container Service (ECS) as one of our main Docker orchestration frameworks. It offers flexibility in running containerized applications and reliably maintains service uptime and continuous container deployment. We were already familiar with AWS services and have a good portion of our infrastructure there, so implementing ECS was the clear choice.
This particular alert was reporting that a specific task running in our ECS environment was no longer functioning properly. I went through some of the usual troubleshooting steps: checked the ECS agent, reran the container locally, perused logs, connected to the AWS Console. Nothing. The application in the container simply refused to start up.
Amazon ECS uses different lingo than Docker, which can be a bit confusing. ECS has the notion of a “task definition,” which is a JSON representation of a “docker run” command. It describes the containers that form your application, along with parameters such as which ports to open, which image to use, and which data volumes to mount. After noticing that the container image had recently been updated, I needed to see the diff between the current and previous task definitions to pinpoint the problematic change.
The problem we run into is that there isn’t a quick way to compare task definition revisions with each other. The current options are the following:
- Go through CloudWatch logs. This is useful for seeing what changed recently, but tiresome when you need to identify multiple changes across different task definitions, and it makes comparing two historical task definitions difficult. It is, however, very useful for finding out exactly who made the last change (someone who, might I add, will also be unfriended in the morning).
- The simple, side-by-side comparison. You open two browsers, tab away from YouTube, minimize your Spotify, and align the task definition JSON line by line.
- Use the AWS CLI to grab two copies of the JSON and run a diff.
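For the curious, the third option can be sketched in a few lines of Python. This is only an illustration, assuming the two task definitions have already been saved (for example from `aws ecs describe-task-definition`) and loaded as dictionaries. The function name and the trimmed sample data are made up for the example.

```python
import difflib
import json


def diff_task_definitions(old, new, old_label="previous", new_label="current"):
    """Return a unified diff of two task definitions given as dicts."""
    # Sort keys and fix the indentation so that only real changes show up.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(old_lines, new_lines, old_label, new_label, lineterm="")
    )


# Hypothetical, heavily trimmed task definitions for illustration only.
previous = {
    "family": "web",
    "revision": 3,
    "containerDefinitions": [{"name": "app", "image": "example/app:1.4"}],
}
current = {
    "family": "web",
    "revision": 4,
    "containerDefinitions": [{"name": "app", "image": "example/app:1.5"}],
}

print(diff_task_definitions(previous, current))
```

Sorting the keys before diffing keeps a reordered but otherwise identical document from showing up as a change. It works, but you still have to fetch both revisions by hand every time.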
The other option would be to use the new LM Config feature. LM Config gives engineers the ability to monitor, manage, and alert on configuration files for any device. Although LM Config is primarily used for network device configurations (switches, load balancers, firewalls, etc.), its functionality can easily be extended in scintillating fashion to collect all sorts of configurations…
Like monitoring your Amazon ECS Task Definition…
Writing the ConfigSource
LM Config uses a combination of Active Discovery, custom alerting, and change detection to give users the ability to troubleshoot and correlate configuration changes with application or infrastructure performance. All of this is defined via ConfigSources, which are templates that specify how LogicMonitor handles a device’s configuration files. I created a ConfigSource to collect the JSON from the ECS task definitions and alert on changes we do not expect.
In ECS, when you register a task definition, you set a family name. The first task definition is given a revision number of 1. Subsequent task definitions with the same family name are given the next sequential revision number. I set each family name as an instance via Active Discovery. The AWS region is set as a property, so all the task definitions it grabs are region-specific.
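To make the family and revision relationship concrete, here is a small sketch that reduces a list of task definition ARNs (the format `aws ecs list-task-definitions` returns) to the latest revision per family. It is not the Active Discovery script itself; the function name, account ID, and family names are invented for illustration.

```python
def latest_revisions(task_definition_arns):
    """Map each family name to its highest revision number."""
    latest = {}
    for arn in task_definition_arns:
        # The part after the last "/" is "family:revision".
        family, _, revision = arn.rsplit("/", 1)[-1].rpartition(":")
        latest[family] = max(latest.get(family, 0), int(revision))
    return latest


# Made-up ARNs for a single region.
arns = [
    "arn:aws:ecs:us-west-2:123456789012:task-definition/web:1",
    "arn:aws:ecs:us-west-2:123456789012:task-definition/web:2",
    "arn:aws:ecs:us-west-2:123456789012:task-definition/worker:7",
]

print(latest_revisions(arns))  # {'web': 2, 'worker': 7}
```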
For data collection, I issue an API call to AWS that describes the task definition of a specified family:revision. The response comes back as JSON. Currently, I only want to know when the latest task definition has been modified, so Active Discovery grabs only the unique family names, but this could easily be adjusted to include specific revisions or all of them. From there, I set the ConfigSource to alert on any change to any task definition. It also serves as a nice backup in case an old revision is accidentally deleted, since a copy of the task definition is kept in LogicMonitor. This is also a very easy way to identify changes created by Terraform or other cloud infrastructure automation tools.
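The collection step might look roughly like the sketch below. It is not the actual ConfigSource script; it assumes a boto3-style ECS client whose `describe_task_definition` call accepts either a family name (which resolves to the latest active revision) or `family:revision`. The helper name and the `FakeEcsClient` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
import json


def fetch_task_definition(client, family, revision=None):
    """Return the task definition for family[:revision] as stable JSON text."""
    # With no revision, ECS resolves the family to its latest ACTIVE revision.
    name = family if revision is None else f"{family}:{revision}"
    response = client.describe_task_definition(taskDefinition=name)
    # default=str covers non-JSON values such as timestamps in the response.
    return json.dumps(response["taskDefinition"], indent=2, sort_keys=True, default=str)


class FakeEcsClient:
    """Hypothetical stand-in for boto3.client("ecs", region_name=...)."""

    def __init__(self):
        self.calls = []

    def describe_task_definition(self, taskDefinition):
        self.calls.append(taskDefinition)
        # Made-up, trimmed response for illustration.
        return {"taskDefinition": {"family": "web", "revision": 4}}


client = FakeEcsClient()
print(fetch_task_definition(client, "web"))
```

Emitting the JSON with sorted keys gives the change detection a stable text to compare between collections. With real credentials, the fake client would be swapped for one created per region.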
Using LM Config to monitor ECS task definition changes allows me to immediately identify changes to existing containers. I can quickly revert or fix problems that arise from typos or issues in new containers. In this case, reverting the task definition got the troublesome container back up and running happily.
I have people I need to unfriend.