Automation is part of the cultural DNA at LogicMonitor. Whether we are automating parts of our own jobs or automating network monitoring so our customers don’t have to think about it, the goal keeps evolving. I work on the Technical Operations team (by the way, we’re hiring!), which handles availability of our SaaS-based LogicMonitor application for all customers, including ourselves. We are always looking for opportunities to automate. Our creative side aspires to inventive solutions that also remove the human error factor. We use automation to solve scalability problems - the days of copying and pasting commands across a couple of remote terminals are long gone. Seven webservers? How about seven hundred, with frequently changing hostnames?
When I started at LogicMonitor nearly two years ago, several projects were well underway to break apart a monolithic application into a Service-Oriented Architecture (SOA). This meant more servers, more containers, more complexity, and so on. Coupled with this change was constant growth. The larger problem to solve was twofold: how do we manage an increasingly complex infrastructure while ensuring that all of our production environments stay in sync so all customers have a consistent experience, and where in the world are all of our applications running? New approaches and tools were needed to address these problems.
Problem 1: Sustainably Scaling Infrastructure
We had been using Amazon Web Services (AWS) for several years, for the typical reasons one adopts the public cloud. A variety of hand-built tools had been developed to help manage the creation of these resources. We had bits and pieces here and there, but nothing to comprehensively manage our "infrastructure templates". This approach produced inconsistent results. Even after combing over our documentation runbooks, there was often a missing command or API call that was needed to bring the environment to production status (snooping through your co-worker's .bash_history, anyone?). Cue Terraform to the rescue.
From the HashiCorp website, "Terraform lets operators easily use the same configurations in multiple places to reduce mistakes and save time". In other words, you write snippets of code which Terraform uses to create all of your cloud resources, such as EC2 instances or SQS queues. The code can be written in a template format so that you can easily create multiple environments by changing a couple of variables. When you are ready to remove the resources, Terraform can take care of that too. We already had a positive experience with Packer, another HashiCorp tool, which we use to build our Amazon Machine Images (AMIs), so we gave Terraform a try. Our first use of Terraform involved creating testing/quality assurance environments, and the experiment went well. More members of the team got involved, and Terraform soon became our standard for creating any cloud resource.
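To make the "same template, different variables" idea concrete, here is a minimal sketch in Python that emits Terraform's JSON configuration syntax for two environments. It is illustrative only and is not the configuration described in this post: the resource labels (`web`, `jobs`), the AMI ID, the instance types, and the environment names are all hypothetical, and a real configuration would also need a provider block.

```python
import json


def render_environment(name, ami_id, instance_type, web_count):
    """Build a Terraform JSON configuration (dict) for one environment.

    Every environment gets the same set of resources; only the
    variables passed in differ.
    """
    return {
        "resource": {
            "aws_instance": {
                "web": {
                    "count": web_count,
                    "ami": ami_id,
                    "instance_type": instance_type,
                    "tags": {"Name": f"{name}-web", "Environment": name},
                }
            },
            "aws_sqs_queue": {
                "jobs": {"name": f"{name}-jobs"},
            },
        }
    }


# Hypothetical environments: placeholder AMI ID and sizes, not real values.
ENVIRONMENTS = {
    "qa": {"ami_id": "ami-00000000", "instance_type": "t2.small", "web_count": 1},
    "prod": {"ami_id": "ami-00000000", "instance_type": "m4.large", "web_count": 7},
}

for env_name, variables in ENVIRONMENTS.items():
    config = render_environment(env_name, **variables)
    # Each rendering would be saved as main.tf.json in its own directory.
    print(f"# {env_name}/main.tf.json")
    print(json.dumps(config, indent=2))
```

In practice Terraform's own variables and modules cover this without a generator script; the sketch only shows why one template can keep many environments consistent.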
One unforeseen upside of adopting Terraform was that the code you write serves both as documentation for how infrastructure is built and as a description of existing infrastructure. All of those undocumented commands became part of the code. We were able to provision new infrastructure that matched the old, and to keep our older infrastructure up to date as we made changes along the way.
Problem 2: Where is that process running?
Adopting an SOA model drives an ever-expanding, complex, and dynamic infrastructure. Service coordination tools such as ZooKeeper are commonly found in these types of environments. We realized we needed a way to perform dynamic service discovery - both so that services could know how to talk to one another and so that we, as operators, could determine where any given service was running. Consul, another tool from HashiCorp, filled this requirement and more. Consul lets you write your own custom service checks (say a GET request to a specific REST endpoint, match an expected return code and capture the output) which you can then query to determine both health status and the location where the service is running.
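As a rough sketch of that workflow, the Python below builds a service definition with an HTTP check and then reduces a response from Consul's health endpoint to a list of locations. The service name, port, health path, and agent address are assumptions made for illustration; the check fields (`http`, `interval`, `timeout`) and the `/v1/health/service/<name>?passing` endpoint are part of Consul itself.

```python
import json
import urllib.request

# Assumption: a Consul agent listening locally on its default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def service_definition(name, port, health_path):
    """Return a Consul service definition with an HTTP health check.

    Consul treats any 2xx response from the check URL as passing.
    """
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {
                "http": f"http://localhost:{port}{health_path}",
                "interval": "10s",
                "timeout": "2s",
            },
        }
    }


def locate(entries):
    """Reduce a health-endpoint response to (node, address, port) tuples."""
    locations = []
    for entry in entries:
        # An empty service address means the service uses the node's address.
        address = entry["Service"].get("Address") or entry["Node"]["Address"]
        locations.append((entry["Node"]["Node"], address, entry["Service"]["Port"]))
    return locations


def healthy_instances(name):
    """Ask the agent where passing instances of a service are running."""
    url = f"{CONSUL_ADDR}/v1/health/service/{name}?passing"
    with urllib.request.urlopen(url, timeout=5) as response:
        return locate(json.load(response))
```

Saving `json.dumps(service_definition("api", 8080, "/health"))` into the agent's configuration directory would register the hypothetical `api` service, and `healthy_instances("api")` would then report where it is running.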
Beyond using Consul as a service inventory, we built several more integrations with it for the purpose of automation and reliability. Our software deployments use Consul queries to dynamically determine which applications need upgrading and where they are running. Our proxies and load balancers use consul-template to automatically write their routing configuration files. This has reduced the chance of human error and given us a scalable model, so we can continue to manage more complex systems.
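The idea behind generated routing configuration can be sketched in a few lines of Python. This is a stand-in for illustration, not consul-template itself (which uses its own template language and watches Consul for changes), and the nginx-style `upstream` output format, the `api` service name, and the addresses are assumptions rather than details from our setup.

```python
def render_upstream(service, instances):
    """Render an nginx-style upstream block from (address, port) pairs.

    Instances are sorted so the output stays stable between runs,
    which avoids needless proxy reloads when nothing has changed.
    """
    lines = [f"upstream {service} {{"]
    for address, port in sorted(instances):
        lines.append(f"    server {address}:{port};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical healthy instances, as a service-discovery query might return them.
print(render_upstream("api", [("10.0.0.2", 8080), ("10.0.0.1", 8080)]))
```

Because the file is rebuilt from whatever is currently healthy, nobody has to edit server lists by hand when hosts come and go.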
These are examples of how we solved some of our infrastructure growing pains using modern tools. If you're attending HashiConf in Austin next week, I'll be down there with my team to answer any questions about these topics. Not attending the show and want to know more about the forthcoming LogicMonitor + Terraform integration? Stay tuned - we have a blog post just for you.