Simplify Troubleshooting with AIOps

There is a lot of industry buzz around how AIOps will effect change within IT Operations (ITOps). According to Gartner, Inc., the term “AIOps” describes platforms that combine big data and machine learning to support ITOps. This means that the problems being solved aren’t novel; the approach is. In ITOps, as in any other business unit, there are two primary constraints: time and money. All other problems are derivatives of these two. Any solution or software that is adopted is expected to optimize at least one of them. The recent push to adopt AIOps-based technologies can be attributed to three challenges that ITOps faces today: the volume, variety, and velocity of the data generated. Ultimately, understanding and managing these three challenges efficiently results in time and money savings.

Understanding infrastructure dependencies helps manage the volume of data that enterprises are tasked with handling. Intelligent alert suppression based on dependency lets users focus on actionable alerts and filter out excessive downstream noise. Given the sheer volume of infrastructure enterprises manage, cutting out the excess alerts generated when a switch, service, or other critical resource goes down greatly reduces the time spent troubleshooting and decreases the MTTR (mean time to resolution). Once the dependency relationships have been established, the root cause of an issue becomes much more apparent, because the only alert the user receives is at the root of the issue (such as a switch going down). These time savings let ITOps focus more on business-critical tasks, such as a cloud migration, and less on troubleshooting meaningless alerts.
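To make the idea of dependency-based suppression concrete, here is a minimal sketch in Python. It assumes the dependency relationships are already known and stored as a simple mapping from each resource to the resources it depends on. The function name `find_root_alerts` and the resource names are hypothetical, chosen only for illustration; a real platform would discover and model dependencies in its own way.

```python
from typing import Dict, List, Set


def find_root_alerts(depends_on: Dict[str, List[str]], alerting: Set[str]) -> Set[str]:
    """Return the alerting resources that have no alerting upstream dependency.

    depends_on maps a resource to the resources it directly depends on
    (hypothetical structure). Alerts on every other resource are treated
    as downstream noise and suppressed.
    """

    def has_alerting_ancestor(node: str, seen: Set[str]) -> bool:
        # Walk upstream through the dependency map, skipping nodes already visited
        # so that accidental cycles do not cause infinite recursion.
        for parent in depends_on.get(node, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in alerting or has_alerting_ancestor(parent, seen):
                return True
        return False

    return {r for r in alerting if not has_alerting_ancestor(r, {r})}


# Hypothetical topology: two web hosts behind switch-a, an app that needs web-01.
topology = {
    "web-01": ["switch-a"],
    "web-02": ["switch-a"],
    "app-01": ["web-01"],
    "db-01": ["switch-b"],
}

# switch-a goes down and everything behind it starts alerting.
active = {"switch-a", "web-01", "web-02", "app-01"}
print(find_root_alerts(topology, active))
```

In this example only the alert on `switch-a` survives, which is the single alert the user needs to act on.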

Dynamic alerting based on anomaly detection helps ITOps manage the variety of data they interact with. Basing alerts on the normal behavior of a signal or metric takes the guesswork out of setting thresholds. This type of technology augments the static thresholds present in modern monitoring tools and enables ITOps to monitor and alert on metrics that cannot be easily adapted to static thresholds, such as those from ephemeral environments and services. Having an algorithm use the historical data of a signal to predict what is anomalous frees up the time otherwise spent configuring traditional static thresholds. On top of this, by using anomaly detection for alerting, users become more proactive in their monitoring, since they may be able to catch issues before they breach a static threshold. Catching anomalous behavior early may avert a potential service outage, which can be detrimental to a business. With a dynamic alerting strategy based on anomaly detection, ITOps can save time setting static thresholds and potentially avoid a service-disrupting outage.
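As a rough illustration of alerting on a signal's normal behavior, the sketch below flags a data point when it deviates from the mean of the preceding window by more than a chosen number of standard deviations. A rolling z-score is only one simple stand-in for anomaly detection and is not necessarily what any particular AIOps platform uses. The function name `detect_anomalies`, its parameters, and the sample values are assumptions made for this example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def detect_anomalies(
    values: Sequence[float],
    window: int = 10,
    z_threshold: float = 3.0,
    min_sigma: float = 1e-9,
) -> List[int]:
    """Return the indexes of points that deviate sharply from recent history.

    Each point is compared with the mean and standard deviation of the
    `window` points before it. The first `window` points are never flagged
    because there is not enough history to judge them.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples (percent) with one spike to 75.
cpu_percent = [50, 52, 49, 51, 50, 48, 51, 50, 49, 52, 51, 50, 75, 50, 49]
print(detect_anomalies(cpu_percent))
```

The spike to 75 is flagged even though it would not trip a static threshold set at, say, 90 percent, which is the proactive behavior described above.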

Both of these functions, alert suppression by dependency and anomaly detection, are vital in laying a solid foundation for additional AIOps tooling. They provide intelligence relevant for managing the current state of the infrastructure and assist ITOps by reducing troubleshooting time. However, these two approaches only provide a glimpse of how AIOps can help ITOps teams manage the growing volume, variety, and velocity of data.

While anomaly detection and alert suppression certainly solve problems present within ITOps, they lack the systemic knowledge of how to resolve an issue as it arises. One of the largest value gains that AIOps will provide ITOps comes in the form of “opinionated monitoring”. Opinionated monitoring means that monitoring platforms are not only intelligent in their analysis but also understand what to do next in the event of an alert or incident. In many situations, this involves simple tasks, such as restarting an SNMP service, rebooting a host that goes down unexpectedly, or simply providing guidance to admins on how to remediate an issue. All of these scenarios rely on an AIOps platform learning what the common scenarios in ITOps are and how to troubleshoot them. This layer of action-oriented, opinionated monitoring provides immediate time and money savings for ITOps by enabling engineers to focus less on small, time-consuming tasks and more on business-critical issues.
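The "what to do next" step can be pictured as a lookup from an alert to a suggested action. The sketch below uses a hand-written table rather than anything learned, so it only shows the shape of the idea, not how a platform would acquire this knowledge. The `Alert` class, the alert kinds, and the handler functions are all hypothetical, and the handlers return a description of the action instead of performing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    resource: str  # the host or service that raised the alert
    kind: str      # hypothetical alert type, e.g. "host_down"


def restart_snmp(alert: Alert) -> str:
    return f"restart the SNMP service on {alert.resource}"


def reboot_host(alert: Alert) -> str:
    return f"reboot {alert.resource}"


# Hand-written runbook; a real platform would learn or curate these mappings.
RUNBOOK: Dict[str, Callable[[Alert], str]] = {
    "snmp_unresponsive": restart_snmp,
    "host_down": reboot_host,
}


def next_action(alert: Alert) -> str:
    """Suggest a remediation, or fall back to guidance for an admin."""
    handler = RUNBOOK.get(alert.kind)
    if handler is None:
        return f"no automated action for '{alert.kind}'; escalate {alert.resource} to an admin"
    return handler(alert)


print(next_action(Alert(resource="web-01", kind="host_down")))
print(next_action(Alert(resource="db-01", kind="disk_latency")))
```

Keeping an explicit fallback means an unrecognized alert still reaches a person with context, rather than being silently dropped.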

While AIOps is certainly exciting, it is important to manage expectations and understand that the need to optimize both time and money will always be present. AIOps-empowered technologies simply provide a new and innovative way to address that need.