IT monitoring is a complex field with several approaches to managing monitoring and alerts. Most current monitoring solutions provide Static Threshold-Based alerting, where IT Operations staff are notified when resource utilization breaches a defined threshold. The problem with Static Thresholds is that they are adjusted manually, and tuning them to meet the specific environment and needs of an organization is a major challenge for IT Operations teams.

Dynamic thresholds, on the other hand, offer a more adaptive approach, automatically adjusting thresholds based on real-time data collection and reducing the noise of unnecessary alerts. However, they are not a one-size-fits-all solution and may not always be the best option in every scenario.

In this article, we’ll explore the pros and cons of both static and dynamic thresholds, providing insights on when to use each to best fit your monitoring needs.

When to choose dynamic thresholds

Understanding the need for varying thresholds

Identifying the proper thresholds for performance counters is no easy task. Moreover, a single tuned threshold limits flexibility across applications: it effectively means the same threshold is used on many servers even though those servers run different applications. For example, 70% CPU utilization on a busy server is normal and doesn't need to generate an alarm, whereas on a relatively underutilized server, even 50% CPU utilization could mean something is wrong. Likewise, the same asset (such as a server or firewall) does not exhibit the same performance during different hours of the day or days of the week, simply because the load is different.

My favorite example is the Active Directory server, which typically attracts a lot of traffic during morning hours when people log in but goes quiet during off-business hours, including weekends. Setting a reliable static threshold is always a challenge in environments where the load is not constant and shows seasonal characteristics.

Managing alert fatigue: How dynamic thresholds help

Manually adjusting thresholds takes time, and until they are tuned correctly, the monitoring solution can miss real issues while reporting a lot of false positives, flooding the mailboxes of IT Operations teams with false alarms. The alert fatigue caused by this noise increases the risk of missing true positives.

Dynamic thresholds not only adapt to real-time data but also enable more proactive issue and anomaly detection, allowing IT teams to address potential problems before they escalate.

Handling cyclic variations with dynamic thresholds

Static Thresholds also handle cyclic variations poorly. A performance counter may show normal weekly and monthly variations that are acceptable to the business, but manually maintaining different thresholds for specific periods is time-consuming and prone to false alarms.

When to use static thresholds

Smart monitoring solutions analyze the patterns of metrics, learn what is normal in the environment, and generate alerts only when metrics fall outside that established normal. These solutions need to be aware of cyclic variations and should accommodate changes in a metric's pattern across cycles. Since tuning is automatic, it is less of a hassle. Infrastructure monitoring tools that visualize these patterns and help create thresholds automatically are far less time-consuming than those that require manual adjustment.

Having said that, there are some scenarios where it makes more sense to use Static Thresholds. For example, when you want to be notified whenever a metric value changes from its previous value, i.e., on delta. In this case, it's best to use Static Thresholds, as Dynamic Thresholds work on the data stream itself, not on the rate of change between consecutive values. Additionally, using Dynamic Thresholds on status values like API response codes (200, 202, 404, etc.) will not be helpful, because response codes are categorical rather than truly numerical values, and a confidence band generated on them would be misleading.
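
To make the delta case concrete, here is a minimal Python sketch of delta-based alerting under a static rule; the function name and logic are illustrative assumptions, not LogicMonitor code:

def delta_alert(previous, current):
    # Fire whenever the value changes from the last poll.
    # A status code flipping from 200 to 404 differs by 204, but that
    # magnitude is meaningless -- only the fact of the change matters,
    # which is why a confidence band is the wrong tool here.
    return current != previous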

How LogicMonitor uses dynamic thresholds to reduce alert noise

The most prominent problem IT monitoring teams experience with Static Thresholds is the deluge of alerts and the difficulty of picking out what is truly useful and actionable from the noise. LogicMonitor solved this problem in a phased manner, with the first phase focused on reducing alert noise. We built a system that analyzes the patterns of metrics, generates Dynamic Thresholds, and leverages these thresholds to reduce alert noise. When Static Thresholds are poorly set (or inherited from default settings), the monitoring solution generates countless alerts, most of them useless. We now use confidence bands generated by sophisticated machine learning algorithms, aka Dynamic Thresholds, to stop this alert noise. When an alert is triggered and the value falls within the confidence band, our system does not route that alert; the alert is effectively suppressed.

We use two independent components to achieve this alert reduction: an algorithm-centric service that generates confidence bands at regular intervals, and our alerting system, which consumes those bands and decides whether or not to route each alert.
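
In pseudocode terms, the routing decision reduces to a simple band check. The Python sketch below is purely illustrative; the names and band shape are assumptions, not LogicMonitor internals:

def should_route(alert_value, band):
    # band: the most recent confidence band for this datapoint,
    # e.g. {"low": 20, "high": 60}
    inside_band = band["low"] <= alert_value <= band["high"]
    # Suppress alerts whose triggering value looks normal.
    return not inside_band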

High level architecture graph between customer infrastructure and LogicMonitor systems

This alert suppression feature was released to our customers in December 2019. While phase one was all about suppression, phase two is about generating alerts from the bands produced by the ML algorithm. In phase two, we are introducing the ability to define Dynamic Thresholds and generate alerts based on those definitions. This gives users a powerful way to tune alert severity by quantifying how far the current reading deviates from the normal baseline identified by our ML algorithm.

When alert suppression and alert generation are combined, false positives are minimized and true positives are maximized. LogicMonitor users get the best of both worlds: alerts generated by poorly set Static Thresholds are suppressed, reducing noise, and when a metric value goes beyond a Dynamic Threshold, our alert engine generates an alert. We have built a sophisticated user interface for defining Dynamic Thresholds, and we also provide a visual aid to help tune these settings.

For instance, users can choose to generate a warning alert when 60% of the values from the last five polls deviate by more than one band width above the upper bound.

For example:

Confidence band: (low: 20, middle: 40, high: 60)

HighBand = high - middle = 20

LowBand = middle - low = 20

Suppose the values for the last five polls are 65, 82, 81, 70, and 84. Three of them [82, 81, 84], or 60%, are more than one band above the high bound (60 + 20 = 80), so our engine triggers a warning alert.

The alert engine works on a sliding window, considering the last number_of_polls values for each evaluation.
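
A minimal Python sketch of this evaluation, assuming the band layout from the example above (the function and field names are illustrative, not LogicMonitor's actual implementation):

def warning_triggered(values, band, ratio=0.6):
    # values: the last number_of_polls readings (sliding window)
    # band: e.g. {"low": 20, "middle": 40, "high": 60}
    band_width = band["high"] - band["middle"]  # HighBand = 20
    threshold = band["high"] + band_width       # one band above high = 80
    breaches = [v for v in values if v > threshold]
    return len(breaches) / len(values) >= ratio

# Example from above: three of five values exceed 80, so the warning fires.
print(warning_triggered([65, 82, 81, 70, 84], {"low": 20, "middle": 40, "high": 60}))  # True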

Users can use the interactive chart displayed in the following image to tune the Dynamic Threshold definition.

Dynamic Thresholds definition at the instance level

LogicMonitor has also enhanced the alert workspace: each alert generated using Dynamic Thresholds is now accompanied by a confidence band graph with additional context. This graph is also included in email notifications.

Confidence band graph on the alert page

With this feature, LogicMonitor’s AIOps team has built a system that provides more value to customers and reduces countless hours spent manually adjusting Static Thresholds. We will continue to enhance this feature and our confidence band generator system to provide more value to customers in the future.

Implementing dynamic thresholds: A step-by-step guide

Implementing dynamic thresholds can greatly enhance your IT monitoring by reducing noise and focusing on meaningful alerts. Here’s a step-by-step guide to help you implement dynamic thresholds effectively:

Step 1: Initial configuration

Start by identifying the key performance metrics that are critical to your environment. This includes CPU utilization, memory usage, network latency, and other performance indicators that typically vary based on workload. Configure your monitoring solution to continuously collect and analyze the datasources for these metrics.

Step 2: Leverage historical data

Dynamic thresholds rely on historical data to establish patterns of normal behavior. Use historical data spanning different time periods—such as weeks or months—to capture cyclic variations. For instance, analyze daily, weekly, and seasonal trends to set a solid baseline. This baseline will be crucial for the system to automatically adjust thresholds according to the typical behavior observed during different cycles.
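
As a rough illustration of what such a baseline might look like, the sketch below computes a per-hour-of-week expected range from historical samples using a mean plus or minus a multiple of the standard deviation. Real products use more sophisticated ML; treat this purely as a conceptual example with assumed names:

import statistics
from collections import defaultdict

def hourly_baseline(samples, k=3):
    # samples: list of (timestamp, value) pairs, timestamp as a datetime
    # Group values by (weekday, hour) to capture daily and weekly cycles.
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    baseline = {}
    for key, values in buckets.items():
        mean = statistics.mean(values)
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        baseline[key] = (mean - k * stdev, mean + k * stdev)  # (low, high)
    return baseline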

Step 3: Fine-tune threshold sensitivity

Not all alerts are created equal. Adjust the sensitivity of your dynamic thresholds to align with the criticality of each metric. For example, set tighter thresholds for metrics where deviations can lead to immediate service impacts and looser thresholds for less critical metrics. Use a sliding window pattern to evaluate metric deviations over recent data points, allowing your system to respond swiftly to real-time changes while avoiding overreaction to minor fluctuations.
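
One simple way to express this kind of per-metric sensitivity is a band multiplier: critical metrics get narrow bands, less critical ones get wide bands. Again, a hypothetical sketch rather than any product's actual configuration:

# Hypothetical sensitivity settings: smaller multiplier = tighter band.
SENSITIVITY = {
    "api_latency_ms": 2.0,   # user-facing, alert early
    "cpu_percent": 3.0,      # normal variation is fine
    "disk_temp_c": 4.0,      # rarely actionable on its own
}

def band_for(metric, mean, stdev):
    k = SENSITIVITY.get(metric, 3.0)
    return (mean - k * stdev, mean + k * stdev)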

Step 4: Integration with existing monitoring systems

Ensure that your dynamic thresholding integrates seamlessly with your existing IT monitoring tools. This might involve configuring APIs, plugins, or other connectors to feed data into your monitoring solution. It’s essential to keep your monitoring environment cohesive so that alerts generated from dynamic thresholds are visible and actionable alongside alerts from static thresholds or other monitoring rules.
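
For example, many teams forward dynamic-threshold alerts into an existing ticketing or chat system through a webhook. The snippet below posts a JSON payload using Python's standard library; the URL and payload fields are placeholders for whatever your integration expects:

import json
import urllib.request

def forward_alert(alert):
    # alert: dict describing the anomaly, e.g. metric, value, expected range
    payload = json.dumps(alert).encode("utf-8")
    req = urllib.request.Request(
        "https://ticketing.example.com/webhook",  # placeholder endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status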

Step 5: Continuous monitoring and adjustment

Dynamic thresholds are not a set-and-forget solution. Continuously monitor the performance of your dynamic thresholds and adjust them as your environment evolves. Regularly review alerts and threshold calculations to ensure they still reflect current operational patterns. Implement a feedback loop where alerts and responses are used to fine-tune the system further, enhancing its accuracy over time.

Take the next step with LogicMonitor

Ready to reduce alert noise and improve your IT monitoring with dynamic thresholds? Discover how LogicMonitor’s AIOps Early Warning System can help you streamline incident management and optimize performance. 

People around the world depend on Managed Service Providers (MSPs) to keep their businesses running like clockwork, even as their IT infrastructure evolves. Keeping workflows efficient leads to higher profits, but this can be a challenge due to a mix of on-premises infrastructures, public and private cloud, and other complex customer environments. The shift to remote work in 2020 due to the COVID-19 pandemic has only made this more challenging for MSPs. In order to adapt, more and more MSPs are investing in IT infrastructure monitoring and artificial intelligence (AI). 

Keep reading for an overview of the LogicMonitor AIOps Early Warning System and how dynamic thresholds can mitigate these common challenges and add value for MSPs.

The AIOps Early Warning System

The AIOps Early Warning System intelligently detects signals from noise. This helps you identify where to efficiently allocate your engineers’ time. Quickly identifying these signals also helps you resolve issues faster. The Early Warning System consists of four main components: anomaly detection, dynamic thresholds, topology, and root cause analysis. 

Anomaly Detection

Anomaly detection lets you visualize expected performance and compare it to historical offsets. Within the LogicMonitor platform, you can pop out an anomaly detection view to see whether what you’re seeing is normal performance or an anomaly. This saves engineers time in the troubleshooting process by allowing them to eliminate metrics or indicators that aren’t relevant. 

Dynamic Thresholds

Dynamic thresholds expand on the visual anomaly detection that we offer in our metrics. They limit alert notifications based on normal behavior, such as knowing that the CPU on a server is always hot at a certain time of day. Because they detect and alert on deviations from normal behavior, dynamic thresholds also let you catch deviations like a server CPU dropping to 0% when it is supposed to be busy.

Topology

Topology automatically discovers relationships for monitored resources. It is a key part of the next component, root cause analysis.

Root Cause Analysis

Root cause analysis leverages topology to limit alert notifications. It identifies the root incident and groups dependent alerts together. For example, if a firewall goes down, LogicMonitor knows what else depends on that firewall and sends one alert instead of many.

DC4 network mapping in LogicMonitor

How Dynamic Thresholds Add Value For MSPs

Combined with other features from LogicMonitor’s Early Warning System, dynamic thresholds can help MSPs more proactively prevent problems that result in business impact. Let’s dive a little deeper into why dynamic thresholds are a key component in issue detection. 

#1- Increase Productivity

The biggest benefit of dynamic thresholds is that they save engineers time. By learning a resource's expected range from past performance, dynamic thresholds reduce alert noise and only send alerts when an anomaly occurs. This means the alerts engineers receive are meaningful. They spend less time looking at alerts and can help more customers.

#2- Resolve Issues Faster

Dynamic thresholds don't make you wait for a static limit to be hit, which could take hours or days. They quickly detect deviations and determine whether the alert is a warning, an error, or critical. As soon as an anomaly is detected, an alert is sent to get human eyes on it. Being able to home in on the exact cause of the alert gives engineers more context, so issues can be resolved faster.

#3- Reduce Costs

Along with saving time and resolving issues more quickly, dynamic thresholds also allow MSPs to reduce costs. Experienced engineers, who are expensive, no longer need to handle monitoring and can focus on other areas of the business. Dynamic thresholds make the task of chasing thresholds easier, empowering less experienced engineers to handle monitoring, understand what's going on, and see where their attention needs to be focused. Less experienced engineers spending less time figuring out issues means more money in your pocket.

Top Virtual Machines by CPU utilization dashboard graph in LogicMonitor

The intelligence of dynamic thresholds combined with LogicMonitor’s comprehensive monitoring coverage ensures that MSPs have the visibility they need to succeed even in the most complex of environments. To learn more about how LogicMonitor can reduce costs and accelerate workflows for MSPs, check out this on-demand webinar.   

Whether you are new to the Cloud, mid-transition, or a professional at cloud or hybrid systems, no one likes being bothered with useless alerts. The options are simple: 

  1. Ignore the alert and hope it goes away
  2. Spend hours tuning and adjusting alerts
  3. Let the system tell you when an alert should be looked at

If you take the approach of ignoring the alert like a bad cold call, you risk missing a critical alert and watching your system crash around you. No one likes to open their inbox to a few hundred alerts they have been ignoring.

You could spend hours tuning and adjusting your alerts and monitoring, only to find that the ephemeral nature of the Cloud means some of those well-tuned instances have spun down and been replaced by new ones. You may also put in the time and later find that demand on your system has changed significantly, rendering those changes almost useless. And remember that tuning you did late Thursday night because you were tired of a cloud service bouncing back and forth between alerting and fine? You didn't notice that your cat Fluffy sat on the keyboard while you refilled your drink and changed that warning threshold from “95” to “95afhyDESTROYKEYBOARD128”.

Monitor Your Cloud Platforms With Confidence

Don’t worry, there is a smarter way. Using dynamic thresholding on your cloud platforms means that you can rest easy knowing that it will generate alerts for anomalies. Using dynamic thresholding will even alert you early when systems are changing over time, giving you plenty of time to remove Fluffy from your laptop and make any changes you need to prevent downtime.

Top 10 Versions by CPU Utilization dashboard showing dynamic thresholds in LogicMonitor

Let the System Work for You

Cloud monitoring can be difficult in the best of cases. In reality, it is even harder. Complications from having multiple environments spread across multiple clouds and on-premises devices can keep you up at night. LogicMonitor can help you monitor your performance across Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and your own hardware. Utilizing LogicMonitor's Dynamic Thresholds across these different environments means that you don't have to log in to each system to get a report, and you have the confidence that your system will warn you early when systemic changes are threatening system health.

Dynamic Thresholds not only know when and how to alert you to issues, they can also suppress alerts that would otherwise be triggered by normal time-based patterns. This allows you to rest easy knowing that alerts will be meaningful and that if there is an issue at any time, you will be alerted.

Connections dashboard in LogicMonitor

Visualize Success

Our cloud monitoring can be viewed next to your on-premises monitoring and across your cloud instances. This means you can visualize your data at the touch of a button. You can visualize this data with an understanding of how Dynamic Thresholds will react to it and see the bands where alerts would not be sent. This allows you to understand and even tune your thresholding if you want. Or you can kick back and relax, knowing that it is being handled by LogicMonitor.

CPU Utilization dashboard showing high thresholds on the LogicMonitor platform.

Combined with other features from LogicMonitor’s Early Warning System – such as root cause analysis and forecasting – dynamic thresholds can help you more proactively prevent problems that result in business impact. To learn more about LogicMonitor’s AIOps Early Warning System or to see it in action, sign up for a free trial.

Have you ever been paged for a critical issue and started troubleshooting, only to find an obvious drop in requests that wasn't caught by a static threshold? Or a significant increase in a metric that didn't cross a static threshold? Or evidence of warning alerts triggered long ago that should have enabled someone to resolve the issue before it caused business impact, but that were instead lost in the massive alert volume the team receives?

LogicMonitor already provides comprehensive out-of-the-box alerting to help avoid these issues. This setup leverages best-practice thresholds at three severity levels, intended to raise issues for a typical production environment. Dynamic threshold suppression then filters out noisy alerts.

Alerts for Metric Anomalies

Today we’re announcing enhancements to our dynamic thresholds that will take this one step further! With these enhancements, dynamic thresholds will also generate alerts for metric anomalies – enabling faster and easier detection for the scenarios highlighted above.

Dynamic thresholds rely on anomaly detection algorithms to calculate an expected range for a resource’s performance, based on historical data. This expected range is used to identify anomalies that should generate alerts and suppress notifications for alerts that correspond to non-anomalous performance. This ensures that alerts are generated for the right things at the right time.

Dynamic thresholds don't just identify anomalies in metric values; they also detect anomalies in a metric's rate of change (e.g., a disk that starts filling up very quickly) and in time-based patterns (seasonality, e.g., a VM that normally backs up daily or weekly).
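
To make the rate-of-change idea concrete, here is a small illustrative check (not LogicMonitor's algorithm) that flags a disk whose recent fill rate far exceeds its historical rate; names and the factor are assumptions:

def rate_anomaly(history, window=5, factor=4.0):
    # history: disk-usage samples at a fixed polling interval, oldest first
    deltas = [b - a for a, b in zip(history, history[1:])]
    if len(deltas) <= window:
        return False
    typical = sum(deltas[:-window]) / len(deltas[:-window])  # long-run fill rate
    recent = sum(deltas[-window:]) / window                  # recent fill rate
    return typical > 0 and recent > factor * typical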

How to Get Started

It's easy to get started: just enable dynamic thresholds for the desired severities, and the expected range is auto-generated. Of course, consistent with everything else in LogicMonitor, we worked hard to make this auto-generated range widely applicable with sensible defaults, but you can customize it if needed under 'Advanced Configuration'.

Once it’s enabled, you can rest assured that dynamic thresholds will generate alerts for anomalies (as well as suppress notifications for static threshold-generated alerts that don’t correspond to anomalies).

Compatible With Static Thresholds

One unique aspect of LogicMonitor’s dynamic thresholds is its ability to work well with static thresholds. It’s the best of both worlds: dynamic thresholds will compensate for static thresholds that are poorly tuned or aren’t set at all (by suppressing notifications where too many alerts are generated, and triggering alerts where alerts should be generated but aren’t), and defer to static thresholds where they are set well. The combination ensures the right alerts are generated at the right times for your team while minimizing the overhead of tuning alert conditions manually.

Combined with other features from LogicMonitor's Early Warning System – such as root cause analysis and forecasting – dynamic thresholds can help you more proactively prevent problems that result in business impact. This intelligence, combined with LogicMonitor's comprehensive monitoring coverage – from infrastructure to applications to network, running on-premises or in the cloud – ensures you have the visibility you need to succeed even in the most complex of environments. To learn more about LogicMonitor's AIOps Early Warning System or to see it in action, sign up for a free trial.

Hybrid Monitoring Platform Selection 101: Top 5 Capabilities to Look For

When looking for a new monitoring platform, the options available today can seem overwhelming. SaaS, on-premises, cloud-native… where do you even start? Below is a breakdown of five key capabilities to look for when comparing IT monitoring platforms. Although some of these capabilities are now must-haves for all platforms, certain providers still offer richer feature sets and benefits.

1. Configurable Dashboards

Dashboards can provide either a business-centric view for the user or a tactical action-oriented view. They need to be easily customizable to display both relevant, high-level cost information to the CIO and a more granular view for IT Operations teams. Configurable dashboards make it easy to absorb and share important information across different teams within a large enterprise. Ensure any monitoring solution under consideration comes with easy-to-configure dashboard functionality.

2. Pre-Built Workflow Integrations

To increase user efficiency, IT infrastructure monitoring platforms must be able to integrate with other IT operation management tools. Monitoring platforms that integrate with messaging and ticketing systems allow users to view all relevant information within a single pane of glass. Look for platforms that come with pre-built software integrations. Out-of-the-box integrations automate workflows for users while providing a single source of truth. 

3. Dynamic Thresholds

An effective monitoring platform will have pre-configured thresholds set according to domain-specific best practices. This ensures that meaningful alerts are triggered out-of-the-box. Dynamic Thresholds take this a step further by calculating an expected range for a resource’s performance and only sending out notifications for triggered alerts that correspond to values outside of this range. This ensures that alerts are only sent out for anomalies and teams only get notified when issues truly need their attention. 

4. Dependency Maps and Topological Views 

To avoid wasting time on downstream alerts, look for a platform that offers dependency maps and topological views. These capabilities speed up problem diagnostics and enable faster troubleshooting with root cause analysis (RCA). Root cause analysis identifies where an incident occurs and what its impact is on dependent resources. This feature reduces alert noise and enables IT operations engineers to focus on solving the originating issue.   

5. REST APIs 

Does the monitoring solution provider you're evaluating offer APIs to speed along integrations? Look for vendors that offer REST APIs. These allow users to programmatically query and manage resources including dashboards, devices, reports, websites, alerts, collectors, datasources, SDTs, and more. The more extensible a platform is, the more resources and integrations can run through it. APIs also provide a friendlier interface to virtual environments and other IT operations management tools, so the effort needed for integration is minimal.
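
As a generic illustration of what programmatic access looks like, the sketch below issues an authenticated GET against a monitoring API. The endpoint, auth scheme, and response shape are placeholders, since each vendor defines its own:

import json
import urllib.request

def list_devices(base_url, token):
    req = urllib.request.Request(
        f"{base_url}/devices",                         # placeholder endpoint
        headers={"Authorization": f"Bearer {token}"},  # placeholder auth scheme
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())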

Although there are many choices out there, LogicMonitor is the only SaaS-based platform that offers comprehensive IT infrastructure monitoring and intelligence capabilities for on-premises, multi-cloud, and everything in between. Interested in learning more about LogicMonitor’s capabilities? Connect with your customer success manager or attend a weekly demo to see LogicMonitor in action.

Recently a customer contacted us at support for help setting up alert thresholds on a custom datasource they had written. Upon inspecting the datasource in the customer’s LogicMonitor portal, I realized they were monitoring an HL7 feed, which I found intriguing.

HL7 stands for Health Level 7 and refers to the standards and methods for moving clinical data between independent medical applications (commonly used within hospitals). The feed is composed of human-readable text messages broken up into segments, much like an email. One of the most common types of HL7 transmission is the ADT message, which is used to record admissions, discharges, and transfers within a patient's clinical data record.

Below is an example of an HL7 message:

MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20150529090131-0500||ADT^A01|01052901|P|2.5
EVN||201505290901||||201505290900
PID|||56782445^^^UAReg^PI||DOE^JOHN||19620910|M||2028-9^^HL70005^RA99113^^XYZ|200 E 6TH ST^^AUSTIN^TX^30
OBX|1|NM|^Body Height||6|f^Feet^ISO+|||||F
OBX|2|NM|^Body Weight||180|lb^Pounds^ISO+|||||F
AL1|1||^ASPIRIN

In the message above, you'll notice each segment begins with an identifier, followed by that segment's relevant information. For example, PID identifies the patient and contains their date of birth and home address. Observations (vitals) and allergies are noted within the record as well, to prevent any potentially dangerous reactions to prescriptions.

With the constant updating of a patient's data, the ADT feed allows information to be transmitted from a clinic or hospital information system to an external application or provider in near real time. Once updated, the clinical data is usually accessed or needed in many different places, such as outpatient clinics or labs. Fortunately, our customer was not interested in following or recording personal data, which could have moved the monitoring system into scope for HIPAA compliance. Their goal was only to monitor the feed's status. To do this, their datasource uses a webpage collector to query data from the feed's API using a GET request. The feed's status is determined by a datapoint that measures the amount of time since the last message was posted. Depending on how busy the hospital location is, there are times when it is entirely acceptable for a particular feed to have zero new messages. Once a considerable amount of time has passed with no new messages, however, the customer wanted to be alerted so they could investigate.
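
A rough Python sketch of that datapoint's logic is below. The endpoint URL and response field are hypothetical, and the customer's actual implementation is a LogicMonitor webpage collector, not a standalone script:

import json
import time
import urllib.request

def seconds_since_last_message(feed_url):
    # Query the feed's API for its most recent message timestamp.
    with urllib.request.urlopen(feed_url) as resp:
        status = json.loads(resp.read())
    return time.time() - status["last_message_epoch"]  # hypothetical field

# Alert once the gap exceeds a tolerance tuned to the site's traffic.
STALE_AFTER_SECONDS = 3600
if seconds_since_last_message("https://hl7-feed.example.com/status") > STALE_AFTER_SECONDS:
    print("HL7 feed may be stalled -- investigate")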

After working with the customer on this particular issue, I came to realize how flexible LogicMonitor is. In addition to the hundreds of devices, services, and apps LogicMonitor supports out of the box, it can be customized to meet almost any monitoring need, even for use cases we hadn't thought of ourselves!