Enabling Dynamic Thresholds for Datapoints
FEATURE AVAILABILITY: The dynamic thresholds feature is available to users of LogicMonitor Enterprise.
Introduction to Dynamic Thresholds
Dynamic thresholds represent the bounds of an expected data range for a particular datapoint. These thresholds are based on anomaly detection algorithms that evaluate the immediately preceding three days of historical data.
When dynamic thresholds are enabled for a datapoint, alert notification routing is suppressed if the triggering value (as determined by a static datapoint threshold) is not anomalous (i.e. falls within the bounds of the expected data range). In other words, static thresholds still determine whether an alert is triggered, but dynamic thresholds determine whether the subsequent alert notifications (as configured by alert rules) are routed. The goal is to filter routed alert notifications to just those that represent anomalous data, thus reducing alert noise. Regardless of whether alert notifications are routed or suppressed based on dynamic thresholds, the originating alert itself always displays within the LogicMonitor interface.
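The two-stage behavior described above can be sketched in a few lines of Python. This is purely illustrative (LogicMonitor's internal implementation is not public); the function names and return values are invented for the example:

```python
def should_route_notification(value, static_threshold, expected_range):
    """Illustrative sketch of the two-stage logic.

    Stage 1: the static threshold decides whether an alert is triggered.
    Stage 2: the dynamic (expected) range decides whether that alert's
    notification is routed or suppressed. Either way, a triggered alert
    still displays in the LogicMonitor interface.
    """
    if value <= static_threshold:
        return None                      # no alert triggered at all
    lower, upper = expected_range
    if lower <= value <= upper:
        return ("alert", "suppressed")   # non-anomalous: routing suppressed
    return ("alert", "routed")           # anomalous: notification routed

# The CPU example from the text: static threshold of 80% CPU.
# A server that historically runs at ~90% has an expected range near 90,
# so a 90% reading triggers an alert whose notification is suppressed.
print(should_route_notification(90, 80, (85, 95)))   # suppressed
print(should_route_notification(90, 80, (40, 75)))   # anomalous, routed
```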
By default, LogicMonitor alerting and alert notification routing behavior for datapoints is based on static thresholds. And while this default behavior is ideal for many types of metrics (e.g. disk usage), there may be situations where static thresholds alone aren't flexible enough to intelligently determine whether a condition is truly cause for alarm.
For example, consider an organization that has optimized its infrastructure so that some of its servers are intentionally highly utilized at 90% CPU. This runs afoul of LogicMonitor's default static CPU thresholds which typically consider ~80% CPU (or greater) to be an alert condition. The organization could take the time to customize the static thresholds in place for its highly-utilized servers to avoid unwanted alert noise or, alternately, it could globally enable dynamic thresholds for the CPU metric. With dynamic thresholds enabled, alert notifications for servers that historically operate at 90% CPU will not be routed. However, alert notifications for servers that aren't intentionally highly utilized and run, let's say, at 70% CPU, would be routed if the static threshold of 80% was exceeded.
For situations like this, in which it is more meaningful to determine if a returned metric is anomalous (i.e. falls outside of the expected range in addition to exceeding a static threshold), you may want to consider enabling dynamic thresholds for that metric.
Note: LogicMonitor features detailed anomaly detection graphs that allow you to visualize the expected value ranges for any given datapoint/instance. As discussed in Anomaly Detection Visualization, every graph viewed from within the LogicMonitor interface offers this anomaly detection view.
Enabling Dynamic Thresholds
Note: Because dynamic thresholds are determined based on three days' worth of collected data, the feature has to be enabled for three days before it will begin suppressing alert notification routing.
When enabling dynamic thresholds for a datapoint, you have the option of enabling per alert severity. For example, as shown next, alert notification routing suppression could be considered for warning alerts only; error alerts and critical alerts would continue to be routed as usual, without consideration for whether the datapoint value is within the expected data range.
Similar to static thresholds, there are multiple levels at which dynamic thresholds can be enabled:
- Global DataSource level. Dynamic thresholds enabled at the global DataSource level cascade down to every instance (across all resources) to which the DataSource is applied.
- Instance level. Dynamic thresholds enabled at the instance level can be configured to apply to a single instance on a single resource, multiple instances on a single resource, or all instances on a single resource.
For example, if you want dynamic thresholds to apply to all relevant resources in your network infrastructure (i.e. every single instance to which the DataSource could possibly be applied), you would enable them at the global level—in the DataSource definition itself. Global enabling is recommended when a majority of the instances in your infrastructure will benefit. Alternately, if dynamic thresholds are only beneficial when applied to a small number of instances across resources, you would enable them at the instance level.
Dynamic thresholds cascade down from the global DataSource level. However, if alternate dynamic threshold configurations are encountered at deeper levels in the Resources tree, those deeper configurations will override those found at higher levels. For example, dynamic thresholds set at the instance level will override those set at the global DataSource level.
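The cascade can be sketched as a simple inheritance rule. The per-severity settings and function below are hypothetical stand-ins for what the UI configures:

```python
# Hypothetical global DataSource-level settings: dynamic thresholds
# enabled per alert severity (mirroring the per-severity checkboxes).
GLOBAL_SETTINGS = {"warning": True, "error": False, "critical": False}

def effective_settings(instance_override=None):
    """Resolve the effective dynamic threshold settings for an instance.

    Instances inherit the global DataSource-level settings unless a
    "Custom" configuration has been set at the instance level, in which
    case the instance-level settings win.
    """
    if instance_override is not None:
        return dict(instance_override)   # instance level overrides global
    return dict(GLOBAL_SETTINGS)         # inherit from the DataSource

# An instance with no override inherits the global settings;
# an instance with a Custom override uses its own settings entirely.
print(effective_settings())
print(effective_settings({"warning": True, "error": True, "critical": True}))
```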
Enabling at the Global Level
Global-level enablement for dynamic thresholds takes place in the DataSource definition. The DataSource definition can be accessed by navigating to Settings | DataSources or by clicking the Edit Global Definition hyperlink that is available when viewing DataSource or instance data from the Resources tree.
From the edit view of the DataSource definition, you're able to view and edit all datapoints associated with the DataSource. Find the datapoint for which you want to enable dynamic thresholds, click its manage icon, and, from the Edit Datapoint dialog, toggle the Dynamic Thresholds slider to the right to select the alert severities for which you'd like to enable dynamic thresholds.
Enabling at the Instance Level
Instance-level enablement for dynamic thresholds takes place on the Resources page. As discussed next, there are different entry points, depending upon whether the owning DataSource is a single- or multi-instance DataSource and whether multiple instances, when present, are organized into instance groups.
Enabling Dynamic Thresholds for a Single Instance
To enable dynamic thresholds for a single-instance DataSource (and thus a single instance), navigate to the DataSource in the Resources tree and open the Alert Tuning tab. To enable for a single instance of a multi-instance DataSource, navigate directly to the instance itself in the Resources tree and open the Alert Tuning tab. Once at the Alert Tuning tab, find the datapoint for which you would like to enable dynamic thresholds, and click the pencil icon found in the "Effective Threshold" column. This opens the threshold wizard, which features dynamic threshold settings.
By default, instances will be set to inherit the dynamic threshold settings assigned to their parent DataSource (i.e. settings from the global DataSource level are used). To override these at the instance level, select "Custom" from the dropdown menu and place a checkmark next to the alert severities for which you'd like to enable dynamic thresholds.
For more information on using the threshold wizard, which allows you to update static datapoint thresholds in addition to enabling dynamic thresholds, see Tuning Static Thresholds for Datapoints.
Enabling Dynamic Thresholds for Multiple Instances at Once
In addition to enabling dynamic thresholds for a single instance, you can also enable dynamic thresholds for multiple instances at once. This saves time if your end goal is to enable for all instances (or a subset of instances, called an instance group) found on a resource. For more information on instance groups, see Instance Groups.
To enable multiple instances at once, use the Resources tree to navigate to either the multi-instance DataSource (assuming you want to enable dynamic thresholds for all instances on the resource at once) or one of its instance groups (assuming you want to enable for only a subset of the instances). Open the Alert Tuning tab and configure the dynamic threshold settings, as discussed in the previous section.
Viewing Alerts with Suppressed Notifications
Alerts whose notifications have been suppressed via dynamic thresholds still display as usual in the LogicMonitor interface. They are denoted with a dedicated icon (a bell with a slash through it) and feature an informational tag indicating that alert notification routing was suppressed.
When viewing alerts from the Alerts page (or when configuring the Alert List widget or Alerts report), you can use the Anomaly filter to limit alert display to only those alerts to which dynamic threshold analysis was applied.
- Anomaly = No. When the Anomaly filter is set to "No", only those alerts whose notifications were not routed as a result of evaluating dynamic thresholds are displayed.
- Anomaly = Yes. When the Anomaly filter is set to "Yes", only those alerts whose notifications were routed (i.e. they were deemed anomalous) after dynamic threshold analysis was applied are displayed.
- Anomaly = All. When the Anomaly filter is set to "All", the alert table returns to its default behavior of displaying all alerts regardless of anomaly status.
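The same kind of filtered alert query can be scripted against LogicMonitor's REST API. The `/santaba/rest/alert/alerts` resource path and the LMv1 API-token authentication scheme come from LogicMonitor's REST API documentation, but the anomaly filter's field name and syntax below are assumptions to verify against your portal's API docs; the account and token values are placeholders:

```python
import base64
import hashlib
import hmac
import time

ACCOUNT = "yourcompany"         # placeholder portal name
ACCESS_ID = "your-access-id"    # placeholder API token ID
ACCESS_KEY = "your-access-key"  # placeholder API token key

def lmv1_header(http_verb, resource_path, data="", epoch_ms=None):
    """Build an LMv1 Authorization header for a LogicMonitor REST request.

    Per LogicMonitor's docs, the signature is the base64 encoding of the
    hex HMAC-SHA256 digest of verb + epoch (ms) + body + resource path.
    """
    epoch = str(epoch_ms if epoch_ms is not None else int(time.time() * 1000))
    msg = http_verb + epoch + data + resource_path
    digest = hmac.new(ACCESS_KEY.encode(), msg.encode(), hashlib.sha256).hexdigest()
    signature = base64.b64encode(digest.encode()).decode()
    return f"LMv1 {ACCESS_ID}:{signature}:{epoch}"

# List alerts, filtering on an assumed "anomaly" field that mirrors the
# UI's Anomaly filter (verify the actual field name in your portal).
resource_path = "/alert/alerts"
url = f"https://{ACCOUNT}.logicmonitor.com/santaba/rest{resource_path}"
params = {"filter": "anomaly:true"}  # assumed filter syntax
headers = {"Authorization": lmv1_header("GET", resource_path)}
# e.g. with the requests library: requests.get(url, headers=headers, params=params)
```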
A Closer Look at How Dynamic Thresholds Work
The dynamic thresholds for a datapoint's expected range are calculated based on the previous three days of data. This means that dynamic thresholds have to be enabled for a datapoint for three days before any alert notification routing suppression takes place.
If dynamic thresholds haven't been enabled for three days (or there is a significant amount of missing data during the previous three days), LogicMonitor will not attempt to determine if the datapoint value is anomalous; rather, alert notifications will be routed as usual until enough data is present.
Assuming there is enough data for an expected range to be accurately defined, LogicMonitor evaluates the datapoint value for a triggered alert against the expected range. If the value falls within the expected range, the alert is considered to represent non-anomalous data and its subsequent notification is not routed. If the value falls outside of the expected range, the alert is considered to represent anomalous data and its subsequent notification is routed.
In this anomaly detection graph, the static threshold for the datapoint is set at >100,000,000 nanoseconds. Although many values exceed this threshold over the course of the 20 hours depicted here, the majority of them still fall within the expected range, which is shaded in blue. If dynamic thresholds were enabled for this datapoint, only those alerts triggered by the red values (i.e. those values surpassing the upper bound of the expected range) would have their notifications routed; all other alert notifications would be suppressed.