At LogicMonitor, we talk to customers and prospects constantly. This enables us to observe trends in requests for new features and functionality.
There are two major topics that prospects and customers have been raising recently:
AIOps (see this blog):
It’s such an amorphous topic that people aren’t exactly sure what they want. They mainly want to know that we are working on it. Because AIOps is such a popular term, practically all monitoring vendors have adopted the phrase for marketing and positioning, even though the same concepts have been discussed (and, in some cases, promised) for years. Note: a similar behavior is seen in the AI world - where 40% of "AI startups" don't actually use AI…
The second topic, dynamic thresholds, is different: unlike AIOps, people know what dynamic thresholds are and believe they will have a concrete, immediate effect on their operations. They are right that the effect will be concrete and immediate - but I believe it will be the opposite of what they expect: most applications of dynamic thresholds will increase alert noise, create alerts that are not meaningful, and increase the chance that major alerts go uninvestigated! Read on to see why.
What are dynamic thresholds?
There are a few different aspects of “dynamic thresholds”.
Time-based alert thresholds
Some companies define dynamic thresholds as what LogicMonitor calls time-based thresholds: different sets of thresholds that apply at different times of the day. This is a less common usage of the term, and not what the rest of this article is about. LogicMonitor has supported time-based thresholds - and, separately, time-based alert routing (sending the same alert to different destinations based on time of day) - for many years.
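To make the distinction concrete, time-based thresholds can be sketched as a simple lookup of which static threshold applies at the current hour. This is an illustrative toy, not LogicMonitor configuration; the schedule and values are assumptions (e.g. tolerating higher CPU during a nightly backup window).

```python
# Toy sketch of time-based thresholds: a different static CPU threshold
# applies depending on the hour of day. Windows and limits are illustrative.
from datetime import time

THRESHOLDS = [
    (time(1, 0), time(4, 0), 98),    # backup window: alert only near saturation
    (time(0, 0), time(23, 59), 85),  # default for the rest of the day
]

def cpu_threshold(now):
    """Return the alert threshold in effect at time `now`."""
    for start, end, limit in THRESHOLDS:
        if start <= now <= end:
            return limit
    return 85  # fallback default

print(cpu_threshold(time(2, 30)))  # 98 during the backup window
print(cpu_threshold(time(14, 0)))  # 85 otherwise
```

The point is that the thresholds themselves are still static and hand-chosen; only their applicability varies with the clock.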
Anomaly detection-based thresholds
This is what most people mean when they talk about dynamic thresholds: the ability of the monitoring system to dynamically adjust the thresholds that reported data is compared against, based on the historical values of that data. The hope is that dynamic thresholding will eliminate the need to set thresholds manually, reduce alert noise, and alert on significant issues more quickly, by identifying cases where a metric behaves anomalously with respect to its history.
In many cases, anomaly detection will provide less value (in terms of efficient, meaningful alerts) than static thresholds. There are certainly cases that dynamic thresholds (or anomaly detection) can address that static thresholds cannot, but it is important to recognize the appropriate use cases for each.
The simplest form of anomaly detection is automatic baselining. This uses the historical range of a datapoint to set (or suggest) static thresholds that encompass “most” of the historical range of the datapoint. For example, if a CPU ranges from 1% to 26%, an automatic baselining system may set thresholds to alert if the datapoint falls below 1% or exceeds 26%.
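A minimal sketch of that idea, assuming nothing beyond the text above: derive lower and upper thresholds from the observed historical range, padded by a small margin so values just at the edge don't alert. The function names and margin are illustrative, not any vendor's API.

```python
# Automatic baselining sketch: static thresholds derived from the
# historical range of a datapoint, with a small padding margin.

def baseline_thresholds(history, margin=0.05):
    """Suggest (lower, upper) thresholds spanning the historical range,
    padded by `margin` (a fraction of the range) on each side."""
    lo, hi = min(history), max(history)
    span = hi - lo
    return lo - margin * span, hi + margin * span

def is_anomalous(value, lower, upper):
    return value < lower or value > upper

# CPU samples (percent) observed over the last week:
cpu_history = [1, 5, 12, 8, 26, 14, 3]
lower, upper = baseline_thresholds(cpu_history)

print(is_anomalous(27.5, lower, upper))  # just above the usual range -> True
print(is_anomalous(15, lower, upper))    # within the usual range -> False
```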
Automatic baselines may be recalculated periodically, such as being automatically updated weekly based on new data. (Which can lead to the amusing-if-it-happens-to-someone-else outcome of the thresholds for slowly increasing disk usage also slowly increasing, until they regard the "normal" range as 95% to 110% - so the disk becomes completely full, but no alerts are triggered as the data is “normal” with respect to recent trends.)
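The drifting-baseline pitfall is easy to demonstrate. In this hypothetical sketch, disk usage creeps upward about 1% per day while the baseline is recomputed weekly from recent data, so the disk fills completely without ever looking anomalous. The margin and rate of increase are invented for illustration.

```python
# Sketch of the drifting-baseline pitfall: weekly recalculation lets
# slowly rising disk usage redefine "normal", so a full disk never alerts.

def weekly_baseline(recent_usage, margin=5.0):
    """Recompute the 'normal' range from the last week of samples."""
    return min(recent_usage) - margin, max(recent_usage) + margin

# Disk usage (percent) creeps up ~1% per day, capped at 100% (full).
usage = [min(100, 70 + d) for d in range(35)]

for week in range(5):
    window = usage[week * 7:(week + 1) * 7]
    lower, upper = weekly_baseline(window)
    latest = window[-1]
    print(f"week {week}: normal range {lower:.0f}%-{upper:.0f}%, "
          f"latest {latest}% -> alert: {latest > upper}")
# By the final week the disk sits at 100% - completely full - yet the
# recomputed "normal" range extends past it, so no alert is triggered.
```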
More sophisticated anomaly detection will use historical data to forecast expected ranges for the datapoint, taking into account short term and longer term periodicity, to create a more specific range of what is the expected value of the datapoint, and alerting if the datapoint is outside that range (or if the trend for that datapoint is different than historical trends.)
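One simple way to sketch periodicity-aware detection (an assumption on my part - real implementations use more sophisticated forecasting models) is to build an expected band per hour of day from historical samples, using the mean plus or minus a few standard deviations, and flag values outside the band for that hour.

```python
# Hedged sketch of seasonality-aware anomaly detection: an expected range
# per hour of day, computed as mean +/- k standard deviations of history.
from collections import defaultdict
from statistics import mean, stdev

def build_hourly_bands(samples, k=3.0):
    """samples: list of (hour_of_day, value) pairs from history.
    Returns {hour: (low, high)} expected bands."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        mu, sigma = mean(values), stdev(values)
        bands[hour] = (mu - k * sigma, mu + k * sigma)
    return bands

def is_anomalous(hour, value, bands):
    lo, hi = bands[hour]
    return not (lo <= value <= hi)

# Two weeks of noon CPU readings hovering around 70%:
history = [(12, v) for v in [68, 71, 70, 72, 69, 70, 71,
                             70, 69, 72, 71, 70, 68, 71]]
bands = build_hourly_bands(history)

print(is_anomalous(12, 4, bands))   # sudden drop to 4% at noon -> True
print(is_anomalous(12, 70, bands))  # typical value -> False
```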
(Note: in this article, I am only talking about triggering alerts based on anomaly detection - there are other great uses for anomaly detection in speeding up root-cause determination, which I am not discussing - yet!)
Rules are better than anomaly detection for alerting on resources
Superficially, this anomaly detection may seem beneficial - the user will be alerted if the measured CPU datapoint moves outside the historical or predicted “normal” range. However, the most commonly reported dissatisfaction with monitoring is alert fatigue and alert overload. Companies often have many thousands of alerts triggered per day, across various monitoring systems. Given this, the prospect of alerts being triggered because CPU has reached 27%, exceeding the usual range, is unlikely to be of much value. After all, the device in question still has 73% of its CPU capacity unused. CPU usage could double, and while that may be anomalous, the device would still function perfectly well. Such alerts will likely only contribute to alert fatigue and increase the chance that actionable alerts are ignored.
Applications depend on their underlying resources (CPU, memory, network bandwidth, TCP buffers, disk IO capacity, etc.) to perform. If those underlying resources are not constrained, there is little point in alerting on changes in their behavior. Detecting constraints in underlying resources is best done not by looking for changes in behavior, but by comparing measured values against actual knowledge of where resource constraints occur - for example, the number of active threads in a Java application approaching 95% of the configured maximum. Where the resource constraint can be expressed as a rule (e.g. disk usage is not constrained if it is less than 95% full), then a rule, not an anomaly, is the best way to detect it.
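A rule of this kind can be sketched as a plain predicate per metric. The metric names and threshold values below are illustrative assumptions, not LogicMonitor's built-in thresholds; the point is that each rule encodes knowledge of where the resource actually becomes constrained.

```python
# Minimal sketch of rule-based constraint alerting: each rule encodes
# domain knowledge of where a resource becomes constrained, rather than
# comparing against historical behavior. Thresholds are illustrative.

CONSTRAINT_RULES = {
    "disk_used_pct":    lambda v: v >= 95,   # disk nearly full
    "jvm_thread_pct":   lambda v: v >= 95,   # threads near configured max
    "swap_pages_per_s": lambda v: v >= 100,  # heavy memory pressure
}

def evaluate(metrics):
    """Return the names of metrics that violate a constraint rule."""
    return [name for name, value in metrics.items()
            if name in CONSTRAINT_RULES and CONSTRAINT_RULES[name](value)]

readings = {"disk_used_pct": 97, "jvm_thread_pct": 40, "swap_pages_per_s": 3}
print(evaluate(readings))  # ['disk_used_pct'] - only the real constraint alerts
```

A CPU jump from 20% to 40% trips none of these rules - anomalous, perhaps, but not constrained, and therefore not worth an alert.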
This is why LogicMonitor has embedded knowledge about thousands of different technologies and applications into the platform, with best-practice alert thresholds for various resource constraints built in. If the behavior of a metric is approaching a constraint (disk latency too high, too many memory pages being swapped, etc.), then alerts will be triggered. But changes in behavior that don’t affect performance? That’s just noise.
Anomaly detection without dependency knowledge can substantially exacerbate the flood of meaningless alerts. Applications today are composed of a myriad of interconnected services and databases. Hitting a resource constraint in Zookeeper may impact many dependent systems. Of course, all those dependent systems will behave anomalously - something they depend on is out of resources, so they will slow down or cease functioning themselves. Alerting on all the dependent anomalies just makes the real root cause that much harder to find.
Where anomaly detection and alerting is valuable
There are situations where anomaly detection is very valuable. One of them is detecting unusual changes in behavior in the opposite direction of constraints. For example: a rule can capture that you may be approaching a constraint when CPU utilization exceeds 90%. You don't need anomaly detection to tell you that CPU rose unusually - if CPU is still below 90%, you don't care, and if it's above 90%, a static alert threshold will notify you. However, if CPU is usually 70% at this time of day and suddenly drops to a sustained 4%, default static thresholds would not catch it. (Such thresholds could be created, but they do not exist by default, and constructing them adds administrative overhead.) That is a situation you may want to be alerted on - it may indicate that web requests are significantly down, possibly because of a firewall misconfiguration or another issue that warrants investigation. The trouble with these alerts is that they, too, may contribute to alert fatigue without being actionable. Alerting on anomalies that are "unusually unconstrained" should be done frugally - only when the deviation is large and sustained for a reasonably long period - in order to reduce false positives.
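That "large and sustained" criterion can be sketched as requiring several consecutive samples far below the expected level before alerting. The ratio and sample count below are invented for illustration - in practice they would be tuned to the metric.

```python
# Sketch of frugal anomaly alerting on drops: only alert when the value is
# far below its expected level AND the deviation persists across several
# samples, to reduce false positives. Parameters are illustrative.

def sustained_drop(samples, expected, min_ratio=0.5, min_points=5):
    """True if the last `min_points` samples are all below
    `min_ratio * expected` (e.g. CPU stuck at a fraction of its norm)."""
    recent = samples[-min_points:]
    return (len(recent) == min_points and
            all(v < expected * min_ratio for v in recent))

expected_cpu = 70  # typical CPU at this time of day

print(sustained_drop([68, 70, 4, 3, 4, 5, 4], expected_cpu))      # sustained -> True
print(sustained_drop([68, 70, 4, 69, 71, 70, 72], expected_cpu))  # brief blip -> False
```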
Another case where anomaly detection is useful is where there are no default thresholds or natural constraints - for example, the requests per second of a custom application. Requests per second, if the component resources are not constrained, will simply reflect demand. But if requests per second drop dramatically (which may indicate a network issue preventing requests from arriving) or rise dramatically (which may indicate requests failing and being retried repeatedly), it may be that no constraints are being hit, yet there are still issues that need investigation.
Note that static thresholds could be configured for such a custom application - in both the upper and lower directions - to trigger alerts in these cases. However, anomaly detection can clearly save time by eliminating the need for threshold configuration.
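For a metric like this with no natural constraint, one simple approach (an illustrative assumption, not a specific product feature) is to compare the current rate to the same time last week and flag large deviations in either direction, with no hand-configured thresholds at all.

```python
# Hedged sketch for metrics with no natural constraint (e.g. requests/sec
# of a custom app): flag large two-sided deviations from the rate observed
# at the same time last week. The tolerance is illustrative.

def rps_anomaly(current, same_time_last_week, tolerance=0.5):
    """Alert if `current` deviates more than `tolerance` (as a fraction)
    from the historical rate, in either direction."""
    if same_time_last_week == 0:
        return current > 0  # anything is anomalous vs. a silent baseline
    ratio = current / same_time_last_week
    return ratio < (1 - tolerance) or ratio > (1 + tolerance)

print(rps_anomaly(40, 500))    # dramatic drop  -> True (requests not arriving?)
print(rps_anomaly(1800, 500))  # dramatic spike -> True (retry storm?)
print(rps_anomaly(520, 500))   # normal demand  -> False
```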
So, in summary: if you have clear resource constraints that can be expressed as a rule (an acceptable threshold of CPU in use, a query cache hit rate, etc.), use static thresholds. Anomaly detection can usefully supplement them by detecting changes in the opposite direction to the static tests (detecting drops if the static thresholds test for increases, or vice versa). And anomaly detection is well suited to cases where you wish to monitor the behavior of a metric but there are no natural constraints.
And in all cases, anomaly detection should be topology aware, and not trigger anomaly alerts that are a result of a static threshold being triggered.
Ultimately, you want your monitoring software to automate many of the operators' tasks, so they can focus on higher-value work. Whether by embedding knowledge of resources, automating anomaly detection, identifying which metrics on a service are not anomalous during an incident (to eliminate having to investigate them), automating remediation actions, or other methods - this is what LogicMonitor is working to deliver to our customers.