FEATURE AVAILABILITY: Root cause analysis is available to users of LogicMonitor Enterprise.
Root Cause Analysis (RCA) uses the relationships among your monitored resources, as automatically discovered by LogicMonitor’s topology mapping AIOps feature, to determine the root cause of an incident that is impacting dependent resources.
When enabled for your alerting operations, root cause analysis highlights the originating cause of the incident, while optionally suppressing notification routing for those alerts determined to be dependent on the originating alert. This can significantly reduce alert noise for events in which a parent resource has gone down or become unreachable, thus causing dependent resources to go into alert as well.
During an alert storm, many alerts relating to the same originating incident are raised in LogicMonitor and a slew of notifications may be sent out based on alert rule settings for each metric threshold that is exceeded. This can result in a flood of notifications for resources affected by the incident without a clear indication of which resources are the root cause of the incident.
Enabling root cause analysis addresses this issue through the following process:
In order for root cause analysis to take place, the following requirements must be met.
To trigger root cause analysis, a resource must be unreachable or down, as determined by an alert of any severity level being raised on the PingLossPercent or idleInterval datapoints.
Note: Root cause analysis is currently limited to resources and does not extend to instances. For example, a down interface on which other devices are dependent for connectivity will not trigger root cause analysis.
Root cause analysis relies on the relationships between monitored resources. These relationships are automatically discovered via LogicMonitor’s topology mapping feature. To ensure this feature is enabled and up to date, see Topology Mapping Overview.
Root Cause Analysis has the following performance limits:
Every set of root cause analysis configurations you create is associated with one or more entry points. As discussed in detail in the Dependency Chain Entry Point section of this support article, an entry point is the resource at which the dependency chain begins (i.e. the highest-level resource in the resulting dependency chain hierarchy). All resources connected to the entry-point resource become part of the dependency chain and are therefore subject to root cause analysis if any device upstream or downstream in the dependency chain becomes unreachable.
The ability to configure different settings for different entry points provides considerable flexibility. For example, MSPs may have some clients that permit a notification delay but others that don’t due to strict SLAs. Or, an enterprise may want to route dependent alerts for some resources, but not for others.
To configure root cause analysis, select Settings | Alert Settings | Root Cause Analysis | Add. A dialog appears that allows you to configure various settings. Each setting is discussed next.
In the Name field, enter a descriptive name for the configuration.
In the Priority field, enter a numeric priority value. A value of “1” represents the highest priority. In the event that multiple configurations exist for the same resource, this field ensures that the highest priority configurations are used. If you are diligent about ensuring that your entry-point selections represent unique resources, then priority should never come into play. The value in this field will only be used if coverage for an entry point is duplicated in another configuration.
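The priority rule can be sketched in a few lines of Python. This is an illustration only, not LogicMonitor's implementation: the `RcaConfig` class, the `effective_config` helper, and the configuration and resource names are all hypothetical. The only behavior taken from this article is that a value of 1 is the highest priority and that priority matters only when two configurations cover the same entry point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RcaConfig:
    """Hypothetical stand-in for one saved root cause analysis configuration."""
    name: str
    priority: int            # 1 represents the highest priority
    entry_points: frozenset  # names of the resources selected as entry points


def effective_config(resource, configs):
    """Return the configuration that applies to `resource`, or None if none covers it."""
    candidates = [c for c in configs if resource in c.entry_points]
    # The lowest numeric value wins because 1 is the highest priority.
    return min(candidates, key=lambda c: c.priority, default=None)


# Hypothetical configurations: "core-fw-01" is covered twice, "collector-a" once.
configs = [
    RcaConfig("strict-sla-clients", 1, frozenset({"core-fw-01"})),
    RcaConfig("catch-all", 5, frozenset({"core-fw-01", "collector-a"})),
]
```

With these sample values, `effective_config("core-fw-01", configs)` selects `strict-sla-clients`, while `collector-a` falls through to `catch-all` because no other configuration covers it. How ties between equal priority values are resolved is not described in this article, so the sketch makes no claim about that case.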
In the Description field, optionally enter a description for the configuration.
Under the Select Entry-Point for Topology Based Dependency configuration area, click the plus sign (+) icon to add one or more groups and/or individual resources that will serve as entry point(s) for this configuration. For either the Group or Resource field, you can enter a wildcard (*) to indicate all groups or all resources. Only one of these fields can contain a wildcard per entry point configuration. For example, selecting a resource group but leaving resources wildcarded will return all resources in the selected group as entry points.
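As a rough model of how the Group and Resource fields combine, the following Python sketch resolves a selection against a small inventory. Everything here is assumed for illustration: the `EntryPointSelector` class, the `resolve` function, and the inventory layout are hypothetical and do not reflect a LogicMonitor API. The sketch encodes only the two rules stated above: a wildcard matches all groups or all resources, and only one of the two fields may be a wildcard.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass(frozen=True)
class EntryPointSelector:
    """Hypothetical model of one Group/Resource pair in the dialog."""
    group: str
    resource: str

    def __post_init__(self):
        # Only one of the two fields can contain a wildcard.
        if self.group == WILDCARD and self.resource == WILDCARD:
            raise ValueError("only one of group/resource may be a wildcard")


def resolve(selector, inventory):
    """Return the set of resources selected as entry points.

    `inventory` maps a group name to the list of resource names in that group.
    """
    matches = set()
    for group, resources in inventory.items():
        if selector.group not in (WILDCARD, group):
            continue
        for name in resources:
            if selector.resource in (WILDCARD, name):
                matches.add(name)
    return matches


# Hypothetical inventory used only for this example.
inventory = {"Network": ["sw1", "sw2"], "Servers": ["web1"]}
```

Selecting the `Network` group with the resource wildcarded, `resolve(EntryPointSelector("Network", "*"), inventory)`, yields both switches as entry points, which mirrors the example in the paragraph above.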
The selection of an entry-point resource uses the topology relationships for this resource to establish a parent/child dependency hierarchy (i.e. dependency chain) for which root cause analysis is enabled. If any resource in this dependency chain goes down, it will trigger root cause analysis for all alerts arising from members of the dependency chain.
Once the configuration is saved, all nodes dependent on the entry point, as well as their degrees of separation from the entry point, are recorded in the Audit Log, as discussed in the Root Cause Detail Captured by Audit Log section of this support article.
Note: The ability to configure a single set of root cause analysis settings for multiple entry points means that you could conceivably cover your entire network with just one configuration.
When possible, you should select the Collector host as the entry point. As the location from which monitoring initiates, it is the most accurate entry point. However, if your Collector host is not in monitoring or if its path to network devices is not discovered via topology mapping, then the closest device to the Collector host (i.e. the device that serves as the proxy or gateway into the network for Collector access) should be selected.
In a typical environment, you will want to create one entry point per Collector. The following diagrams offer guidelines for selecting these entry points.
When the Collector host is monitored, and its path to network devices is discoverable via topology, it should be the entry point, regardless of whether it resides inside (illustrated in top example) or outside (illustrated in bottom example) the network.
If the Collector host is not monitored, then the device closest to the Collector host, typically a switch/router if the host is inside the network (illustrated in top example) or a firewall if the host is outside the network (illustrated in bottom example), should be selected as the entry point.
If the Collector host is monitored, but its path to network devices is not discoverable via topology, then the device closest to the Collector host that is both monitored and discovered should be selected as the entry point.
Note: To verify that topology relationships are appropriately discovered for the entry point you intend to use, open the entry point resource from the Resources page and view its Maps tab. Select “Dynamic” from the Context field’s dropdown menu to show connections with multiple degrees of separation. For more information on the Maps tab, see Maps Tab.
The selection of an entry-point resource establishes a dependency hierarchy in which every connected resource is dependent on the entry point as well as on any other connected resource that is closer than it is to the entry point. This means that the triggering of root cause analysis is not reliant on just the entry point becoming unreachable and going into alert. Any node in the dependency chain that is unreachable and goes into alert (as determined by the PingLossPercent or idleInterval datapoints) will trigger root cause analysis.
In this example dependency chain, node 1 is the entry point and nodes 2-8 are all dependent on node 1. But other dependencies are present as well. For example, if node 2 goes down and, as a result, nodes 4, 6 and 7 become unreachable, root cause analysis would consider node 2 to be the originating cause of the alerts on nodes 4, 6 and 7. Node 2 would also be considered the direct cause of the alert on node 4. And node 4 would be considered the direct cause of the alerts on nodes 6 and 7. As discussed in the Alert Details Unique to Root Cause Analysis section of this support article, originating and direct cause resource(s) are displayed for every alert that is deemed to be dependent.
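The reasoning in this example can be traced with a short Python sketch. It is a simplified model, not LogicMonitor's algorithm. The text implies only the links 1-2, 2-4, 4-6 and 4-7, so the placement of nodes 3, 5 and 8 in `LINKS` is an assumption, as are the function names. The sketch also assumes each node has a single parent, whereas a real topology can give a resource more than one direct cause.

```python
from collections import deque

# Assumed topology: only 1-2, 2-4, 4-6 and 4-7 are implied by the example;
# the positions of nodes 3, 5 and 8 are made up for illustration.
LINKS = {1: [2, 3], 2: [4], 3: [5], 4: [6, 7], 5: [8], 6: [], 7: [], 8: []}


def build_chain(entry_point, links):
    """Walk outward from the entry point and record each node's parent."""
    parent = {entry_point: None}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def causes(node, parent, down):
    """Return (originating_cause, direct_cause) for an alerting node.

    Both values are None when no upstream neighbor is down, meaning the
    node is itself the originating cause.
    """
    direct = parent[node] if parent[node] in down else None
    originating = None
    current = parent[node]
    # Climb toward the entry point for as long as the upstream node is down.
    while current is not None and current in down:
        originating = current
        current = parent[current]
    return originating, direct


parent = build_chain(1, LINKS)
down = {2, 4, 6, 7}  # node 2 fails; nodes 4, 6 and 7 become unreachable
```

Under these assumptions, `causes(6, parent, down)` returns `(2, 4)`: node 2 is the originating cause and node 4 is the direct cause, matching the description above. `causes(2, parent, down)` returns `(None, None)` because nothing upstream of node 2 is down.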
Use the following options to suppress notification routing for dependent alerts during a root cause analysis incident:
Most likely, you’ll want to check both options to suppress all dependent alert routing and release only those alerts determined to represent originating cause. However, for more nuanced control, you can suppress routing for only reachability alerts or for only non-reachability alerts. This may prove helpful in cases where different teams are responsible for addressing different types of alerts.
Note: If you want to verify the accuracy of originating and dependent alert identification before taking the potentially risky step of suppressing alert notifications, leave both of these options unchecked to begin with. Then, use the root cause detail that is provided in the alert, as discussed in the Alert Details Unique to Root Cause Analysis section of this support article, to ensure that the outcome of root cause analysis is as expected.
By default, the Enable Alert Routing Delay option is checked. This delays alert notification routing for all resources that are part of the dependency chain when an alert triggers root cause analysis, allowing time for the incident to fully manifest itself and for the algorithm to determine originating cause and dependent alerts. As discussed in the Viewing Dependent Alerts section of this support article, an alert’s routing stage will indicate “Delayed” while root cause conditions are being evaluated.
If routing delay is enabled, the Max Alert Routing Delay Time field is available. This field determines the maximum amount of time alert routing can be delayed due to root cause analysis.
If evaluation is still occurring when the maximum time limit is reached (or if the Enable Alert Routing Delay option is unchecked), notifications will be routed with whatever root cause data is available at that time. If no delay is permitted, this will likely mean that no root cause data is included in the notifications. In both cases, however, the alerts will continue to evolve as the incident manifests. Additional information is added to the alerts and, for those alerts determined to be dependent, additional escalation chain stages are suppressed.
Note: Reachability or down alerts for entry point resources are always routed immediately, regardless of settings. This is because an entry-point resource will always be the originating cause, making it an actionable alert and cause for immediate notification.
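The interaction between the suppression options, the routing delay, and the entry-point exception can be summarized as a small decision function. The Python below is a sketch under stated assumptions: the `Settings`, `Alert`, and `Routing` names, the field names, and the 300-second value are hypothetical and are not LogicMonitor identifiers or documented defaults. The order of the checks follows the text above: entry-point reachability alerts route immediately, other alerts are held while evaluation is in progress and the maximum delay has not elapsed, and dependent alerts are then suppressed according to the two options.

```python
from dataclasses import dataclass
from enum import Enum


class Routing(Enum):
    ROUTE = "route"
    DELAY = "delay"
    SUPPRESS = "suppress"


@dataclass(frozen=True)
class Settings:
    """Hypothetical mirror of the dialog options described above."""
    suppress_dependent_reachability: bool = True
    suppress_dependent_non_reachability: bool = True
    delay_enabled: bool = True
    max_delay_s: int = 300  # illustrative value, not a documented default


@dataclass(frozen=True)
class Alert:
    is_reachability: bool
    on_entry_point: bool = False
    is_dependent: bool = False  # identified as dependent by root cause analysis
    evaluating: bool = False    # root cause conditions still being evaluated
    age_s: int = 0              # seconds since the alert was raised


def decide(alert, settings):
    # Reachability or down alerts on an entry point are always routed immediately.
    if alert.on_entry_point and alert.is_reachability:
        return Routing.ROUTE
    # Hold the notification while evaluation runs, up to the maximum delay.
    if settings.delay_enabled and alert.evaluating and alert.age_s < settings.max_delay_s:
        return Routing.DELAY
    # Dependent alerts are suppressed only if the matching option is checked.
    if alert.is_dependent:
        if alert.is_reachability and settings.suppress_dependent_reachability:
            return Routing.SUPPRESS
        if not alert.is_reachability and settings.suppress_dependent_non_reachability:
            return Routing.SUPPRESS
    return Routing.ROUTE
```

For example, `decide(Alert(is_reachability=True, evaluating=True, age_s=10), Settings())` returns `Routing.DELAY`, while the same alert with `Settings(delay_enabled=False)` routes immediately with whatever root cause data is available.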
Alerts that undergo root cause analysis display as usual in the LogicMonitor interface—even those whose notifications have been suppressed as a result of being identified as dependent alerts. As discussed in the following sections, the Alerts page offers additional information and display options for alerts that undergo root cause analysis.
The Alerts page offers three columns and two filters unique to the root cause analysis feature.
The Routing State column displays the current state of the alert notification. There are three possible routing states:
The Dependency Role column displays the role of the alert in the incident. There are three possible dependency roles:
The Dependent Alerts column displays the number of alerts, if any, that are dependent on the alert. If the alert is an originating alert, this number encompasses all alerts from all resources in the dependency chain. If the alert is not an originating alert, it can still have dependent alerts, because any alerts that represent resources downstream in the dependency chain are considered to be dependent on the current alert.
LogicMonitor offers two filters based on the data in the Routing State and Dependency Role columns. The criteria for these filters line up with the values available for each column.
When viewing the details of an alert with dependent alerts (i.e. an originating cause alert or direct cause alert), a Dependencies tab is additionally available. This tab lists all of the alert’s dependent alerts (i.e. all alerts for resources downstream in the dependency chain). These dependent alerts can be acknowledged or placed into scheduled downtime (SDT) en masse using the Acknowledge all and SDT all buttons.
The alert details for an alert that is part of a root cause analysis incident carry additional details related to root cause—these details are present in both the LogicMonitor UI and alert notifications (if routed).
The alert type, alert role, and dependent alert count carry the same details as the columns described in a previous section. If the alert is not the originating alert, then the originating cause and direct cause resource names are also provided. Direct cause resources are the immediate neighbors on which the given resource directly depends, that is, the neighbors one step closer to the entry point.
Root cause details are also available as tokens. For more information on using tokens in custom alert notification messages, see Tokens Available in LogicModule Alert Messages.
Represents all dependency details that accompany alerts that have undergone root cause analysis
Approximately five minutes after saving a root cause analysis configuration, the following information is captured in LogicMonitor’s audit logs for the “System:AlertDependency” user:
Nodes In Dependency(type:name(id):status:level:waitingStartTime:EntryPoint):
For more information on using the audit logs, see About Audit Logs.
Using the entry point and dependent nodes detail captured by the audit logs (as discussed in the previous section), you may want to consider building out a topology map that represents the entry point(s) and dependent nodes of your root cause analysis configuration. Because topology maps visually show alert status, this can be extremely helpful when evaluating an incident at a glance. For more information on creating topology maps, see Mapping Page.
Like many other features in the LogicMonitor platform, root cause analysis supports role-based access control. By default, only users assigned the default administrator or manager roles will be able to view or manage root cause analysis configurations. However, as discussed in Roles, roles can be created or updated to allow for access to these configurations.