Cluster Alerts

Sometimes knowing the overall state of a collection of devices matters more than monitoring and alerting on every single device. If you group your devices strategically, you can use cluster alerts to monitor and alert on a datapoint across multiple devices in a group. This is useful when a pool of devices serves an application or performs a task: you may not be too concerned with an issue that affects one device, but you likely want to know immediately if the pool as a whole is at risk of not being able to serve its purpose. For example, you may want to configure a cluster alert to trigger when more than 5 batch servers have a CPU that is over 80% busy.

Note that cluster alerts are triggered by the presence of multiple individual alerts within a group of devices (e.g. more than 4 devices have error alerts for the CPU Busy Percent datapoint), so you'll still need to keep your individual device thresholds tuned.
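The counting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the DatapointAlert type, the devices_in_alert helper, the device names, and the CPUBusyPercent identifier are all hypothetical.

```python
from dataclasses import dataclass

# Severity order used for comparisons; a higher number is more severe.
SEVERITY = {"warning": 1, "error": 2, "critical": 3}


@dataclass(frozen=True)
class DatapointAlert:
    """One active individual alert (hypothetical shape, for illustration)."""
    device: str
    datapoint: str
    level: str  # "warning", "error", or "critical"


def devices_in_alert(alerts, datapoint, min_level):
    """Return the devices with an active alert on `datapoint` at or above `min_level`."""
    floor = SEVERITY[min_level]
    return {
        a.device
        for a in alerts
        if a.datapoint == datapoint and SEVERITY[a.level] >= floor
    }


alerts = [
    DatapointAlert("batch01", "CPUBusyPercent", "error"),
    DatapointAlert("batch02", "CPUBusyPercent", "critical"),
    DatapointAlert("batch03", "CPUBusyPercent", "warning"),  # below the error floor
    DatapointAlert("batch04", "MemoryUsed", "error"),        # different datapoint
]

matching = devices_in_alert(alerts, "CPUBusyPercent", "error")
# The cluster condition "more than 4 devices have error alerts" is not met here.
cluster_condition_met = len(matching) > 4
```

Because the count is built from individual alerts, a device whose own threshold is set too loosely never shows up in the set, which is why the individual thresholds still matter.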

Adding a cluster alert

Navigate to the Cluster Alerts Tab for a device group and select the add button from the table header.

A cluster alert is based on the performance of a specific datapoint across devices in a group.  To configure an alert, you will set:

  • The level of the datapoint alert (warning / error / critical) to be included. This is determined by the datapoint threshold.
  • How many devices or instances must meet the condition to generate an alert (more than, exactly, or less than a given value). This can be a total or a percentage.
  • The datasource to be evaluated
  • The datapoint to be evaluated
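As a rough model of these four settings, the sketch below stores them in a small Python structure and evaluates the "more than / exactly / less than" condition against either a count or a percentage. The ClusterAlertConfig class, its field names, and the choice to treat an empty group as never meeting a percentage condition are assumptions made for illustration, not the product's data model.

```python
import operator
from dataclasses import dataclass

# The three comparisons offered in the configuration dialog.
COMPARISONS = {
    "more than": operator.gt,
    "exactly": operator.eq,
    "less than": operator.lt,
}


@dataclass(frozen=True)
class ClusterAlertConfig:
    """Hypothetical container for the settings listed above."""
    datasource: str
    datapoint: str
    min_level: str       # level of datapoint alert to include
    comparison: str      # key of COMPARISONS
    threshold: float     # a count, or a percentage if as_percent is True
    as_percent: bool = False

    def is_met(self, matching: int, total: int) -> bool:
        """Check the condition given how many of `total` devices are in alert."""
        if self.as_percent:
            if total == 0:
                return False  # assumption: an empty group never alerts
            value = 100.0 * matching / total
        else:
            value = matching
        return COMPARISONS[self.comparison](value, self.threshold)


by_count = ClusterAlertConfig("CPU", "CPUBusyPercent", "error", "more than", 5)
by_percent = ClusterAlertConfig(
    "CPU", "CPUBusyPercent", "error", "more than", 25, as_percent=True
)
```

With 20 devices in the group, by_count.is_met(6, 20) and by_percent.is_met(6, 20) are both true, while 5 matching devices satisfies neither (5 is not more than 5, and 25% is not more than 25%).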

Example cluster alert configuration

In the above example, a cluster alert will be generated if more than 5 devices in the group have an active warning alert for the CPU datapoint on instances of the CPU Cores- datasource. An error alert will be triggered if 7 devices are in alert, and a critical alert will be generated if 9 of those devices are in alert.

Note that cluster alerts behave in the same manner as our datasource alerts: you need to set distinct warning, error, and critical thresholds in order to receive alerts with these respective statuses. For instance, if you set a cluster alert to "warning" when three CPU instances are in error, the cluster alert status you receive will be "warning" even though the individual datapoints are in "error".
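The point that the cluster alert's status comes from its own thresholds, not from the severity of the underlying datapoint alerts, can be sketched as follows. The cluster_status function is hypothetical, and the thresholds are expressed as minimum device counts: "more than 5" becomes 6, and the 7 and 9 from the example are assumed to be minimums as well.

```python
def cluster_status(count, min_counts):
    """Return the most severe cluster level whose minimum count is reached.

    `count` is the number of devices (or instances) currently contributing
    to the cluster alert; `min_counts` maps a cluster level to the minimum
    count that triggers it. Levels without an entry can never be reported.
    """
    for level in ("critical", "error", "warning"):  # most severe first
        needed = min_counts.get(level)
        if needed is not None and count >= needed:
            return level
    return None  # no cluster alert


# Thresholds from the example configuration, as minimum counts (assumed).
example = {"warning": 6, "error": 7, "critical": 9}

# Only a warning threshold is configured, at three instances.
warning_only = {"warning": 3}
```

Here cluster_status(8, example) gives "error", and cluster_status(3, warning_only) gives "warning" no matter how severe the three contributing datapoint alerts are, because no error or critical cluster threshold exists to be met.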

Managing cluster alerts

An overview of all active cluster alerts is available in the Cluster Alerts Tab. Alerting and configuration can be managed from the table.

Enable

The enable checkbox turns alert generation for the cluster alert on or off. If left unchecked, data will still be collected but no alerts will be generated for the cluster alert.

Datasource name

Instances of this datasource will be evaluated in the cluster alert.

Trigger

The minimum alert level for a datapoint alert to contribute towards the cluster alert threshold.

Threshold

The number or percent of instances or devices that must match the cluster criteria for an alert to be generated.

Enable datapoint alerts

If checked, individual datapoint alerts will be generated, so when the cluster alert threshold is met you will see both the individual datapoint alerts and the cluster alert. If unchecked, all matching datapoint alerts will still be logged, but they will not generate notifications.
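The effect of this checkbox can be pictured with a very small sketch: every matching datapoint alert is logged, and the flag only decides whether a notification is also sent. The handle_datapoint_alert function and its log and notify callbacks are hypothetical stand-ins, not part of any real API.

```python
def handle_datapoint_alert(alert, enable_datapoint_alerts, log, notify):
    """Log every matching datapoint alert; notify only if the option is on."""
    log(alert)  # always recorded, checked or not
    if enable_datapoint_alerts:
        notify(alert)


logged, notified = [], []

# Option unchecked: the alert is logged but no notification goes out.
handle_datapoint_alert("web01 CPU error", False, logged.append, notified.append)

# Option checked: the alert is logged and a notification is generated.
handle_datapoint_alert("web02 CPU error", True, logged.append, notified.append)
```

After both calls, logged holds both alerts while notified holds only the second one.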

Manage

Selecting manage will display the cluster alert configuration dialog.

Viewing and routing cluster alerts

Cluster alerts will appear in the group Alerts Tab, as well as on the Alerts Page of your account. The device name will always be "cluster". This information can also be used for alert routing, filtering, and reporting. To match a cluster alert, you'll need an alert rule configured to match critical alerts for the group the cluster alert is configured on. If you'd like to route cluster alerts differently, you can optionally set an alert rule to match the device "cluster".


The above configuration will result in all cluster alerts being routed to the on-call engineer and all other Web Servers* group alerts being routed to a member of the servers team.
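The routing outcome described above can be sketched as an ordered list of rules in which the first rule matching both group and device decides the recipient. First-match-wins ordering, glob matching via Python's fnmatch module, the AlertRule type, the route function, and the recipient labels are all assumptions for illustration; alert level matching is left out to keep the sketch short.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule: glob patterns for group and device."""
    name: str
    group_pattern: str
    device_pattern: str
    recipient: str


def route(rules, group, device):
    """Return the recipient of the first rule matching group and device."""
    for rule in rules:
        if fnmatchcase(group, rule.group_pattern) and fnmatchcase(
            device, rule.device_pattern
        ):
            return rule.recipient
    return None  # no rule matched, so nothing is routed


rules = [
    # Cluster alerts always carry the device name "cluster".
    AlertRule("cluster alerts", "Web Servers*", "cluster", "on-call engineer"),
    # Every other alert from the same groups.
    AlertRule("web server alerts", "Web Servers*", "*", "servers team"),
]

cluster_recipient = route(rules, "Web Servers", "cluster")
device_recipient = route(rules, "Web Servers", "web01")
```

The more specific "cluster" rule has to come first in this model; if the catch-all rule were listed above it, cluster alerts would go to the servers team as well.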