Collector Failover and Failback
Every Collector (that is not a member of an Auto-Balanced Collector Group) should have a failover Collector assigned to it. Failover Collectors eliminate the Collector as a single point of failure, ensuring monitoring continues should a Collector go down.
A Collector is declared down when LogicMonitor's servers have not heard from it for three minutes. When that happens, all devices monitored by the down Collector automatically fail over to the failover Collector, assuming one is designated. Once the down Collector comes back online, failback can take place automatically (if automatic failback is enabled for the Collector) or manually.
Note: In addition to supporting the one-to-one Collector failover/failback method discussed throughout this support article, LogicMonitor also supports failover/failback within the context of Auto-Balanced Collector Groups (ABCGs). The Collectors in ABCGs share device load and support dynamic failover. For more information on ABCGs, see Auto-Balanced Collector Groups.
Designating a Failover Collector
Because the failover Collector will take over all monitoring for the down Collector, it's important to ensure that the two Collectors (the original preferred Collector and the failover Collector) are well matched. In other words, the failover Collector must have the same data collection abilities and configurations as the original Collector. For example, both Collectors should be listed as exceptions for any firewalls restricting access to the monitored hosts; both Collectors must be permitted in any specific snmpd.conf, ntp.conf, or other configuration settings that may limit monitoring access; and both Collectors must be on the same operating system (e.g. Linux or Windows).
For this reason, LogicMonitor recommends that you configure failover Collectors in pairs (i.e. Collector A fails over to Collector B and Collector B fails over to Collector A). As this recommendation implies, failover Collectors can also be assigned their own sets of monitoring tasks.
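The pairing recommendation above can be checked mechanically. The following is a minimal sketch, assuming you keep your own inventory of Collectors; the `Collector` class, its fields, and `pairing_problems` are hypothetical names for illustration and are not part of LogicMonitor's API. It flags Collectors with no failover, mismatched operating systems, and one-way assignments. Collectors in an Auto-Balanced Collector Group would need to be left out of the input, since they do not use one-to-one failover.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Collector:
    # Hypothetical inventory record; not LogicMonitor's data model.
    name: str
    os: str                          # e.g. "linux" or "windows"
    failover: Optional[str] = None   # name of the designated failover Collector


def pairing_problems(collectors: list[Collector]) -> list[str]:
    """Return human-readable problems with the failover assignments."""
    by_name = {c.name: c for c in collectors}
    problems = []
    for c in collectors:
        if c.failover is None:
            problems.append(f"{c.name}: no failover Collector designated")
            continue
        partner = by_name.get(c.failover)
        if partner is None:
            problems.append(f"{c.name}: failover {c.failover} is unknown")
            continue
        if partner.os != c.os:
            problems.append(f"{c.name}: OS differs from failover {partner.name}")
        if partner.failover != c.name:
            problems.append(f"{c.name}: {partner.name} does not fail over to it in return")
    return problems


if __name__ == "__main__":
    inventory = [
        Collector("A", "linux", failover="B"),
        Collector("B", "linux", failover="A"),
        Collector("C", "windows", failover="A"),
    ]
    for problem in pairing_problems(inventory):
        print(problem)
```

A well-matched pair produces no problems; the firewall and snmpd.conf/ntp.conf requirements described above still have to be verified separately, since they live outside this simple model.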
To designate a failover Collector:
- Install/identify a Collector residing on a different server that is capable of monitoring the same set of devices as the Collector for which you are designating a failover Collector.
- From the original Collector's Manage Collector dialog (navigate to Settings | Collectors | Manage), select the failover Collector from the Failover Collector field's dropdown menu.
- Once a failover Collector is designated, two options display:
  - Resume using Preferred Collector when it becomes available again. If left checked, automatic failback to the Collector is enabled, as discussed in the Automatic Failback to Original Collector section of this support article. If unchecked, failback will need to be manually initiated, as discussed in the Manual Failback to Original Collector section of this support article.
  - Exclude <resource name> from failover actions. If left checked (recommended), the Collector device is excluded from failover. Because Collectors monitor themselves, this is most likely desirable as it will preserve Collector metrics.
If a Collector is declared down, all devices monitored by the down Collector will automatically fail over to the failover Collector, assuming one is designated.
Note: A Collector is declared down and thus enters failover when LogicMonitor's servers have not heard from it for three minutes. (The time window for initiating failover is governed by multiple, complex processes, and there may be slight differences in timing for different cases.)
Note: You will be notified of a Collector failover event even if the Collector is in SDT.
Once the down Collector comes back online, failback can take place automatically (if automatic failback is enabled for the Collector) or manually.
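The failover rule described in this section can be illustrated with a small sketch. It only models the documented behavior (three minutes of silence, and failover only when a failover Collector is designated); the function and constant names are hypothetical, and LogicMonitor's actual timing may differ slightly, as noted above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Three minutes comes from the article; real-world timing may vary slightly.
DOWN_AFTER = timedelta(minutes=3)


def active_collector(last_heard: datetime, now: datetime,
                     preferred: str, failover: Optional[str]) -> str:
    """Return the name of the Collector that should be monitoring the devices."""
    is_down = now - last_heard >= DOWN_AFTER
    if is_down and failover is not None:
        return failover
    # Either the preferred Collector is still up, or there is nowhere to fail over to.
    return preferred


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    print(active_collector(now - timedelta(minutes=1), now, "A", "B"))
    print(active_collector(now - timedelta(minutes=5), now, "A", "B"))
```

Note that when no failover Collector is designated, the devices stay assigned to the down Collector and go unmonitored, which is the single point of failure this article recommends avoiding.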
Automatic Failback to Original Collector
To enable automatic device failback to the original Collector, navigate to the Collector's Manage Collector dialog (Settings | Collectors | Manage) and check the Resume using Preferred Collector when it becomes available again option. As discussed in the Designating a Failover Collector section of this support article, this option is only available if a failover Collector is designated in the Failover Collector field.
Note: LogicMonitor will wait eight minutes after a Collector has resumed functioning to initiate automatic failback to it. (The time window for initiating failback is governed by multiple, complex processes, and there may be slight differences in timing for different cases.)
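As a rough illustration of the eight-minute wait, the sketch below decides whether automatic failback would be due. The names are hypothetical and the logic reflects only what this article states; it is not LogicMonitor's implementation, and actual timing may differ slightly.

```python
from datetime import datetime, timedelta
from typing import Optional

# Eight minutes comes from the article; real-world timing may vary slightly.
FAILBACK_DELAY = timedelta(minutes=8)


def should_fail_back(resumed_at: Optional[datetime], now: datetime,
                     auto_failback_enabled: bool) -> bool:
    """Return True once automatic failback to the original Collector is due.

    resumed_at is when the original Collector resumed functioning,
    or None if it is still down.
    """
    if not auto_failback_enabled or resumed_at is None:
        return False
    return now - resumed_at >= FAILBACK_DELAY


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    print(should_fail_back(now - timedelta(minutes=5), now, True))
    print(should_fail_back(now - timedelta(minutes=9), now, True))
```

With automatic failback disabled the function never returns True, which corresponds to the manual failback case described in the next section.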
Manual Failback to Original Collector
If you choose not to enable automatic failback for a Collector, then you'll need to manually reassign devices back to the original Collector once it is back online. This can be done by navigating to Settings | Collectors | Resources from either the original Collector that went down or the failover Collector.
When manually failing back from the original Collector's resources list, you have the option to assign all devices back to the original Collector, or permanently assign them to the failover Collector.
When manually failing back from the failover Collector, you have a little more flexibility: you can fail back all or a subset of devices to the original preferred Collector, or assign all or a subset of devices to any new preferred Collector. You can assign a Collector's devices to a new preferred Collector at any time; this is not limited to the aftermath of a failover event.
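To make the difference between the two manual options concrete, here is a minimal sketch that models device assignments as a plain dictionary. The `reassign` helper and the device and Collector names are hypothetical; in practice these changes are made from the Resources list in the LogicMonitor UI as described above.

```python
from __future__ import annotations


def reassign(assignments: dict[str, str], devices: list[str],
             new_collector: str) -> dict[str, str]:
    """Return a copy of assignments with the given devices moved to new_collector."""
    unknown = [d for d in devices if d not in assignments]
    if unknown:
        raise KeyError(f"unknown devices: {unknown}")
    updated = dict(assignments)  # leave the caller's mapping untouched
    for device in devices:
        updated[device] = new_collector
    return updated


if __name__ == "__main__":
    # After failover, every device is being monitored by failover Collector "B".
    after_failover = {"web01": "B", "db01": "B", "sw01": "B"}

    # Fail back a subset to the original preferred Collector "A" ...
    step1 = reassign(after_failover, ["web01", "db01"], "A")
    # ... and give the remaining device a new preferred Collector "C".
    step2 = reassign(step1, ["sw01"], "C")
    print(step2)
```

Failing back from the original Collector's resources list corresponds to calling this once with every device; failing back from the failover Collector allows the per-subset calls shown in the usage example.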