Help! My Collector is Down: Troubleshoot in 6 Steps

At the core of the LogicMonitor solution is the LogicMonitor Collector. The Collector is a small Java app installed on servers in your environment that collects monitored data from your various devices and then sends that data to LogicMonitor for retention and display. The Collector is what connects your environment to the cloud and allows you access from anywhere. However, Collectors can sometimes go down, potentially leading to gaps in monitoring.

This is an issue you want to resolve quickly, but you may not know how. While the LogicMonitor support team is always here to help, you may be able to resolve the problem yourself far faster with a good troubleshooting strategy and an understanding of what the LogicMonitor Collector needs to function. This guide is designed to give you a better understanding of how the Collector works and how to use that knowledge to get a down Collector back up and running. It also covers some best practices to maximize the resilience of your Collectors.

To understand how to get a Collector back up and running, it’s important to understand how the Collector is declared up or down. Collector down alerts are created when LogicMonitor’s cloud servers have not received data from a Collector for more than 5 minutes. A few common root causes account for the majority of Collector down occurrences, and the steps ahead walk you through how to find and remedy each of them.
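The five-minute rule can be pictured with a short Python sketch. This is only an illustration of the threshold described above, not LogicMonitor's actual implementation; the function name and the idea of a single "last data received" timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the text: no data received for more than 5 minutes.
DOWN_THRESHOLD = timedelta(minutes=5)


def is_collector_down(last_data_received: datetime, now: datetime) -> bool:
    """Return True when the gap since the last received data exceeds the threshold.

    Hypothetical helper for illustration; the real check runs on
    LogicMonitor's cloud servers.
    """
    return now - last_data_received > DOWN_THRESHOLD


# Example usage: a Collector last heard from six minutes ago counts as down.
# now = datetime.now(timezone.utc)
# is_collector_down(now - timedelta(minutes=6), now)
```

The practical takeaway is that anything that stops data from reaching LogicMonitor for that long, whether a stopped service, a blocked connection, or interfering software, produces the same alert.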

Check the Services

First, check that the host machine where the Collector is installed is functional, and then ensure both the LogicMonitor Collector and Watchdog services are up and running. Normally the Watchdog service will restart the Collector service if it stops, but there are situations where that may not happen, so it’s important to verify that both services are running on the host machine. If the Collector and Watchdog services are unable to start, there are a few common things you can check.
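If you prefer to script the service check, the following Python sketch shows one way it might look. The service names in SERVICES are assumptions and should be confirmed on your own Collector host. The sketch relies on the standard Windows `sc query` command and, for hosts that use systemd, `systemctl is-active`.

```python
import platform
import subprocess

# Assumed service names; confirm the actual names on your Collector host.
SERVICES = ["logicmonitor-agent", "logicmonitor-watchdog"]


def parse_sc_state(sc_output: str) -> str:
    """Extract the state word (for example RUNNING) from `sc query` output."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            # Typical line: "STATE              : 4  RUNNING"
            parts = line.split(":", 1)[1].split()
            return parts[1] if len(parts) > 1 else "UNKNOWN"
    return "UNKNOWN"


def is_service_running(name: str) -> bool:
    """Ask the operating system whether the named service is running."""
    if platform.system() == "Windows":
        result = subprocess.run(
            ["sc", "query", name], capture_output=True, text=True
        )
        return parse_sc_state(result.stdout) == "RUNNING"
    # Assumes a systemd-based host when not on Windows.
    result = subprocess.run(
        ["systemctl", "is-active", name], capture_output=True, text=True
    )
    return result.stdout.strip() == "active"


# Example usage on the Collector host:
# for service in SERVICES:
#     print(service, "running" if is_service_running(service) else "NOT running")
```

Checking both services matters because a running Watchdog with a stopped Collector, or the reverse, points to different follow-up steps.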

Check the Credentials 

A common issue for Windows Collectors happens when the credentials for the Collector and Watchdog services have insufficient permissions. The supported Windows credential configurations are as follows. 

  1. Collector and monitored resources in the same domain, Collector and Watchdog services running as a domain account with local administrator privileges.
  2. Collector and monitored resources not in the same domain, Collector and Watchdog services running as a local administrator account, and connecting to each host with local administrator credentials (set using wmi.user and wmi.pass properties in LogicMonitor). 

Additionally, the account running the LogicMonitor Collector service must be granted “Log on as a service” under “Local Policy/User Rights Assignment” in the host OS local security policy settings.

Check the Connection

Once you’ve confirmed both the Collector and Watchdog services are running, the next step is to check whether those services can communicate with LogicMonitor’s cloud servers. The Collector uses port 443 and the HTTPS/TLS protocols to communicate with LogicMonitor’s data centers. A quick way to check if your Collector can make an outgoing connection to a LogicMonitor server is to access your LogicMonitor portal (https://<company>.logicmonitor.com) from a web browser on the Collector host. We recommend following the whitelist procedures as outlined in this support center article. While we recommend DNS-based whitelisting where possible, the IP ranges on this page are kept up to date and should be regularly reviewed and updated on the firewalls in your environment to ensure communication is maintained. Keep an eye on our Release Notes for updates to our Whitelist.
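On a host without a web browser, the same outbound check can be approximated with a short script. The Python sketch below attempts a TLS handshake with the portal on port 443 using only the standard library. The function names and the "acme" placeholder are illustrative assumptions. The sketch connects directly, so it does not account for a proxy the Collector may be configured to use.

```python
import socket
import ssl


def portal_host(company: str) -> str:
    """Build the portal hostname from the account name in the portal URL."""
    return f"{company}.logicmonitor.com"


def can_reach_portal(company: str, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with the portal succeeds on port 443."""
    host = portal_host(company)
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, 443), timeout=timeout) as sock:
            # The handshake happens while wrapping the socket.
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers DNS failures, refused or timed-out connections, and TLS errors.
        return False


# Example usage on the Collector host ("acme" is a placeholder account name):
# print(can_reach_portal("acme"))
```

A False result does not say which layer failed; DNS resolution, a firewall rule, and TLS inspection are all worth checking against the whitelist guidance.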

Review AntiVirus Software

After you’ve determined the Collector and Watchdog services are up and running and the Collector host is able to communicate with LogicMonitor’s servers, the next step is to review any AntiVirus software running on the Collector host. On Windows Collectors, you will need to ensure that the LogicMonitor directory C:\Program Files (x86)\LogicMonitor\ is added recursively as an exclusion in any AntiVirus software. AntiVirus software can incorrectly flag the Collector services as a problem and prevent them from running or communicating with LogicMonitor or monitored devices. More details about our Security Best Practices can be found here.
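"Recursively" means that every file and subfolder beneath the LogicMonitor directory should be covered by the exclusion, not only the top-level folder. The Python sketch below illustrates that containment test against a list of exclusion paths. How you export that list from your AntiVirus product varies by vendor, so the list is treated here as a given input and the helper name is hypothetical.

```python
from pathlib import PureWindowsPath
from typing import Iterable

# Directory named in the text above.
LM_DIR = r"C:\Program Files (x86)\LogicMonitor"


def is_covered(path: str, exclusions: Iterable[str]) -> bool:
    """Return True if path equals, or sits beneath, any excluded directory.

    PureWindowsPath compares case-insensitively, matching Windows behavior.
    """
    target = PureWindowsPath(path)
    for exclusion in exclusions:
        root = PureWindowsPath(exclusion)
        if target == root or root in target.parents:
            return True
    return False


# Example usage with a hypothetical exclusion list:
# is_covered(LM_DIR, [r"C:\Program Files (x86)\LogicMonitor"])
```

Comparing whole path components rather than string prefixes avoids treating a sibling folder with a similar name as covered.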

Once you have performed the above steps, there is a good chance your Collector is back up and running. This is a great time to take note of what the root cause of your issue was and how it may be prevented in the future. If something has changed in your environment, like a firewall rule or AntiVirus configuration, it may be a good time to make sure that your update and change processes include provisions to ensure continued monitoring functionality. If the steps above do not get your Collector back up and running, please contact LogicMonitor support, and an engineer will help you collect more detail on the issue and review the Collector logs to identify and correct the problem.

Review Collector Health

Another useful tool in LogicMonitor for understanding Collector behavior is the Collector Status. This can be accessed by going to the Collector in your LogicMonitor Portal, clicking the Manage icon, and then opening the Support drop-down. 

The Collector Status Option when managing a collector can help troubleshoot collector issues.

Collector Status is a great place to check on Collector health. It can indicate potentially problematic load issues and LogicModules with abnormally high numbers of failed polls. 

The top of the Collector Status gives a quick overview of the status of the varying metrics that make it up. Warning and Error status items should be investigated further.
The various metrics that make up Collector Status can flag load-related issues before they become serious. These change color to indicate potential problems and contain helpful messages.

Collector Status is not intended to be a complete view of Collector performance, but it is an excellent tool to quickly and efficiently identify the sources of unexplained issues. Highlighted issues can point you to areas of concern and help determine if your current Collector configuration is well suited to the monitoring load it is expected to handle. 

Also accessible from the Support menu for a Collector are Collector Events. This list tracks Collector restarts and can surface certain types of errors a Collector may encounter as reported by Watchdog. This is especially useful when looking for a Collector that goes down more than once or has a recurring behavior you’re looking into. Collector Events is great for finding patterns and can aid in understanding the daily behavior of a Collector. 

Collector Events for a healthy Collector showing its daily restart and credential rotation.
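When looking for patterns in Collector Events, it can help to count restarts per day and pick out the days that exceed the expected single daily restart. The Python sketch below assumes you have already copied the restart timestamps out of the events list by hand. The input format, function names, and the default of one expected restart per day are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def restarts_per_day(restart_times):
    """Count how many restarts occurred on each calendar day."""
    return Counter(t.date() for t in restart_times)


def days_with_extra_restarts(restart_times, expected_per_day=1):
    """Return the days, in order, with more restarts than expected."""
    counts = restarts_per_day(restart_times)
    return sorted(day for day, n in counts.items() if n > expected_per_day)


# Example usage with made-up timestamps:
# times = [datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 2, 3, 0),
#          datetime(2024, 1, 2, 14, 30)]
# days_with_extra_restarts(times)
```

Days that stand out are a good starting point for correlating restarts with changes in your environment, such as patching windows or firewall updates.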

Setting Up Resilient Monitoring

This is also a great time to consider your current Collector failover configuration. While a backup Collector is a great choice, you may also want to consider Auto-Balanced Collector Groups. An Auto-Balanced Collector Group balances load across a group of Collectors and provides even better fault tolerance should one of the Collectors in the group go down. You can read more about Auto-Balanced Collector Groups here.

You should now have a much better understanding of the LogicMonitor Collector, how it works, and how to maintain it within your environment. You also got to see some of the tools available in your LogicMonitor environment that can be used to help understand Collector performance. Use this knowledge to configure your Collectors with redundancy to ensure continuous data collection. Collector down alerts can often be resolved in just a few minutes with the steps discussed. Go forth, and monitor with confidence!

A Brief Summary of the Steps Above

Below is a handy list of what to do if a Collector goes down. It summarizes the steps discussed above. Feel free to use this as a reference if you encounter a down Collector. 

  1. Ensure the Collector and Watchdog Services are both running. 
  2. Check that the Collector host can communicate with LogicMonitor servers over Port 443 via HTTPS/TLS. This means checking both internet connectivity and that your whitelist is up to date. 
  3. Make sure the credentials for the Collector and Watchdog services have sufficient, supported permissions.
  4. Make sure AntiVirus software isn’t preventing the Collector services from functioning. 
  5. Review Collector Status and Collector Events.
  6. Be sure to set up a backup Collector or an Auto-Balanced Collector Group to ensure monitoring can be maintained if a Collector does go down.