From Settings | Collectors you can control how much information is logged by your collector and how long these log files are retained.  

 

Adjusting log levels

You may want to adjust log levels to increase how much information is logged when debugging an issue, or to decrease how much information is logged to save disk space. Select the Logs icon for the desired collector (from Settings | Collectors) and then select manage to see the log levels on a per-component basis for that collector:


The log level for each collector component controls what information is logged for that component.  Available log levels are:

  • trace – this log level is the most verbose, and will log every action of the collector in full detail.  Note that this can use a significant amount of disk space if your collector is actively monitoring a large number of devices, and as such is typically only recommended for debugging purposes.
  • debug – detailed information about collector tasks will be logged (not as much information as trace log level).  The debug log level can make it easier to identify an issue and track down the root cause.
  • info – this is the default log level, and will log basic information about collector tasks.
  • warn – information will only be logged when something isn’t quite right, but it may not be causing an issue yet.
  • error – information will only be logged when something is wrong.
  • disable – no information will be logged.

 

As an example, suppose you have written a script datasource but your collector is getting no data, and you can't figure out the problem. You could increase the log level for the collector.script component to debug or trace and then look at the logs (either using the collector debug facility or on the collector machine itself) to troubleshoot the issue.

 

Changing log file retention

Collector log files are rotated based on size, not date. By default, there are 3 log files of 64 MB each. If you'd like to change these numbers, you can do so in the wrapper.conf file, located in the conf directory where the collector is installed. You can edit wrapper.conf on the collector machine itself, or you can edit it directly from your LogicMonitor account UI: navigate to Settings | Collectors, select manage for the desired collector, select Collector Configuration from the dropdown menu, and then select the Wrapper Config tab. Locate the Wrapper Logging Properties and change these values (make sure to override the wrapper config before saving and restarting):
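
The relevant entries typically look something like the following (a sketch based on the defaults noted above and standard Java Service Wrapper property names; confirm the exact names and values in your own wrapper.conf):

    # Wrapper Logging Properties (defaults): rotate the log file once it
    # reaches 64 MB, and keep at most 3 rotated log files
    wrapper.logfile.maxsize=64m
    wrapper.logfile.maxfiles=3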


Sending logs to LogicMonitor

From the Manage dialog you can send your logs to LogicMonitor support.  This might be useful if you are collaborating with our support team and would like them to be able to look through your collector log files.  Select the manage gear icon for the desired collector and then select ‘Send logs to LogicMonitor’:


Logs that you’ve sent to LogicMonitor support will be displayed in the Logs section for that Collector. Filter based on time range to limit what is displayed.

Overview

The LogicMonitor Collector is the heart of your monitoring system. As such, it’s important that you monitor your Collectors to ensure that performance is keeping up with data collection load. Equally important is ensuring the least disruption possible when a Collector does go down. This includes making sure timely notifications are delivered to the appropriate recipient(s).

As best practice, LogicMonitor recommends that you (1) set up monitoring for your Collectors and (2) configure notification routing for Collector down alerts.

Adding the Collector Host into Monitoring

If it isn’t already part of your monitoring operations, add the device on which the Collector is installed into monitoring. This will allow you to keep tabs on CPU utilization, disk space and other metrics important to smooth Collector operation. For more information on adding devices into monitoring, see Adding Devices.

Enabling Collector DataSources on the Host

LogicMonitor provides a series of built-in Collector DataSources that provide insight into a Collector’s operations, performance, and workload. In most cases, these Collector DataSources will be automatically applied to the Collector device when you add it into monitoring. You can verify this is the case by expanding the device in the Resources tree and looking for the “Collector” DataSource group.


If the Collector DataSources were not automatically applied to the device, you can do so manually by adding the value of “collector” to the device’s system.categories property. For more information on setting properties, see Resource and Instance Properties.
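
As a quick illustration, the property might end up looking like the following after the edit (the "snmp" value here is purely illustrative; "collector" is simply appended to whatever comma-separated values system.categories already holds):

    system.categories = snmp,collector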

LogicMonitor will now index this device as the host of a Collector, and automatically apply the Collector DataSources to it. Once Collector DataSources are in place, you can configure alerts to warn you when Collector performance is deficient.

Note: Collector DataSources only monitor the device’s preferred Collector (as established in the device’s configurations). The preferred Collector should be the Collector that is installed on that device. Otherwise, the Collector’s metrics will display on the wrong host. For example, if you attempt to monitor Collector A using Collector B (installed on a separate host), then Collector B’s metrics will display in lieu of Collector A’s on Collector A’s host.

Collector DataSources

Migration from Legacy DataSources

In March of 2019, LogicMonitor released a new set of Collector DataSources. If you are currently monitoring Collector hosts using the legacy DataSources, you will not experience any data loss upon importing the newer DataSources in this package. This is because DataSource names have been changed to eliminate module overwriting.

However, you will collect duplicate data and receive duplicate alerts for as long as both sets of DataSources are active. For this reason, we recommend that you disable the legacy Collector DataSources. The legacy DataSources are any Collector DataSources whose names are NOT prefixed with “LogicMonitor_Collector”. If prefixed with “LogicMonitor_Collector”, it is a current Collector DataSource.

When a DataSource is disabled, it stops querying the host and generating alerts, but maintains all historical data. At some point in time, you may want to delete the legacy DataSources altogether, but consider this move carefully as all historical data will be lost upon deletion. For more information on disabling DataSources, see Disabling Monitoring for a DataSource or Instance.

DataSource Example Highlight: Collector Data Collecting Tasks

One of the Collector DataSources applied is the “Collector Data Collecting Tasks” DataSource. It monitors statistics for collection times, execution time, success/fail rates, and number of active collection tasks. One of the overview graphs available for this DataSource features the top 10 tasks contributing to your Collector’s load, which is extremely useful for identifying the source of CPU or memory usage.


Routing Collector Down Alerts

A Collector is declared down when LogicMonitor’s servers have not heard from it for three minutes. Even though you will likely have a backup Collector in place for when a Collector goes down, it’s never an ideal situation for a Collector to be unexpectedly offline. To minimize downtime and mitigate the risk of interrupted monitoring, ensure that “Collector down” alerts will actively be delivered (as email, text, and so on) to the appropriate individuals in your organization. (These alerts will also be displayed in the LogicMonitor interface.)

Important: When a Collector is declared down, alerts that were triggered by the devices monitored by that Collector before the Collector went down will remain active, but new alerts will not be generated while the Collector is down. However, devices that do not fail over to another Collector will ignore the alert generation suppression and may generate Host Status alerts while the Collector status is down.

To route Collector down alerts, open the Collector’s configurations (Settings | Collectors | Manage) and specify the following:

Note: By default, an “Alert clear” notification is automatically delivered to all escalation chain recipients when a downed Collector comes back online. You can override this default by expanding the Collector details and unchecking the Alert on Clear option, shown next. However, if the Collector’s designated escalation chain routes alert notifications to an LM Integration, we recommend that you do not disable this option. For more information, see Alert Rules and Escalation Chains.

LogicMonitor’s collectors are configured to work well in most environments, but they may need tuning in some cases.

Performance Overview

There is a trade-off between the collector’s resource consumption (CPU and memory) and its performance. By default, the collector does not consume many resources, so tuning may be required in large environments; in environments where a collector is not doing a variety of work (e.g. a collector doing almost all JMX collection, instead of a mix of SNMP, JMX, and JDBC); or in environments where many devices are not responding. Tuning may involve adjusting the collector’s configuration, or it may involve redistributing workloads.

A common reason for a collector to no longer be able to keep up with the same devices it has been monitoring is that some of those devices have stopped responding. For example, if a collector is monitoring 100 devices with no queuing, but then starts showing task queuing, or is unable to schedule tasks, this may well be because it can no longer collect data from some of the devices. If it was talking to all those devices via JMX, and each device normally responded to a JMX query in 200 ms, it could cycle through all the devices easily. However, if the JMX credentials now mismatch on 10 of the hosts, so that they no longer respond to LogicMonitor’s queries, the collector will keep a thread open for each of those hosts until the configured JMX timeout expires. It is now keeping several threads open, waiting for responses that will never come. Tuning can help in this situation.
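
To put rough numbers on the example above (an illustration only, assuming the 30-second JMX timeout that was once the default, as discussed in the tuning sections below):

    90 responsive hosts   x 0.2 s per JMX query = 18 s of productive collection work
    10 unresponsive hosts x 30 s timeout each   = 300 s of thread time spent waiting for replies that never come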

How do I know if a Collector needs tuning?

Assuming you’ve set up Collector monitoring, you will be alerted by the Collector Data Collecting Task datasource if the collector is unable to schedule tasks. This is a clear indication that the workload of a collector needs tuning, as data is not being collected in accordance with the datasource schedule. This may result in gaps in graphs. Another metric to watch is the presence of elements in the Task Queue. This indicates that the collector is having to wait to schedule tasks, but that they are still completing in the appropriate time – so it’s a leading indicator of a collector approaching its configured capacity.

You can see in the graphs below that the Collector datasources clearly show an overloaded collector – there are many tasks that cannot be scheduled, and the task queue is very high. After tuning (Aug 26), the number of successful tasks increases; the number of unscheduled tasks drops to zero, as does the task queue.

 


A good proactive practice is to create a collector dashboard with a Custom Graph showing the top 10 collectors by the datapoint UnavailableScheduleTaskRate, for all instances of the Data Collecting Task datasource on all devices, and another showing the top 10 collectors by TasksCountInQueue. Given that each collector has many instances of this datasource (one for each collection method), you may have to specify particular collection methods as instances – snmp, jmx, etc. – in order not to exceed the instance limit on a custom graph. Otherwise, set instances to an asterisk (*) to see all methods on one graph.
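
As a rough sketch of what those two graph definitions might look like (the graph names and display settings here are illustrative; adjust them to match your own dashboard):

    Graph 1: "Top 10 – Unscheduled Tasks"
        DataSource: Data Collecting Tasks
        Instances:  * (or a specific method such as snmp or jmx)
        Datapoint:  UnavailableScheduleTaskRate
        Display:    top 10 collectors

    Graph 2: "Top 10 – Task Queue"
        DataSource: Data Collecting Tasks
        Instances:  *
        Datapoint:  TasksCountInQueue
        Display:    top 10 collectors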

Collector Tuning

The easiest way to tune your Collector is simply to increase the Collector size. The small Collector uses only 2 GB of memory, but can perform more work if upgraded to a larger size (provided the server running the Collector has the memory available). The Collector’s configuration can also be modified manually, as discussed in Editing the Collector Config Files.

In general, there are two cases that could require Collector tuning:

  • when devices are not responding
  • when the Collector cannot keep up with the workload

Both are often addressed by increasing Collector Size, which should be your first step. However, if you’ve already tried increasing the size and still see performance issues, you may find it helpful to do a little fine tuning.

Devices not responding

If devices are failing to respond to a query from the collector, whether because their credentials have changed, the device is offline, the LogicMonitor credentials were set incorrectly, or for other reasons, you should get alerts about the protocol not responding on the device. The best approach in this situation is to correct the underlying issue (set the credentials, etc.) so that monitoring can resume on the devices. However, this is not always possible. You can validate from the Collector debug window (under Settings | Collectors | Manage Collector | Support | Run Debug Command) whether this issue is impacting your collectors. If you run the command !tlist c=METHOD, where METHOD is the data collection method at issue (jmx, snmp, WMI, etc.), you will get a list of all the tasks of that type the collector has scheduled.
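
For example, to list the JMX collection tasks the collector currently has scheduled, run the following in the debug window:

    !tlist c=jmx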

If you see many tasks that failed due to timeout or non-response, those tasks are keeping a thread busy for the timeout period of that protocol. In this situation, it may be appropriate to reduce the configured timeout, to stop threads from blocking for so long. The default for JMX timeouts was 30 seconds at one point, which is a very long time for a computer to respond. Setting that to 5 seconds (the current default) means that, for a non-responsive device, 6 times as many tasks can be processed in the same time. Care should be taken when setting timeouts to ensure they are reasonable for your environment. While it may be appropriate to set the JMX timeout to 5 seconds, the webpage collector may be left at 30 seconds, as you may have web pages that take that long to render. Setting a timeout to a shorter period than it takes devices to respond will adversely affect monitoring.

To change the timeout for a protocol, you must edit the collector configuration manually from the Collector Configuration window. Edit the collector.*.timeout parameter to change the timeout for the protocol you want (ex: change collector.jmx.timeout=30 to collector.jmx.timeout=5).
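
For instance, the JMX change above would look like this in the collector’s configuration (typically the agent.conf settings exposed in the Collector Configuration window; the parameter follows the collector.*.timeout pattern described above):

    # was collector.jmx.timeout=30; give up on unresponsive JMX queries after 5 seconds
    collector.jmx.timeout=5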

In addition to reducing the timeout period, you may also need to increase the number of threads – see the section below.

Collector cannot keep up with workload

If the Collector is still reporting that tasks cannot be scheduled, it may be appropriate to increase the number of threads for a collection method. This will allow the collector to perform more work simultaneously (especially if some threads are just waiting for timeouts), but it will also increase the collector’s CPU usage.

To increase the threads available to a collection method, you must edit the collector configuration manually from the Collector Configuration window. Edit the collector.*.threadpool parameter to change the threadpool allotment for the protocol you want (ex: change collector.jmx.threadpool=50 to collector.jmx.threadpool=150).
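
Continuing the JMX example from the text (again set in the collector’s configuration, using the collector.*.threadpool pattern):

    # was collector.jmx.threadpool=50; allow up to 150 concurrent JMX collection threads
    collector.jmx.threadpool=150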

It is recommended to increase the threadpool setting gradually – try doubling the current setting, then observe the behavior. Note changes in the collector’s CPU utilization and heap utilization – more threads will use more CPU and place more demands on the JVM heap. If the collector’s heap usage (shown by the Collector JVM Status datasource) is approaching the limit, that may need to be increased too.

If a collector has had its threads increased and its heap increased, and it still cannot keep up with the workload (or is hitting CPU capacity), it is time to add another collector and split the workload amongst the collectors.