Troubleshooting Active Discovery with functioning SNMP
Sometimes you'll find that a particular datasource is not applying to a device, even though all other datasources using the same protocol seem to be working fine. There can be many reasons for this, but it is best to start with the most basic causes and gradually work your way toward more intensive checks.
In this example, we had a device that had discovered all SNMP-related datasources, but the Interfaces datasource was not displaying on the host. It does not seem to be a permissions or protocol issue, as other datasources using the same protocol are working fine.
Check that the datasource is associated with the host
You can check this by going to the datasource and confirming that the host is associated with it. Go to Settings/LogicModules/Datasources, then search for the datasource you want to check. Once you click on the datasource, you should see a box in the upper right corner that says "More". Click on that, then click "Show Associated Devices".
Check that the target device appears in the association list.
If it does, the next thing to try is running Active Discovery manually on the host. To do this, click the blue arrow next to the Manage button on the device, and select Run Active Discovery.
After running Active Discovery manually, refreshing the page about 30 seconds later should show the datasource with instances, if that was indeed the issue. If it does not, we can move on to further investigation.
Active Discovery Timeout
We can go to the collector debug window and see further details of the discovery task. Once in the debug console, we will want to check the currently running Active Discovery tasks for this host. If there are other devices on this collector, you can narrow the results by filtering for only the device in question.
From the results, you can search for the SNMP64_if- datasource, which is the one we are trying to track down, and see that the Active Discovery task timed out.
From here I would normally check whether the device is taking longer than normal to respond to that specific query, which would explain why this particular datasource is timing out and not discovering instances. In this example, I will use the OID used to discover instances in the SNMP64_if- datasource and run queries to test for a timeout.
The first query I will run simply checks whether I can get a response from an snmpwalk to the device for the AD OID using the default timeout value (3 seconds). In this example, walking that OID results in a timeout, matching what was reported in the original AD task.
We can increase the timeout manually to something a little longer and see if we get results back. Here, it was increased to 30 seconds to allow enough time for the device to respond.
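The same comparison can also be scripted outside the collector debug window. The following is a minimal sketch, not the collector's own method: it assumes the Net-SNMP snmpwalk command-line tool is installed and that the device speaks SNMP v2c. The host address, community string, and OID in the usage comment are placeholders. Note that Net-SNMP's -t option is a per-request timeout in seconds, so it may not map exactly onto the collector's timeout setting.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def build_snmpwalk_command(host: str, oid: str, community: str,
                           timeout_s: int, retries: int = 0) -> List[str]:
    """Build a Net-SNMP snmpwalk command line (SNMP v2c is an assumption)."""
    return [
        "snmpwalk",
        "-v", "2c",             # SNMP version (assumed for this sketch)
        "-c", community,        # community string (placeholder)
        "-t", str(timeout_s),   # per-request timeout in seconds
        "-r", str(retries),     # retries after a timeout
        host,
        oid,
    ]


def walk_succeeds(host: str, oid: str, community: str, timeout_s: int,
                  overall_limit_s: int = 300) -> bool:
    """Return True if the walk exits cleanly and prints at least one line.

    The content of the output is not inspected, only that something came back.
    """
    cmd = build_snmpwalk_command(host, oid, community, timeout_s)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=overall_limit_s)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # snmpwalk is not installed, or the whole walk ran past the limit.
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


def first_working_timeout(probe: Callable[[int], bool],
                          timeouts: Iterable[int]) -> Optional[int]:
    """Try each timeout in order and return the first one the probe accepts."""
    for timeout_s in timeouts:
        if probe(timeout_s):
            return timeout_s
    return None


# Example usage (placeholders: documentation address, "public", ifDescr OID):
# probe = lambda t: walk_succeeds("192.0.2.10", "1.3.6.1.2.1.2.2.1.2", "public", t)
# print(first_working_timeout(probe, [3, 10, 30]))
```

Trying the timeouts in increasing order shows the smallest value at which the device answers, which is a more useful number than simply knowing that 30 seconds works.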
If it looks like the issue may be related to a longer than normal response from the device, we can then adjust the collector configuration file to allow the collector to wait a bit longer for responses for specific protocols. Be very conservative when adjusting these settings, as setting the wait time too high can start to impact collector performance and cause tasks to back up and eventually fail. I would start by doubling the timeout from something like 5 seconds to 10 seconds, saving, then testing the results.
We have seen AD failures caused by responses timing out on queries that return a large number of instances. Juniper devices are notorious for having a very large number of interfaces, which makes them take a bit longer to respond to our SNMP queries. In the past we have adjusted the SNMP timeout for AD as described above, and it has resolved the issue, allowing the large number of instances to be discovered.
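The conservative doubling described in this section can be sketched as a small helper that works on key=value configuration text. This is only an illustration: the key name "example.snmp.timeout" is hypothetical and is not a real collector setting, and the ceiling is an arbitrary guard added here so that repeated doubling cannot run away. Look up the actual setting name in your collector configuration before changing anything.

```python
def double_timeout(conf_text: str, key: str, ceiling: int) -> str:
    """Double the integer value of `key` in key=value text, capped at `ceiling`.

    Lines that do not match the key (including comments) are left unchanged.
    """
    updated_lines = []
    for line in conf_text.splitlines():
        name, separator, value = line.partition("=")
        if separator and name.strip() == key and value.strip().isdigit():
            new_value = min(int(value.strip()) * 2, ceiling)
            line = f"{name.strip()}={new_value}"
        updated_lines.append(line)
    return "\n".join(updated_lines)


# Hypothetical configuration fragment; the key name is made up for this sketch.
sample_conf = "# sample settings\nexample.snmp.timeout=5\nother.setting=abc"
print(double_timeout(sample_conf, "example.snmp.timeout", ceiling=20))
```

Applying one doubling at a time, then re-running Active Discovery, keeps the change small enough to notice if collector tasks start backing up.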
Check Datasource filters
There have been times when filters are the cause of instances not being discovered, either because of the filter criteria themselves or because of timeouts as mentioned in the previous section. With SNMP filters, we usually check an OID and filter based on the response. While troubleshooting, you may see that the OID used to discover the instances returns perfectly fine, but one of the OIDs used in the filters is timing out, causing all of Active Discovery to fail and no instances to be discovered. When trying to determine whether filters are causing the issue, there are a few different approaches you can take.
METHOD 1 - Clone and Remove
You can easily determine whether it is a filter issue by cloning the datasource, removing all filters, applying it to the host you are troubleshooting, then saving the datasource. If instances are discovered immediately with no filters on the cloned datasource, then you can start looking into each individual filter on the original datasource and try to identify which one is causing the failure.
You can do this by looking at the values in the debug window. In this example, we ran the OIDs for each filter in the datasource, and by process of elimination, only one instance should be discovered in this scenario.
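The process of elimination can be sketched in code once you have collected the value each filter OID returns per instance. This is a simplified model, not the collector's implementation: the operator names, the attribute names, and the sample values are all made up for illustration, and an instance with a missing value is simply dropped here, whereas a timed-out filter OID can fail the whole discovery.

```python
import re
from typing import Dict, List, Tuple

Filter = Tuple[str, str, str]  # (attribute, operator, expected value)

# A small, illustrative subset of comparison operators.
OPERATORS = {
    "equal": lambda actual, expected: actual == expected,
    "not_equal": lambda actual, expected: actual != expected,
    "contain": lambda actual, expected: expected in actual,
    "regex_match": lambda actual, expected: re.search(expected, actual) is not None,
}


def apply_filters(instances: Dict[str, Dict[str, str]],
                  filters: List[Filter]):
    """Apply every filter in turn; return (surviving, dropped-with-reason)."""
    surviving = dict(instances)
    dropped: Dict[str, str] = {}
    for attribute, operator, expected in filters:
        check = OPERATORS[operator]
        for index in list(surviving):
            actual = surviving[index].get(attribute)
            if actual is None or not check(actual, expected):
                # Record which filter eliminated this instance.
                dropped[index] = f"{attribute} {operator} {expected}"
                del surviving[index]
    return surviving, dropped


# Hypothetical values gathered by walking each filter OID by hand.
instances = {
    "1": {"ifDescr": "ge-0/0/0", "ifOperStatus": "1", "ifType": "6"},
    "2": {"ifDescr": "ge-0/0/1", "ifOperStatus": "2", "ifType": "6"},
    "3": {"ifDescr": "lo0", "ifOperStatus": "1", "ifType": "24"},
}
filters = [("ifOperStatus", "equal", "1"), ("ifType", "not_equal", "24")]

surviving, dropped = apply_filters(instances, filters)
print(sorted(surviving))  # indexes that pass every filter
print(dropped)            # which filter removed each of the others
```

Recording the filter that removed each instance makes it easier to see whether a single filter accounts for most of the missing instances.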
METHOD 2 - Check the details of the Active Discovery task using the !adetail command
Instead of the clone and remove method, which quickly identifies filters as the cause and then has you work backwards through each filter, you can check the details of the Active Discovery task directly.
A good indication that a filter is the cause of instances not being discovered is that the Active Discovery task appears in the debug window as completed rather than failed. If it completed but no instances are displayed, it means instances were found using the OID for Active Discovery, but when the filters set on the datasource were processed, none of the instances met the criteria, and therefore none were shown for the device.
Use the taskid and the !adetail command to see more information.
Once we run this command, the output shows which instances were discovered, which filters were processed, and which instances made it through the process. In this example I created a filter looking for interfaces that had the word "failtest" in the instance name. None of these instances had it, which resulted in no instances being discovered. To resolve this, you would remove the filter.
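The outcome in this example can be modeled in a few lines. This sketch is an illustration only: the interface names are made up, and a case-sensitive substring match is assumed for the "contains" style filter. It separates the two situations that look the same in the portal (no instances shown) but have different causes.

```python
from typing import List, Tuple


def summarize_discovery(raw_names: List[str],
                        required_substring: str) -> Tuple[List[str], str]:
    """Apply a 'name contains' filter and say why instances are missing."""
    kept = [name for name in raw_names if required_substring in name]
    if not raw_names:
        verdict = "discovery returned no instances"
    elif not kept:
        verdict = "filter removed every instance"
    else:
        verdict = "instances discovered"
    return kept, verdict


# Hypothetical interface names returned by the discovery OID.
names = ["ge-0/0/0", "ge-0/0/1", "lo0"]
print(summarize_discovery(names, "failtest"))
```

If the verdict is that the filter removed every instance, the discovery OID itself is working and the filter criteria are the place to look.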