Netapp Monitoring – too much of a good thing?

One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter – from AC units, through virtualization, OS’s to applications like MongoDB) is that we default to “monitoring” on”.

i.e. we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.)  This serves us well almost always – we will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc – this means that when you have an issue on that volume (or other groups are blaming storage performance for an issue), you already have the monitoring of all sorts of metrics in place before the issue – so you have the data and alerts to know whether the storage was or was not the issue.

However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk.  However, on customers with larger NetApps, there are often hundreds of disks, each of which we would monitor via API queries.  This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different.) And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes and PAM cards and things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.

We’d see that when we’d restart the collection process, and the only monitoring by the API was for the volume performance, things worked great – the response to an API request from a NetApp was around 100 ms.

When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once) – the API response time for a single query climbed up to 40 seconds.

This started causing a backlog of monitoring, and was causing data to be missed in the more important volume performance metrics.

So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.