Linux Monitoring, Net-SNMP and terabyte file systems

Or, how to deal with signed integers in a way that makes sense when doing Linux Monitoring.

A customer contacted us this week and said “Hey, one of my filesystems that was being monitored by LogicMonitor disappeared after I grew it.”  Turns out the filesystem in question was now a bit over 2 terabytes.

Some poking around showed that the file system was being filtered out of discovery, as net-snmp was reporting a size for the file system (via 1.3.6.1.2.1.25.2.3.1.5, hrStorageSize) of -1982127408. Yes, that’s a negative value.

The hrStorageSize object is defined as Integer32 – so it’s really a signed integer, with a maximum value of 2147483647 (2^31 - 1). Reach 2147483648 allocation units or more, and you’ll be in negative territory (as the first bit will be interpreted as the sign).
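To make the wraparound concrete, here is a minimal Python sketch of what happens when a 32-bit value is read as signed. The helper name as_signed32 is made up for illustration; it is not part of net-snmp or LogicMonitor.

```python
import struct


def as_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement integer."""
    # Pack as an unsigned 32-bit integer, then unpack the same bytes as signed.
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


print(as_signed32(2147483647))  # 2147483647, the largest value that stays positive
print(as_signed32(2147483648))  # -2147483648, one more and the sign bit is set
print(as_signed32(2312839888))  # -1982127408, the value net-snmp reported above
```

In other words, the -1982127408 in the customer’s case corresponds to a real size of 2312839888 allocation units.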

So, instead of calculating disk usage (as a percentage) like this:

  • let StorageSize be the value reported by .1.3.6.1.2.1.25.2.3.1.5 (hrStorageSize) for the filesystem
  • let StorageUsed be the value reported by .1.3.6.1.2.1.25.2.3.1.6 (hrStorageUsed) for the filesystem
  • thus the percentage of disk Usage is: 100*StorageUsed/StorageSize

we can change the formula LogicMonitor uses to calculate the percentage of disk space to:

100*(if(lt(StorageUsed,0),4294967296+StorageUsed,StorageUsed))/
(if(lt(StorageSize,0),4294967296+StorageSize,StorageSize))

which takes account of the fact that anything above 2147483647 will be reported as a negative number, and corrects for it. (The correction holds as long as the true value stays below 4294967296 allocation units.)

In English, the above formula says:

  • if StorageUsed < 0, add 4294967296 (2^32) to it
  • if StorageSize < 0, add 4294967296 to it
  • then compute as before: PercentUsage = 100*StorageUsed/StorageSize
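The same three steps can be sketched in Python. This is only an illustration of the arithmetic, not LogicMonitor’s implementation; the function names unsign32 and percent_used are hypothetical, and the StorageUsed figure in the example is invented.

```python
TWO_32 = 2 ** 32  # 4294967296


def unsign32(value: int) -> int:
    """Undo the sign wrap: negative readings get 2^32 added back."""
    return value + TWO_32 if value < 0 else value


def percent_used(storage_used: int, storage_size: int) -> float:
    """Percentage of disk usage from raw hrStorageUsed / hrStorageSize readings."""
    used = unsign32(storage_used)
    size = unsign32(storage_size)
    if size == 0:
        raise ValueError("StorageSize is zero; percentage is undefined")
    return 100 * used / size


# StorageSize is the value from the customer's filesystem above;
# StorageUsed is a made-up reading for the example.
print(percent_used(1000000000, -1982127408))
```

Without the correction, the same inputs would give a negative percentage, which is why the filesystem looked invalid.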

We use a similar formula in the graphing definition of the Linux Disk Usage datasource, although there the values are also multiplied by the size of the Allocation Units, so you get an accurate representation of the size of the file system.
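As a rough sketch of that multiplication (not the actual datasource definition): correct the sign first, then multiply by the allocation unit size, which net-snmp reports in hrStorageAllocationUnits (.1.3.6.1.2.1.25.2.3.1.4). The 1024-byte unit below is an assumption chosen for the example, and storage_bytes is a hypothetical helper name.

```python
TWO_32 = 2 ** 32  # 4294967296


def unsign32(value: int) -> int:
    """Undo the sign wrap: negative readings get 2^32 added back."""
    return value + TWO_32 if value < 0 else value


def storage_bytes(raw_units: int, allocation_unit_bytes: int) -> int:
    """Convert a raw hrStorageSize or hrStorageUsed reading to bytes."""
    # Correct the sign before multiplying; doing it afterwards would
    # require scaling the 2^32 offset by the unit size as well.
    return unsign32(raw_units) * allocation_unit_bytes


# Assumed allocation unit of 1024 bytes (hypothetical for this example).
size_bytes = storage_bytes(-1982127408, 1024)
print(f"{size_bytes / 2**40:.2f} TiB")
```

With a 1024-byte unit, the reported -1982127408 works out to a bit over 2 TiB, which matches the filesystem described at the start.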

We’ve updated LogicMonitor and its core datasource repository, so all customers can now avoid this problem when they deploy terabyte-size filesystems.

This adjustment can be used for other values reported as signed integers when you don’t want them treated as signed. So, for everyone running into this issue – you don’t need to update net-snmp (which a lot of people seem to be calling for) or define a new MIB object. Just configure your monitoring and graphing systems to correct for the sign, as above.

And if your monitoring systems can’t, well – you can always switch to LogicMonitor.