Monitoring NTP. And why it matters…

LogicMonitor Opinion post

A guest post by one of our intrepid Support Engineers in the UK, Antony Hawkins.

“Catching the little problems you never knew you had (before they cause big problems you never want to deal with).”

So, you’ve configured and tested an NTP hierarchy through your estate and now all your devices run to the same time. You can leave it alone now, safe in the knowledge it’s working.
Can’t you?

Recently a customer got in touch, querying why LogicMonitor had started generating NTP alerts on parts of his estate, when his NTP setup had been configured and working correctly for some time previously.

Sure enough, LogicMonitor was reporting that some hosts had no NTP peers, even though those hosts clearly had peers set in their ntpd.conf files.

On further investigation it transpired that, while the hosts had peers, the peers themselves were, at times, being declared falsetickers. When all the peers for a given host were declared falsetickers, the host quite correctly reported that it had no valid peers from which to take a time signal. LogicMonitor then raised a ‘no peers’ alert: although NTP was running, no peer had passed the selection process.
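To make the failure mode concrete, here is a minimal Python sketch of the kind of check involved. It is not how LogicMonitor collects its data; it simply parses the text that `ntpq -pn` prints, where the first character of each peer line is a tally code (`*` for the selected system peer, `x` for a falseticker, and so on). The sample output, the addresses and the function names `parse_peers` and `evaluate` are made up for illustration.

```python
# Tally codes ntpq prints in column 0 of each peer line (reference ntpd).
TALLY = {
    "*": "sys.peer",
    "o": "pps.peer",
    "+": "candidate",
    "#": "backup",
    "-": "outlier",
    ".": "excess",
    "x": "falseticker",
    " ": "reject",
}


def parse_peers(ntpq_output):
    """Return a list of (tally_char, remote) pairs from `ntpq -pn` text."""
    peers = []
    for line in ntpq_output.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header and the ===== ruler.
        if not stripped or stripped.startswith("remote") or stripped.startswith("="):
            continue
        # Column 0 is the tally code; the remote address follows it.
        peers.append((line[0], line[1:].split()[0]))
    return peers


def evaluate(peers):
    """Summarise the peer list as a single status string."""
    if not peers:
        return "no peers configured"
    if any(tally in ("*", "o") for tally, _ in peers):
        return "synchronised"
    if all(tally == "x" for tally, _ in peers):
        return "no valid peers: all falsetickers"
    return "no peer selected"


# Hypothetical output from a host whose two peers disagree with each other.
SAMPLE_BAD = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
x192.0.2.10      .GPS.            1 u   33   64  377    0.512   85.201   0.044
x192.0.2.11      192.0.2.1        2 u   40   64  377    0.618  -92.730   0.061
"""

# Hypothetical output from a healthy host.
SAMPLE_GOOD = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.512    0.201   0.044
+192.0.2.11      192.0.2.1        2 u   40   64  377    0.618   -0.130   0.061
"""

if __name__ == "__main__":
    for name, text in (("bad", SAMPLE_BAD), ("good", SAMPLE_GOOD)):
        print(name, "->", evaluate(parse_peers(text)))
```

Note that both peers in the bad sample are reachable and configured, which is exactly why a glance at the configuration file would not have revealed the problem.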

LogicMonitor was only reporting information gathered from the hosts themselves. If you had the time to check the NTP responses on all your hosts by hand every five minutes, and knew what you were looking for, you could have found the falseticker responses yourself. However, if you’re anything like me, you probably don’t have the time to check the NTP configuration and responses on your whole estate every five minutes just in case something’s wrong!

Does this loss of NTP synchronization matter? It certainly can. Correlating log entries across different systems is practically impossible unless those systems share a common time. Loss of correct time can also cause SSL certificates to fail validation, VPNs to break, and all sorts of insidious, hard-to-find issues.
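The log correlation problem is easy to demonstrate. The short Python sketch below uses two hypothetical hosts, `web` and `db`; the three-second clock error on `db` is an invented figure chosen only to make the effect obvious. The database event really happens one second after the web event, yet sorting by logged timestamp puts it first.

```python
from datetime import datetime, timedelta


def stamp(true_time, clock_offset_s):
    """Timestamp a host would log, given how far its clock is off (seconds)."""
    return true_time + timedelta(seconds=clock_offset_s)


t0 = datetime(2024, 1, 1, 12, 0, 0)

# True order: the web request arrives, then the database query runs 1 s later.
events = [
    ("web", "request received", stamp(t0, 0.0)),                         # accurate clock
    ("db", "query executed", stamp(t0 + timedelta(seconds=1), -3.0)),    # clock 3 s slow
]

# Merging the logs by timestamp reverses cause and effect.
by_timestamp = sorted(events, key=lambda event: event[2])
order = [host for host, _, _ in by_timestamp]

if __name__ == "__main__":
    for host, message, logged_at in by_timestamp:
        print(logged_at.isoformat(), host, message)
```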

Without automated monitoring, you simply wouldn’t know there was a problem until a machine became sufficiently out of sync with the rest of your estate to cause other problems, for example in your business-critical database applications. When those critical applications start having problems, is your entire NTP hierarchy the first thing you check? If you took a set-it-and-forget-it approach, probably not. But that’s what monitoring is for: it lets you focus on more strategic issues, confident that you will be alerted when something needs attention.