The value of IPMI monitoring

Among its many monitoring methods, LogicMonitor supports IPMI. Many people aren’t aware of IPMI, or don’t think it’s necessary. And while I’m certainly an advocate of avoiding unnecessary complexity in a data center, sometimes it is good to wear both a belt and suspenders.

A real-life example from one of our own data centers conveniently occurred just this morning, when I was looking for fodder to blog about:

We received several email alerts like the one below:

Host: console.lab2.sjc.logicmonitor.com
Eventsource: IPMI SEL Logs-  BMC  Battery 0x11 Failed 6f [01 ff ff]
Level: error
Detected on: 2012-03-23 08:45:52 PDT

Looking at the host in question in our monitoring portal showed the repeated events:

[Image: IPMI alerts]

And logging in to the device itself – a Dell DRAC card – showed the events logged directly:

[Image: DRAC log]

This particular device was the DRAC of a Dell server running VMware ESXi – which of course was also monitored by LogicMonitor.

However, the hardware monitoring for the ESX host was not reporting any issues at all through vCenter or LogicMonitor – even though this specific component was monitored and reported by the ESXi API.

My guess is that the battery issues are so transient – you can see from the DRAC logs that they cleared themselves within 5 seconds – that the ESX hardware monitoring never picked them up.
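To make that guess concrete, here is a minimal sketch of why sampling a sensor's current state can miss a short fault that an event log still records. It is only an illustration: the 60-second poll interval, the fault timestamps, and the function names are all assumptions made up for this example, and it does not describe how ESXi, vCenter, or LogicMonitor actually collect hardware health.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    """A fault that is active from `start` up to (not including) `end`, in seconds."""
    start: float
    end: float


def polled_detections(faults, poll_interval, horizon, offset=0.0):
    """Sample the current state every `poll_interval` seconds.

    Returns the poll times at which some fault happened to be active.
    """
    hits = []
    t = offset
    while t < horizon:
        if any(f.start <= t < f.end for f in faults):
            hits.append(t)
        t += poll_interval
    return hits


def event_log_entries(faults):
    """Record every assert/deassert transition, the way an event log would.

    The entries persist no matter when (or how rarely) the log is read.
    """
    entries = []
    for f in faults:
        entries.append((f.start, "asserted"))
        entries.append((f.end, "deasserted"))
    return sorted(entries)


# Hypothetical 5-second fault, similar in length to the ones in the DRAC log.
faults = [Fault(start=130.0, end=135.0)]

# Hypothetical 60-second state poll over a 10-minute window.
poll_hits = polled_detections(faults, poll_interval=60.0, horizon=600.0)
log = event_log_entries(faults)

print("polls that saw the fault:", poll_hits)
print("event log entries:", log)
```

With these made-up numbers the polls land at 120 and 180 seconds, on either side of the fault, so state polling sees nothing, while the event log keeps both the assert and the deassert entries for whoever reads it later.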

So in this case, having IPMI monitoring as well as the regular ESX hardware monitoring allowed us to identify this issue much sooner. Now we can open a case with Dell and have the issue remedied. We can migrate VMs to other ESX servers and avoid any impact. It is likely that the ESX software will notice the storage controller battery issues once they become severe enough, at which point LogicMonitor will alert on them – but I’d rather be aware of issues that could impact the availability and performance of my ESX hosts as soon as possible. (Performance could be impacted because the controllers will almost certainly switch to write-through mode, instead of using the NVRAM cache to accelerate writes, when the storage controller battery fails.)

How many servers do you have where IPMI – or LogicMonitor’s other extensive monitoring methods – can help you avoid performance and availability issues?

Update: vCenter finally noticed the issue – about 20 hours later. Even then, vCenter only reports the issue as an error in “VMware Rollup Health State”, with no details as to what the issue is.