Cisco switch temperature problems solved by newbie

As the new hire here at LogicMonitor brought in to support the operations of the organization, I had two immediate tasks: Learn how LogicMonitor’s SaaS-based monitoring works to monitor our customer’s servers, and at the same time, learn our own infrastructure.

I’ve been a SysA for a longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period of time while you spin-up before you can provide some value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and make them your own before you can claim to be an asset to the firm.

But this time was different. With LogicMonitor’s  founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health.   A portal, by the way, is an individualized web site where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources from servers, storage and switches to applications, power, and load balancers just to name a few.  And YES, we use remote instances of LogicMonitor to watch our own infrastructure.  In SysA speak, we call this ‘eating our own dog food’.

As soon as I was given a login, I figured I’d kill two birds with one stone and familiarize myself with our infrastructure and see how our software worked. Starting at the top, I started looking at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device with an IP or hostname, tell it what it is, ( linux or windows host, Cisco or HP switch, etc) provide some credentials and ‘Voila!’ out comes stats and pretty graphs.  Before me on our portal was all the monitoring information one could wish for from a Cisco switch.

On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for months leading up to that time. For a sanity check, I looked at the same sensor in switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.

Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to switch the graph around to look at just the data range I wanted. I even added the temperature sensor’s output to a new dashboard view.  With with my new-found data, I shared a graph with Jeff and Steve, and asked, “Hey, guys, I’m seeing these erratic temperature’s on our switches in Virginia. Is this normal?”

Jeff took a 3 second glance, scowled, and said, “No, that’s not right!  Open a ticket with our datacenter ticket and have them look at that!”

That task was a little harder.  Convincing a datacenter operator they have a problem with their HVAC when all their systems are showing normal takes a little persistence. Armed with my graphs, I worked my way up the food-chain with our DC provider support staff. He checked the input and output air temperature of our cabinet, and verified there was no foreign objects disturbing air flow.  All good there.  We double-checked here that we hadn’t made any changes that would affect load on our systems and cause the temperature fluctuation.  No changes here.  But on a hunch, he changed a floor tile for one that allowed more air through to our cabinet. And behold, the result:

Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13.  I was poking around in LogicMonitor for the first time on Sept. 18th. (Literally, the FIRST TIME )  and created the ticket which got resolved on Friday Sept. 21. You can see the moment when the temps drop and go stable again after the new ventilation tile was fitted.  ( In case you’re wondering, you can click on the data sources on the bottom of the graph, and that will toggle their appearance on the graph. I  ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center )

Steve’s response to all this was, “Excellent! You’re providing value-add!  Maybe we’ll keep you. Now write a blog post about it!”

And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.

Or blogging.