A Tale of Two Metrics: Windows CPU or vCenter VM CPU

A common question from our customers, and even from our own support people, is “Why does monitoring a Windows system running on VMware report different CPU data than monitoring the virtual machine from the ESXi host? The ESX monitoring must be wrong!”

For example, here is LogicMonitor graphing the CPU load of a Windows system running as a virtual machine on ESXi. In this case, the CPU data is gathered via WMI, by querying the Windows OS:

CPU load of a Windows system running as a Virtual Machine on ESXi

Here is the same machine at the same time, but this is how ESXi sees the load:

How ESXi sees CPU load of a Windows system running as a Virtual Machine

So which view is right? Why do they differ? Which should you pay attention to?

Both views are right. This guest virtual machine was a Windows system with four vCPUs. I was running HyperPi, set to use three CPUs. So from the point of view of Windows (the top graph), it had three CPUs running at 100%, for an average of about 75% of the total system, which is what the top graph shows.

However, from the point of view of ESXi, this was not the only guest using those four CPUs. The hypervisor has limited CPU resources, and it was sharing them with quite a few other systems. Looking at a graph of just the top 10 VMs by CPU usage shows that when the system under test (LMNOD1) increases its CPU demands, it takes CPU resources from other systems:

Graph of Top 10 VMs by CPU usage

This explains why, even though the Windows system was provisioned with four vCPUs, it was only able to get slightly less than 50% of the capacity of those four virtual CPUs.

So why did Windows think it was getting more CPU than ESXi was giving it?  Because it doesn’t know it’s virtualized.

There are times when the guest OS (Windows Perfmon, etc.) will show lower CPU usage than VMware reports. The guest doesn’t know anything about the CPU used to virtualize the hardware resources it is requesting. ESXi does, and accurately attributes that load. Comparing the top two graphs, you can see that outside the period of the load test, Windows reports slightly lower CPU usage than ESXi does.

There are also times when the guest OS will report higher CPU usage than the hypervisor will, as in the period when the load test was running above. Windows doesn’t know that part of the time, the CPUs are being taken away from it and given to other guest machines. As far as it knows, it is using all the CPU it can on three of the four CPUs, so that must equal 75% load. It doesn’t know that part of the time it has no access to the CPUs because they are being used elsewhere, so the real CPU usage is only about 50%.
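The gap between the two views during the load test can be illustrated with a deliberately simplified model. This is a sketch, not how Windows or ESXi actually computes its metrics: the function names and the `scheduled_fraction` parameter are hypothetical, and the value 2/3 is picked only to roughly match the numbers in this example. The model also ignores virtualization overhead, which is what makes ESXi report slightly more than the guest outside the load test.

```python
def guest_view_pct(per_vcpu_busy_pct):
    """Average busy percentage across vCPUs, as the guest OS sees it."""
    if not per_vcpu_busy_pct:
        raise ValueError("need at least one vCPU")
    return sum(per_vcpu_busy_pct) / len(per_vcpu_busy_pct)


def host_view_pct(per_vcpu_busy_pct, scheduled_fraction):
    """Rough hypervisor-side usage for the same VM.

    scheduled_fraction is a hypothetical parameter: the share of
    wall-clock time the VM's vCPUs were actually running on physical
    CPUs rather than waiting for one.
    """
    if not 0.0 <= scheduled_fraction <= 1.0:
        raise ValueError("scheduled_fraction must be between 0 and 1")
    return guest_view_pct(per_vcpu_busy_pct) * scheduled_fraction


# Three of four vCPUs pegged at 100%, as in the HyperPi test.
busy = [100.0, 100.0, 100.0, 0.0]
guest = guest_view_pct(busy)
# Assumed for illustration: the VM only held physical CPUs ~2/3 of the time.
host = host_view_pct(busy, scheduled_fraction=2 / 3)
print(f"guest view: {guest:.0f}%  host view: {host:.0f}%")
```

Under these assumptions the guest reports about 75% while the hypervisor attributes only about 50% to the same VM, which is the pattern in the two graphs above.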

There are other reasons that the internal measurements can differ, such as clock skew and adjustments, but suffice it to say that if you are looking for an absolute view of how CPUs are being used, you cannot trust the guest’s view of itself.

Does that mean the guest’s view is meaningless? Far from it. If you are monitoring the guest itself with LogicMonitor, or using Perfmon or Task Manager on it directly, and see high CPU usage, this is still a valid indicator that the guest is trying to do a lot of work and is running out of CPU resources. It does not matter whether it is running out because it is actually using all the capacity of the physical CPU (as would be the case on a standalone machine running at 100%) or because it is unknowingly sharing CPU with other virtual systems. The relevant fact remains that the system is using all the CPU it can. It may warrant adjusting CPU resources, reservations or shares, or investigating the workload.

Is the ESXi host’s view of CPU important, then? It is if you are the vCenter administrator and want to know which systems are actually using real CPU resources. (Overview graphs like the one shown above are very helpful for investigating these issues quickly.) But you cannot use the ESXi host’s view of the CPU load to tell whether a system wants more than the resources allocated to it, only how much of the allocated resources are being used. And if other systems are competing for the same resources, any one guest may not get the resources it has been allocated. If you are looking at ESXi, a better way to tell whether guests want more resources is to look at CPU Ready, which means that the guest had work ready to schedule on a CPU, but no CPU resource was available. The graph for the VM above, during the period of the load test, clearly shows a big spike in CPU Ready time:
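When reading CPU Ready from vCenter, note that it is typically reported as a summation in milliseconds per sampling interval rather than as a percentage. The sketch below shows the commonly used conversion, assuming the 20-second interval of vCenter's real-time charts; the function name and the sample value of 4000 ms are hypothetical, and longer-interval historical charts would need a different `interval_s`.

```python
def cpu_ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert a CPU Ready summation (milliseconds) to a percentage.

    ready_ms   -- CPU Ready summation value for one sampling interval
    interval_s -- sampling interval in seconds (20 assumed here, the
                  real-time chart interval)
    num_vcpus  -- divide a VM-wide total by its vCPU count to get an
                  average per-vCPU figure
    """
    if interval_s <= 0 or num_vcpus <= 0:
        raise ValueError("interval_s and num_vcpus must be positive")
    return ready_ms * 100 / (interval_s * 1000 * num_vcpus)


# Hypothetical sample: 4000 ms of ready time in one 20-second interval.
print(cpu_ready_percent(4000))               # VM-wide total
print(cpu_ready_percent(4000, num_vcpus=4))  # averaged over four vCPUs
```

Dividing by the vCPU count matters when comparing VMs of different sizes: the same summation value means far less waiting per vCPU on a four-vCPU guest than on a single-vCPU one.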

Graph of CPU Ready on a virtual machine

So, whether you are looking at vCenter, Perfmon, or the LogicMonitor view of the ESXi host or the guest, you need to understand what you are looking at, how it matters, and why the different views don’t agree.

(Note: with VMware Tools installed, Perfmon in the guest can show you the accurate CPU usage under the “VM Processor” counter. This will be the same CPU usage you see in vCenter, or in LogicMonitor from monitoring ESXi or vCenter. But you still need to know what it means and how to interpret it.)