I’ve been investigating a performance problem in a VM on one of our ESXi 5 clusters, and it led to an interesting discovery about the power-saving settings on the ESXi host. Under certain scenarios (and perhaps only with specific CPUs), the physical CPUs will be downclocked even though a VM is trying to use 100% of its CPU.
The physical hosts are HP DL385 G7 servers, each with two 12-core AMD Opteron 6174 processors at 2.2 GHz and 128 GB of RAM. They boot from an integrated SD flash card, and all other storage is provided by our Compellent SAN.
In the BIOS there are three key settings under Power Management Options:
HP Power Profile – This defaults to “Balanced Power and Performance” but I’ve changed it to “Maximum Performance”
HP Power Regulator – This defaults to “HP Dynamic Power Savings Mode” but changes automatically to “HP Static High Performance Mode” after changing the power profile setting
Advanced Power Management -> Minimum Processor Idle Power State – This defaults to “No C-states” and that is what we want it set to
The VM I’m testing with has 4 vCPUs and 8 GB of RAM assigned to it. It hosts a Lotus Domino server with some custom applications, and using those applications can drive CPU utilization within the VM to 100%.
Running the same processes over and over, we observed that each one took 50 to 150% longer with the BIOS power profile set to Balanced than with it set to Maximum Performance.
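As a rough back-of-the-envelope check, the snippet below converts that slowdown into the clock speed it would imply if clock speed were the only factor. It assumes (this is an assumption, not something I measured) that the job is purely CPU-bound, so its runtime scales inversely with clock speed. The function name `implied_clock_ghz` is just for illustration.

```python
# Estimate the effective clock speed implied by a measured slowdown.
# Assumption (not measured): the job is purely CPU-bound, so runtime
# scales inversely with clock speed. If part of the runtime is spent
# waiting on disk or network, the CPU-bound part must have slowed down
# even more, so the result is an upper bound on the effective clock.

NOMINAL_GHZ = 2.2  # rated clock of the Opteron 6174


def implied_clock_ghz(slowdown_pct: float, nominal_ghz: float = NOMINAL_GHZ) -> float:
    """Return the clock speed that would explain a runtime increase of slowdown_pct percent."""
    runtime_ratio = 1 + slowdown_pct / 100  # e.g. 50% longer -> 1.5x the runtime
    return nominal_ghz / runtime_ratio


for pct in (50, 100, 150):
    print(f"{pct:>3}% longer -> ~{implied_clock_ghz(pct):.2f} GHz effective")
```

Under that assumption, 50%, 100% and 150% longer runtimes correspond to roughly 1.47 GHz, 1.10 GHz and 0.88 GHz instead of the rated 2.2 GHz.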
What I believe is happening is this: while the VM is running at 100% CPU, it is only using 4 of the 12 cores in a single physical socket (4 of the 24 cores in the host), and the other VMs on this host all have light CPU loads. The host therefore perceives itself as lightly loaded and downclocks the CPU. As a result, our VM running at 100% CPU is not getting the full 2.2 GHz clock speed but some lesser amount, depending on how far the host has downclocked. Because that downclocking is dynamic, it would also account for the variance in performance we are seeing.
Searching around, I found other people running AMD Opteron 61xx series processors with VMware who have a similar issue. This may be specific to that processor line, since I don’t believe a CPU should dynamically lower its clock speed while a single core is fully utilized (rather than relying on the average load across all cores to decide whether to save power by downclocking).
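To make the difference between the two behaviours concrete, here is a toy model contrasting a policy that picks a clock speed from the average load across all cores with one that picks it from the busiest core. Everything in it is hypothetical: the load thresholds, the clock steps and the function names are illustrative only and do not describe how ESXi, the HP BIOS or the Opteron 6174 actually make their decisions.

```python
# Toy model of two ways a power-saving policy could choose a clock speed.
# HYPOTHETICAL: thresholds and clock steps are illustrative, not the real
# behaviour of ESXi, the HP BIOS or the Opteron 6174.

from typing import Sequence

# Hypothetical clock steps (GHz), fastest first, each paired with the
# minimum load (0.0 to 1.0) at which that step is selected.
STEPS = [(0.75, 2.2), (0.40, 1.6), (0.0, 0.8)]


def pick_clock(load: float) -> float:
    """Return the fastest clock step whose load threshold is met."""
    for threshold, ghz in STEPS:
        if load >= threshold:
            return ghz
    return STEPS[-1][1]  # fall back to the slowest step


def average_policy(core_loads: Sequence[float]) -> float:
    """Choose the clock from the mean utilisation across all cores."""
    return pick_clock(sum(core_loads) / len(core_loads))


def busiest_core_policy(core_loads: Sequence[float]) -> float:
    """Choose the clock from the most heavily loaded core."""
    return pick_clock(max(core_loads))


# Scenario from this post: 4 of 24 cores pegged at 100%, the rest lightly loaded
# (5% is an assumed figure for "light load").
host = [1.0] * 4 + [0.05] * 20

print(f"average load:        {sum(host) / len(host):.0%}")
print(f"average policy:      {average_policy(host)} GHz")
print(f"busiest-core policy: {busiest_core_policy(host)} GHz")
```

With four cores fully busy and twenty nearly idle, the host-wide average is only about 21%, so the average-based policy drops to its slowest step while the busiest-core policy stays at full speed. That gap is the behaviour I suspect we are running into.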
We have another cluster with AMD Opteron 6282 SE processors, and I plan to run some additional tests there to see whether the problem exists on that hardware as well. I’ll update this post once I’ve had a chance to do that.
For now, all of our hosts with 6174 processors have been set to Maximum Performance (unfortunately, at the cost of more power and heat).