Microsoft Cluster Fun

Technology

I had an interesting experience recovering a single-node Windows Server 2008 R2 cluster running multiple MSSQL 2008 instances.  We suffered a power failure that caused the server to reboot, and after it came back up the cluster service would crash at start.

Initially the only thing to go on was a single entry in the System Event Log for Event ID 1573:

Node ‘Servername’ failed to form a cluster.  This was because the witness was not accessible.  Please ensure that the witness resource is online.

I checked on the quorum disk and it was there and marked as reserved, as expected.  Head scratching commenced for a bit.  I tried a reboot just to make sure and had the same issue.  I tried to start the service manually and had the same issue.  I did some googling on the error and chased down a few items that ended up not being anything.

I tried to start the service with the fixquorum flag with no result.  I also tried resetquorumlog, again with no result.

I discovered the cluster log command to generate a text file log of the cluster service which is when I finally started to make some progress:

Open a command prompt and run cluster log /g

This will output a file Cluster.log in C:\Windows\Cluster\Reports
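The generated log is large, so it helps to pull out only the error-level lines first.  Here is a minimal Python sketch of that idea.  It assumes every line follows the layout of the excerpts quoted in this post (process ID, thread ID, timestamp, level, message); the function name error_lines and the sample text are illustrative only, and reading the real file (path and encoding) is left out.

```python
import re

# Layout assumed from the log excerpts in this post:
# <process id>.<thread id>::<date>-<time> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<stamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def error_lines(log_text):
    """Return (timestamp, message) pairs for every line logged at ERR level."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERR":
            found.append((match.group("stamp"), match.group("message")))
    return found


# Sample text built from the lines quoted in this post (straight quotes used).
sample = """\
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] rcm::RcmAgent::Unload()
00000990.00000cf8::2012/11/14-13:18:25.643 ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): ERROR_FILE_NOT_FOUND(2)' because of 'OpenSubKey failed.'
"""

for stamp, message in error_lines(sample):
    print(stamp, message)
```

Run against the text of a real Cluster.log, this would narrow thousands of lines down to the handful worth reading first.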

On initial review of the log I found:

00000990.00000cf8::2012/11/14-13:18:25.643 ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): ERROR_FILE_NOT_FOUND(2)’ because of ‘OpenSubKey failed.’

Which told me there was something wrong in the registry hive for the cluster.  The hive for the cluster is located in C:\Windows\Cluster and is a file called CLUSDB.  This file is automatically expanded and loaded under HKLM in the registry when the cluster service starts.  It was during this process that the service was crashing out, so something was corrupted or wrong in the file.

My first attempt at a fix was to recover the CLUSDB file from a midnight snapshot taken about 3 hours prior to the power issue that caused the reboot.  Unfortunately this did not solve the problem, which made me realize that something had changed or corrupted the file prior to the reboot and it just didn’t show itself until the reboot.  I went back to the Cluster.log file to see if I could find any more information.  I was regenerating the Cluster.log file (cluster log /g) after each attempt to start the service to see if anything was changing, and I noticed something common to each startup:

000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] Resource ‘SQL Server (INSTANCENAME)’ is hosted in a separate monitor.
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] rcm::RcmAgent::Unload()
000014b0.00000cf8::2012/11/14-13:25:39.708 INFO  Shutdown lock acquired, proceeding with shutdown

On each startup it would fail after the same INSTANCENAME and start to shut down the service.  I knew there should have been more resources listed, which meant the problem was probably with the resource that came right after the last INSTANCENAME noted in the log.
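The same check can be done mechanically: find the last resource the log mentions before the shutdown message.  The sketch below is only an illustration of that reasoning.  It assumes the "[RCM] Resource ... is hosted" and "Shutdown lock acquired" wording shown in the excerpt above, and the function name is made up for this example.

```python
import re

# Matches the "[RCM] Resource '<name>' is hosted ..." lines shown above.
# Both straight and curly quotes are accepted around the resource name.
RESOURCE_RE = re.compile(
    r"\[RCM\] Resource ['\u2018](?P<name>.+?)['\u2019] is hosted"
)


def last_resource_before_shutdown(log_text):
    """Return the last resource named before the first shutdown message.

    Returns None if no shutdown message is found, or if the shutdown
    happens before any resource is mentioned.
    """
    last_seen = None
    for line in log_text.splitlines():
        match = RESOURCE_RE.search(line)
        if match:
            last_seen = match.group("name")
        elif "Shutdown lock acquired" in line:
            return last_seen
    return None


sample = """\
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] Resource 'SQL Server (INSTANCENAME)' is hosted in a separate monitor.
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] rcm::RcmAgent::Unload()
000014b0.00000cf8::2012/11/14-13:25:39.708 INFO  Shutdown lock acquired, proceeding with shutdown
"""

print(last_resource_before_shutdown(sample))
```

If the log covers several start attempts, this only reports the first one, which is enough when every attempt fails at the same place.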

With the cluster service stopped (so it wouldn’t try to restart and the hive wouldn’t be loaded) I launched regedit.  I navigated to HKLM, did a File->Load Hive, selected the CLUSDB file in C:\Windows\Cluster and gave it the name “Cluster” when prompted.  I then expanded the new Cluster folder and then the resource folder and started to go through the list.  I quickly realized the order of resources in the folder matched the order in which they were noted in the Cluster.log file.

The resource that came right after the INSTANCENAME last noted in the Cluster.log was the Available Storage resource.  Looking at the keys for that resource, I saw that its “contains” key listed other resource IDs, which should be the storage resources in the Available Storage group, except I knew that there shouldn’t be any.  I made note of the two resource IDs in the contains key and went through the rest of the resources to make sure they didn’t actually exist, and they didn’t.

I then went back to the contains key for the Available Storage resource and removed the two entries.  Finally I highlighted the Cluster folder under HKLM, unloaded the hive (File->Unload Hive) and closed out regedit.  I started up the cluster service manually and this time everything started up correctly.
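The manual cross-check I did in regedit boils down to a simple comparison: every ID listed in the contains key should also exist as a resource.  Here is a small Python sketch of that comparison on plain lists.  The IDs are made-up placeholders, the function name is hypothetical, and comparing IDs case-insensitively is my assumption; it does not read the registry itself.

```python
def dangling_ids(contains, resource_ids):
    """Return the IDs in `contains` that have no matching resource entry.

    IDs are compared case-insensitively (an assumption for this sketch).
    """
    known = {rid.lower() for rid in resource_ids}
    return [rid for rid in contains if rid.lower() not in known]


# Placeholder IDs for illustration, not taken from the real cluster.
resources = ["id-sql-1", "id-sql-2", "id-quorum"]
contains = ["id-old-disk-1", "id-old-disk-2"]

print(dangling_ids(contains, resources))
```

Anything this returns is a reference to a resource that no longer exists, which is what the two leftover entries were in my case.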

So what happened?

Roughly 2 weeks prior to this outage an instance had been removed from the cluster.  It had 4 storage devices associated with it, which were initially moved to the Available Storage group after being removed from the instance group and then were deleted as disks from the cluster.  Apparently this process (done via the failover cluster GUI) failed to fully remove 2 of the 4 objects from the registry.  I’ve found a few other people suggesting to always use the command line cluster program to remove resources to be extra safe, which I plan to do from now on.  The problem did not show up until the next time the cluster service restarted.

Cinnamon Toasted Oven Roasted Whole Chicken

Recipes

I found this recipe on www.paleo-project.com.  Unfortunately the link on that site is now dead, but that is where the credit goes.  I made this for the first time tonight and it turned out awesome.  The chicken was extremely moist and very flavorful.  I made just a few changes, which really came down to increasing the amount of spices in the rub as it felt a little light in the original recipe.

Ingredients:

  • 1 Whole Chicken (around 4 lbs)
  • 2 tbsp honey
  • 2 tsp salt
  • 1 tsp ground nutmeg
  • 1 tsp ground cloves
  • 1 tsp allspice
  • 1 tsp ground cinnamon
  • 5 garlic cloves (crushed)
  • 1 tsp whole cloves
  • 1 tsp coconut or olive oil

Recipe:

  • Mix together the salt, nutmeg, ground cloves, allspice and cinnamon in a small bowl
  • Cover a baking pan with aluminum foil (I used a cookie sheet with a lip high enough to hold the juices)
  • Grease the spot where the chicken will rest with the oil
  • Preheat oven to 500 F
  • Rinse chicken in cool water and pat dry (don’t forget to remove the extra bits often in a plastic bag inside the cavity!)
  • Put the crushed garlic and whole cloves inside the cavity of the chicken
  • Spread the honey all over the outside of the chicken
  • Shake or rub the spice mixture all over the outside of the chicken
  • Place the chicken on the pan breast side up
  • Cook for 15 minutes at 500 F
  • Lower the temperature to 450 F and cook for another 15 minutes
  • Lower the temperature to 425 F, baste the chicken with liquid from the pan and cook for another 30 minutes or until the chicken is around 180 F in the breast
  • Remove the chicken from the oven and allow to rest for 20 minutes before serving

 


Jewel Studded Salmon with Cilantro Cream Cheese

Recipes

While I haven’t made this in a little while I wanted this to be the first recipe I shared simply because of the picture.  The recipe itself I found originally on www.grouprecipes.com and my picture is on that page as well which I posted the first time I made it.  This is not a Paleo (how I’m eating now) recipe (cream cheese) but would probably qualify as Primal (how I will be eating in a few more weeks).  I’ll post more on the subject of Paleo and Primal eating soon but here is the recipe:

Ingredients

  • 2 large salmon filets with skin removed
  • 2 cloves garlic
  • 4 tbsp softened cream cheese
  • 4 stalks of fresh cilantro, chopped
  • 1 tsp kosher salt

Instructions

  • Peel and mince the garlic
  • Mix the garlic, cilantro and cream cheese together in a bowl
  • Butterfly the salmon filets (if the filets are not thick enough to butterfly easily you can also just cut a line down the middle of the filet not quite all the way through and then fold the filet in half to be just like butterflied)
  • Spread the cream cheese mixture inside of each filet and fold closed (cream cheese should be on the inside)
  • Place on a hot oiled grill and cook for about 6 minutes on each side
  • Sprinkle with kosher salt on both sides while grilling

Results


When power saving is not your friend

Technology

I’ve been investigating a performance problem in a VM on one of our ESXi 5 clusters that led to an interesting discovery about power saving settings on the ESXi host.  Basically, under certain scenarios (and perhaps with specific CPUs) the physical CPUs will be down clocked even though a VM is trying to use 100% of its CPU.

The physical host servers are HP DL385 G7 with 2 AMD Opteron 6174 12 core processors @ 2.2GHz and 128 GB of RAM.  They boot from an integrated SD Flash card and all other storage is provided by our Compellent SAN.

In the BIOS there are 3 key settings under the Power Management Options:

HP Power Profile – This defaults to “Balanced Power and Performance” but I’ve changed it to “Maximum Performance”

HP Power Regulator – This defaults to “HP Dynamic Power Savings Mode” but changes automatically to “HP Static High Performance Mode” after changing the power profile setting

Advanced Power Management -> Minimum Processor Idle Power State – This defaults to “No C-states” and that is what we want it set to

The VM I’m testing with has 4 vCPU and 8GB RAM assigned to it.  This VM is the host for a Lotus Domino server with some custom applications.  When the application is used it can cause the CPU to go to 100% utilization within the VM.

From testing the same processes over and over we observed that each process would take 50-150% longer to run with the BIOS set to Balanced vs having it set to Max.

What I believe is happening is that while the VM is running at 100% CPU it is only using 4 of the 12 cores of a single physical socket (and 4 of 24 total in the host), and the other VMs on this host are all under light CPU load, so the physical host perceives itself to be lightly loaded and down clocks the CPU.  So our VM running at 100% CPU is not getting 2.2GHz of clock speed but some lesser amount, depending on how much down clocking the host has done.  Since that down clocking is dynamic, that would also account for the performance variance we are seeing.
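To put rough numbers on that theory, here is a back-of-the-envelope calculation in Python.  It is only a sketch: it ignores the light load from the other VMs, and it assumes run time scales inversely with clock speed, which is a simplification rather than something we measured.  The function names are made up for this example.

```python
TOTAL_CORES = 24       # 2 sockets x 12 cores, from the host spec above
BUSY_VCPUS = 4         # the test VM running at 100%
BASE_CLOCK_GHZ = 2.2   # rated clock of the Opteron 6174


def host_utilization(busy, total):
    """Fraction of the host's cores kept busy by the one loaded VM."""
    return busy / total


def implied_clock(base_ghz, slowdown_pct):
    """Clock speed that would explain a run taking slowdown_pct % longer.

    Assumes run time is inversely proportional to clock speed (a
    simplification made for this sketch).
    """
    return base_ghz / (1 + slowdown_pct / 100)


print(f"Host-wide load: {host_utilization(BUSY_VCPUS, TOTAL_CORES):.1%}")
for pct in (50, 150):
    print(f"{pct}% longer -> about {implied_clock(BASE_CLOCK_GHZ, pct):.2f} GHz")
```

Under those assumptions the host sees roughly 17% load, and the slowdowns we measured would correspond to effective clocks of about 1.47 GHz down to about 0.88 GHz.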

In googling around I’ve found other people using the AMD Opteron 61xx series processors with VMware having a similar issue.  It’s possible this is just an issue with that line, as I don’t believe a CPU should slow the clock speed dynamically when a single core is being used completely.  The decision to down clock shouldn’t rest only on an average load across all cores.

We have another cluster that uses AMD Opteron 6282 SE processors, and I plan to do some additional testing there to see if the problem exists on those as well.  I’ll update this post once I’ve had a chance to do that.

For now all of our hosts using the 6174 processors have been set to force max performance (more power and heat unfortunately).


Relaunch

Personal

I’ve finally taken the time to rebuild and relaunch my website.  I’m going for more of a mixture this time and hopefully this version will have some staying power.  In addition to technology and other geek related posts I plan to bring in another aspect I enjoy which is cooking through recipes, pictures and other food commentary.
