I had an interesting experience recovering a single-node Windows Server 2008 R2 cluster running multiple SQL Server 2008 instances. We suffered a power failure that caused the server to reboot, and after coming back up the cluster service would crash at startup.
Initially the only thing to go on was a single entry in the System Event Log for Event ID 1573:
Node ‘Servername’ failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online.
I checked on the quorum disk and it was there and marked as reserved, as expected. Head scratching commenced for a bit. I tried a reboot just to make sure and had the same issue, then tried to manually start the service with the same result. Some googling on the error chased down a few leads that ended up not being anything.
I tried to start the service with the fixquorum flag with no result, and also tried resetquorumlog, again with no result.
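For reference, these startup switches are typically passed to the cluster service like so (a sketch; exact switch support varies by Windows version, and resetquorumlog dates back to the 2003-era cluster service):

```bat
:: Start the cluster service with the fixquorum switch
net start clussvc /fixquorum

:: Older quorum log reset switch; may not apply on 2008 R2
net start clussvc /resetquorumlog
```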
I then discovered the cluster log command, which generates a text file dump of the cluster service log, and this is when I finally started to make some progress:
Open a command prompt and run cluster log /g
This will output a file named Cluster.log to C:\Windows\Cluster\Reports
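In command form:

```bat
:: Generate Cluster.log for the local node
cluster log /g

:: The log is written here by default
dir C:\Windows\Cluster\Reports\Cluster.log
```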
On initial review of the log I found:
00000990.00000cf8::2012/11/14-13:18:25.643 ERR mscs::QuorumAgent::FormLeaderWorker::operator (): ERROR_FILE_NOT_FOUND(2)' because of 'OpenSubKey failed.'
This told me there was something wrong in the registry hive for the cluster. The hive for the cluster is a file called CLUSDB, located in C:\Windows\Cluster. This file is automatically loaded under HKLM\Cluster in the registry when the cluster service starts. It was during this load that the service was crashing out, so something in the file was corrupted or wrong.
My first attempt at a fix was to recover the CLUSDB file from a midnight snapshot taken about 3 hours prior to the power issue that caused the reboot. Unfortunately this did not solve the problem, which made me realize that something had changed or corrupted the file prior to the reboot and it just didn't show itself until the restart. I went back to the Cluster.log file to see if I could find any more information. I was regenerating the Cluster.log file (cluster log /g) after each attempt to start the service to see if anything was changing, and I noticed something common with each startup:
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG [RCM] Resource 'SQL Server (INSTANCENAME)' is hosted in a separate monitor.
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG [RCM] rcm::RcmAgent::Unload()
000014b0.00000cf8::2012/11/14-13:25:39.708 INFO Shutdown lock acquired, proceeding with shutdown
On each startup it would fail after the same INSTANCENAME and begin shutting down the service, but I knew there should have been more resources listed, which meant the problem was likely with the resource right after the last INSTANCENAME noted in the log.
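To spot where the startup stops, a quick filter over the generated log helps (findstr is standard on Windows; INSTANCENAME here stands in for the real instance name):

```bat
:: List the resources the service touched, in startup order
findstr /c:"[RCM] Resource" C:\Windows\Cluster\Reports\Cluster.log
```

The last resource printed before the shutdown lines is the best candidate for the one causing the crash.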
With the cluster service stopped (so it wouldn't try to restart and the hive wouldn't be loaded) I launched regedit and did the following:

1. Highlighted HKLM, chose File -> Load Hive, selected the CLUSDB file in C:\Windows\Cluster, and gave it the name "Cluster" when prompted.
2. Expanded the new Cluster key, then the Resources key, and started going through the list. I quickly realized the order of resources in the key matched how they were being noted in the Cluster.log file.
3. The resource immediately after the last INSTANCENAME noted in Cluster.log was the Available Storage resource. Looking at its values, I saw it had other resource IDs listed in the "contains" value. These should be storage resources that were in the Available Storage group, except I knew there shouldn't be any.
4. I made note of the two resource IDs in the contains value and went through the rest of the resources to make sure they didn't actually exist, and they didn't.
5. I went back to the contains value for the Available Storage resource, edited it, and removed the two entries.
6. I highlighted the Cluster key under HKLM, unloaded the hive (File -> Unload Hive), and closed out regedit.

I started up the cluster service manually and this time everything started up correctly.
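The same inspection can be done from the command line with reg.exe (a sketch; the mount name ClusterRecovery and the {GUID} subkey are placeholders, and the exact subkey layout inside CLUSDB may differ):

```bat
:: Load the cluster hive under a temporary mount point (cluster service must be stopped)
reg load HKLM\ClusterRecovery C:\Windows\Cluster\CLUSDB

:: Inspect a resource's "contains" value ({GUID} is a placeholder resource ID)
reg query HKLM\ClusterRecovery\Resources\{GUID} /v contains

:: Unload the hive when done so the cluster service can load it again
reg unload HKLM\ClusterRecovery
```

Editing via regedit as described above works just as well; the point is that the hive must be unloaded again before the cluster service starts.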
So what happened?
Roughly two weeks prior to this outage an instance had been removed from the cluster. It had four storage devices associated with it, which were initially moved to the Available Storage group after being removed from the instance's group and then deleted as disks from the cluster. Apparently this process (done via the Failover Cluster GUI) failed to fully remove two of the four objects from the registry. I've found a few other people suggesting to always use the command-line cluster program to remove resources to be extra safe, which I plan to do from now on. The problem did not show up until the next time the cluster service restarted.
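For reference, removing a disk resource with the command-line tool looks something like this (the resource name "Cluster Disk 5" is a placeholder; check the actual name first):

```bat
:: List all cluster resources to get the exact resource name
cluster res

:: Delete the disk resource by name
cluster res "Cluster Disk 5" /delete
```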