A Collection of Random Thoughts: How to save your cluster from a failed quorum drive

Tuesday, July 25, 2006

How to save your cluster from a failed quorum drive

I first have to add a disclaimer that this cluster was a development system, so the data on it was of relatively little use. I also have to add that the configuration of this system in no way, shape, or form mirrors anything I would allow to be put into production.

Tip 1: Make sure that your quorum drive (assuming it is not a local quorum) is on a RAID volume. This particular cluster uses iSCSI as its SAN. When the iSCSI server was initially set up (not by me), no RAID was used. I knew that it would eventually have to be changed, but I kept putting it off because, well, I didn't want to have to redo the entire cluster. Of course, one of the hard drives in the system failed. After letting the cluster sit for a while (offline, of course), I decided to tackle it and get it fixed. The first priority was getting the iSCSI server fixed. This time it was set up properly, with a RAID5 array for the hosts to use.

Tip 2: Make sure that you have the Windows (2003) Resource Kit downloaded and installed. There are some awesome utilities included in the Resource Kit. If you've worked with Windows long enough, you'll remember that there used to be actual books and CDs for Resource Kit materials, and they weren't free! Thankfully, the Windows 2003 Resource Kit tools are freely downloadable. They came in VERY handy!

Armed with my knowledge, and my l337 searching skills, I began to search for how to fix my problem. In this case, I couldn't even start the cluster service. Attempting to do so generated an error indicating that the drive signature that was expected for the quorum couldn't be found. Duh. It wasn't coming back, either! Anyway, I came across several articles (Thanks to EventID.net) that helped me with this. The error I was seeing was event 1034 with a source of ClusSvc. Here are some of the resources I consulted.

http://technet2.microsoft.com/WindowsServer/en/Library/b410b421-78a5-4b3f-9247-e4f248f878fc1033.mspx?mfr=true

http://technet2.microsoft.com/WindowsServer/en/Library/e2e5674c-0625-4aba-afee-0c7057f8ac2e1033.mspx?mfr=true

http://technet2.microsoft.com/WindowsServer/en/library/3974d0c5-1c3f-4dce-921c-2859a8abd8ae1033.mspx?mfr=true

With the help of these articles, I was able to start the cluster service with the /fixquorum switch (from a command prompt, run net start clussvc /fixquorum), which lets the service start even though the quorum disk is missing. As anyone who works with clusters understands, until the cluster service is started, there really isn't much that can be done. With the cluster service started, I was then able to create a new physical disk resource, and then use the clusterrecovery tool, which is available in the aforementioned Windows 2003 Resource Kit utilities (free download). The clusterrecovery tool allows you to replace a failed quorum drive by choosing another physical disk resource to replace it with. Once it does its magic, it appends (lost) to the name of the old quorum resource and instructs you to delete it.

Of note is that once you are done with this maintenance, you need to restart the cluster service; otherwise it will continue running in the fixquorum state. Also of note here is that when replacing the quorum disk using clusterrecovery, ALL resources must be in an offline state (or failed, as they were in my case – hehe). Because of this, when you try to connect to the cluster, you must specify the physical node name as the cluster name, not the actual cluster virtual name (the Network Name resource is offline, so the cluster virtual name won't respond).
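The order of operations above can be sketched as a small Python script that only prints the steps and does not run anything. This is a rough sketch under assumptions, not part of the original recovery: the node name NODE1 is hypothetical, the cluster.exe line (listing resources while connecting by physical node name) and the net stop / net start pair used for the restart are illustrative additions, and the clusterrecovery steps are shown as manual because the tool was driven by hand.

```python
def fixquorum_plan(node_name):
    """Return the recovery steps in order as (description, argv) pairs.

    argv is None for steps that are carried out by hand in a GUI tool.
    node_name is the physical node name (hypothetical here), used because
    the cluster virtual name does not respond while resources are offline.
    """
    return [
        ("Start the cluster service in fixquorum mode",
         ["net", "start", "clussvc", "/fixquorum"]),
        # Assumption: cluster.exe is used to list resources by node name.
        ("List resources, connecting by physical node name",
         ["cluster", "/cluster:" + node_name, "resource"]),
        ("Create a new physical disk resource", None),
        ("Run clusterrecovery to replace the quorum, then delete the (lost) resource",
         None),
        # Assumption: the restart is done as a stop followed by a normal start.
        ("Stop the cluster service to leave fixquorum mode",
         ["net", "stop", "clussvc"]),
        ("Start the cluster service normally",
         ["net", "start", "clussvc"]),
    ]


def format_plan(plan):
    """Render each step as a numbered line with its command, if it has one."""
    lines = []
    for number, (description, argv) in enumerate(plan, start=1):
        command = " ".join(argv) if argv else "(manual step)"
        lines.append("%d. %s: %s" % (number, description, command))
    return lines


if __name__ == "__main__":
    for line in format_plan(fixquorum_plan("NODE1")):
        print(line)
```

Keeping the plan as data rather than executing it makes it easy to review the sequence before touching a broken cluster; the commands would still be typed at a command prompt on the node itself.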

With all of this done, I tested failing over to the other node to make sure that the resources all came online – they did. I then went about fixing the Exchange virtual server, which was much easier. I'm pretty sure that I've blogged about that in the past, so I'll skip it this time, other than to say that the physical disk the Exchange virtual server was using had also failed (NEVER use RAID0 on a production server!), so I went through the process of creating a new physical disk resource for that, then deleting and re-creating the Exchange System Attendant resource to fix the MS Search problems. I lost all the databases for that virtual server, but again, since it was a dev system, the data really didn't matter.

- posted by Ben Winzenz @ 4:46 PM

Comments:

Saved my bacon. Thanks.

# posted by Anonymous : 8:41 PM, June 07, 2007
