Friday, December 23, 2005
Sometimes the easiest problems are the most difficult
I work a lot with our developers and QA folks. Heck, I maintain all of their environments, so I have to work pretty closely with them. Most of the problems that occur are fairly easy to fix, but I had one crop up today which absolutely stumped me for a while.
The environment is a 2-node Exchange 2003 Active/Passive cluster. It was reported to me that all of a sudden after a failover, something in the cluster wasn't working, and it was preventing QA work on that cluster. Because one of the cluster resources was failing, they couldn't use our product to manage/fail over the cluster. OK, I say, I'll have a look.
So, I RDP to the server and look at Cluster Admin, and I find that the Message Transfer Agent resource is failing to come online. Let's see what happens if I just try to bring it online. Bzzzzt - thanks for playing. OK, so I check the event logs. I see an Event ID 9400 for the Exchange Virtual Server saying that it is unable to open a particular file. That's odd. OK, there has to be a KB article relating to this. Sure enough, I find http://support.microsoft.com/default.aspx?scid=kb;en-us;154385, which tells me that the file may be missing or corrupt.
So, I look in the c:\program files\exchsrvr\mtadata directory, thinking that I'll find that the specified file isn't there and it will be an easy fix. Well, it's present in the directory, so I decide to replace it. I copied it from the CD, removed the read-only attribute (that is important!) and tried to bring the resource online. No joy. I went around and around for a while, even replacing ALL of the files in the mtadata directory from the CD. It still didn't work.
I put it on hold for a while and decided to look some more this evening. Oddly enough, it was a KB article written for Exchange 4.0, of all things, that allowed me to figure out the problem. http://support.microsoft.com/default.aspx?scid=kb;en-us;162384 walks through a series of steps for troubleshooting MTA failures. One part of the article talks about checking the Parameters key in the registry associated with the MTA. HKLM\System\CurrentControlSet\Services\MSExchangeMTA\Parameters is where it's at. As soon as I looked at that, I spotted 2 values that didn't look right. The MTA Database path and MTA Run directory both pointed to another drive, not to the c:\program files\exchsrvr\mtadata directory I had been fixing all along.
When I looked at the directory that those 2 values pointed to, I noticed that some files were missing. At that point I had neither the time nor the desire to find out how they went missing, but I was able to solve the problem by copying the files from the c:\program files\exchsrvr\mtadata directory to the directory specified in the registry. Once I did that, I was able to bring the MTA resource online, and I was able to successfully fail the cluster over to the other node and have the MTA resource come online there as well.
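For what it's worth, the comparison I did by eye could be scripted. This is only a rough sketch in Python, not what I actually ran that night: the function name missing_files is made up for illustration, and it simply lists files that exist in a reference directory but not in the directory the registry points to. Names are compared case-insensitively, since that is how Windows treats file names.

```python
import os


def missing_files(reference_dir, target_dir):
    """Return sorted names of files found in reference_dir but not in target_dir.

    Hypothetical helper for illustration only. Names are compared
    case-insensitively, matching how Windows treats file names.
    """
    # Only regular files in the reference directory are considered.
    reference = [
        name for name in os.listdir(reference_dir)
        if os.path.isfile(os.path.join(reference_dir, name))
    ]
    # A target directory that does not exist counts as having no files.
    if os.path.isdir(target_dir):
        present = {name.lower() for name in os.listdir(target_dir)}
    else:
        present = set()
    return sorted(name for name in reference if name.lower() not in present)
```

You would call it with the default mtadata directory as the reference and whatever path the MTA Database path or MTA Run directory value holds as the target, for example missing_files(r"c:\program files\exchsrvr\mtadata", r"E:\exchsrvr\mtadata"). The second path there is a placeholder; use the one from your own registry.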
Lesson learned: Don't forget to check the basic stuff. If I had thought to check the registry first, I probably would have saved a lot of time trying to figure out this problem.