I was working on a specific issue with a client’s Exchange 2010 setup. The client had a multi-site stretched DAG across two datacenters: a “primary” datacenter and a second, disaster recovery datacenter. The primary datacenter had two mailbox servers, MBX01 and MBX02, each with a copy of the database, and the secondary had a single mailbox server, DRMBX03, with a copy of the database.
The client was experiencing an issue where databases would suddenly fail over within the primary site from MBX01 to MBX02, and the cluster would report that it had lost quorum.
I took a look at the cluster log, which can be generated by running the command:
Cluster log /g /copy:LogFolder /span:120
The /span value specifies how many minutes back from the current time the generated log should cover. Just a hint: if you run this from the root of the C: drive, it will copy the logs to the C:\LogFolder location.
Within that folder you’ll find a separate log for each of the servers in the DAG, in our case MBX01, MBX02 and DRMBX03.
When I opened the logs, I began to see 1226 and 1236 errors.
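If you would rather not scroll through each log by hand, a small script can count the lines that mention those codes. This is only a rough sketch in Python, not part of the original troubleshooting: the function name `count_error_lines` is made up for illustration, and it assumes the copied logs end in `.log` and that the codes show up as standalone numbers on a line.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: the error codes appear as standalone numbers in the log text.
ERROR_CODES = ("1226", "1236")
CODE_PATTERN = re.compile(r"\b(" + "|".join(ERROR_CODES) + r")\b")


def count_error_lines(log_folder):
    """Return {log file name: Counter({code: number of lines mentioning it})}."""
    results = {}
    # Assumption: every copied cluster log in the folder has a .log extension.
    for log_path in sorted(Path(log_folder).glob("*.log")):
        counts = Counter()
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(log_path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # Count each code at most once per line.
                for code in set(CODE_PATTERN.findall(line)):
                    counts[code] += 1
        results[log_path.name] = counts
    return results
```

Calling `count_error_lines(r"C:\LogFolder")` would give one counter per node log, which makes it easy to see which node logged the errors most often. Treat a hit as a pointer to go read that line, since a bare number can match things other than an error code.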
These errors are specifically handled by the following hotfix for 2008 R2 Failover Clusters (which is what Exchange 2010 runs on top of):
These were recently released and talked about by the Exchange Team, along with these two other hotfixes:
After applying each of these hotfixes to EACH and EVERY DAG node and rebooting, the issue was resolved. These hotfixes are recommended for ANYONE running Exchange 2010 on 2008 R2, regardless of whether you’re seeing the issues or not.