How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 1

 

Exchange 2010 introduced several high availability and disaster recovery features; the one that receives the most publicity is the Database Availability Group (DAG).  In short, a DAG replicates mailbox databases to other servers in the DAG, where a copy can be activated automatically within 30 seconds, restoring users’ access to their mailboxes.  For more information, see my article series on DAGs here.

The automatic failover is great for High Availability within a datacenter, or even across datacenters.  For instance, consider the following diagram:

[Diagram: DB]

Here, the green copies are the Active Copies; they are the ones users are actually accessing for their mailboxes.  The yellow and the red are copies that can be activated should the Active Copy go offline.  Suppose MDB01 on server NYMB01 goes offline: the copy on NYMB02 would be activated automatically within 30 seconds.  Next, the drive holding the database MDB01 on server NYMB02 fails, causing THIS copy to go offline as well.  In this case, the copy of MDB01 on DRMB01 in Boston would be activated within 30 seconds, and users would be able to access their mailboxes across the WAN link to Boston!  This is all part of the design of the DAG, and is great from a High Availability standpoint.
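The failover chain described above can be sketched in a few lines of Python.  This is only a toy model under my own assumptions: the function name `next_active_copy` and the fixed preference list are hypothetical, and the real selection logic in Exchange weighs more than a fixed order.  It simply shows the idea of walking down the list of copies until a healthy one is found.

```python
from typing import List, Optional, Set

# Hypothetical model: servers holding a copy of MDB01, in the order the
# article's example activates them (names taken from the diagram).
MDB01_COPIES = ["NYMB01", "NYMB02", "DRMB01"]


def next_active_copy(copies: List[str], failed: Set[str]) -> Optional[str]:
    """Return the first server in the list whose copy has not failed."""
    for server in copies:
        if server not in failed:
            return server
    # Every copy is offline, so nothing can be activated.
    return None


# NYMB01 goes offline first, then the drive on NYMB02 fails.
after_first_failure = next_active_copy(MDB01_COPIES, {"NYMB01"})
after_second_failure = next_active_copy(MDB01_COPIES, {"NYMB01", "NYMB02"})
```

With both NY copies marked as failed, the only remaining candidate is the copy on DRMB01 in Boston, which matches the scenario in the diagram.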

But, as we know, High Availability and Disaster Recovery are COMPLETELY separate.  High Availability means providing your users with high uptime, or access to the application.  Disaster Recovery is the ability of the application to function when a catastrophic event happens, such as the destruction of a datacenter or, worse, of the building holding the datacenter.  This last part is what we will cover in these articles.  To do so, there is a feature of DAGs that we need to talk about: Datacenter Activation Coordination mode, or DAC.

DAC is a setting on a DAG that has three or more member mailbox servers extended across multiple sites.  A higher-level view of our Exchange environment is below:

[Diagram: How to Manage a Datacenter Failure 2010]

NYMB01, NYMB02 and DRMB01 are all part of the same DAG, let’s call it DAG1, and all these servers are located in the NYGIANTS.COM domain.

Now, our DAG fits the criteria for DAC mode, which is three or more member servers, spread across multiple Active Directory Sites.  So, now what is DAC mode?

DAC mode is, quite simply, a mechanism to prevent the possibility of a split brain in your Exchange environment.  Consider the following scenario.  As per the first diagram, you have MDB01 and MDB02, and both active copies run in NY, on NYMB01 and NYMB02 respectively.  NYHT01 is running the file share witness.  A file share witness is a server that only participates if the DAG has an even number of servers; it’s used to “break” any tie in voting on whether a server is down or not.  The NY site is connected via a WAN connection to Boston, where DRMB01 hosts replicas of MDB01 and MDB02.

Say the WAN connection is cut and, for whatever reason, NY and Boston can no longer communicate, but neither side is truly offline.  The Boston side, since it can no longer connect to the NY servers, assumes they are down, mounts the database copies it has of MDB01 and MDB02, and marks them as active.  Since NY is still operational, it STILL has its copies of MDB01 and MDB02 mounted and active.  This is a split brain scenario: both sites believe that they are the rightful owner of the databases, and have thus mounted their respective DBs.

This would cause a divergence in data.  For example, if an outside user sends an email to a user at nygiants.com, and it’s received in NY, it would get delivered to his mailbox in NY.  If another user sends the same user at nygiants.com an email, and it gets received by Boston, it would get delivered to that user’s mailbox in Boston.  The two mailboxes now differ, which is a huge problem.  This is the issue with a split brain scenario, and it is what DAC was built to protect against.
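To make the divergence concrete, here is a small Python sketch.  It is not how Exchange stores mail; the `deliver` function and the dictionary of mounted copies are hypothetical stand-ins I made up to show what happens when two sites each accept mail into their own copy of the same database.

```python
from typing import Dict, List


def deliver(mounted_copies: Dict[str, List[str]], receiving_site: str, message: str) -> None:
    """Deliver a message to the database copy mounted at the receiving site."""
    mounted_copies[receiving_site].append(message)


# Split brain: both sites have mounted their own copy of the same database,
# starting from identical content.
mdb01 = {"NY": ["old mail"], "Boston": ["old mail"]}

# Two outside senders mail the same nygiants.com user, but each message
# arrives at a different site.
deliver(mdb01, "NY", "mail received in NY")
deliver(mdb01, "Boston", "mail received in Boston")

# The two copies of the mailbox no longer hold the same content.
diverged = mdb01["NY"] != mdb01["Boston"]
```

Each copy now holds a message the other has never seen, and nothing in this model can reconcile them.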

DAC does this by preventing the DR servers from mounting their databases.  DAC requires that a majority of the DAG members be available for the DAG to be able to make an operational decision, in this case the DR servers mounting their databases.  A DAG that has the majority of its member servers is said to have Quorum.  So, in our previous example where the line was severed, DR would NOT mount its databases.  Why not?  Because the DAG consists of 3 total members: NYMB01, NYMB02 and DRMB01.  According to DRMB01, it’s the only surviving server, which is 1 out of 3 and not a majority; hence it cannot mount its databases.

Now, if you look at the first diagram, you will notice that MDB03 is green on DRMB01, meaning that the active copy of MDB03 is running on DRMB01.  Well, what happens in this scenario, where the WAN connection was cut?  Won’t one of the NY servers mount MDB03?  Since DRMB01 has MDB03 already mounted, won’t this cause the EXACT split brain scenario we are trying to avoid?  No.  Why not?  Remember how I said that the DAG needs to be able to make Quorum?  Well, in this case, since DRMB01 cannot make Quorum, it is forced to dismount any database that it has running.  In the event log, you’ll see the following message:

[Screenshot of the event log message: 02-Dec10 19.43]

So, DRMB01 dismounts MDB03, which is then mounted and activated in NY.  This is how the split brain scenario is avoided.
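The majority rule is easy to check with a little arithmetic.  The Python sketch below is a simplified model under my own assumptions, not the cluster’s actual algorithm: the function `has_quorum` is hypothetical, each DAG member gets one vote, and the file share witness adds a vote only when the member count is even, as described earlier.

```python
def has_quorum(reachable_members: int, total_members: int,
               witness_reachable: bool = False) -> bool:
    """Return True if this side of a partition holds a strict majority of votes.

    Simplified model: one vote per DAG member, plus one vote for the file
    share witness only when the DAG has an even number of members.
    """
    witness_votes = total_members % 2 == 0
    total_votes = total_members + (1 if witness_votes else 0)
    my_votes = reachable_members + (1 if witness_votes and witness_reachable else 0)
    # A strict majority is required; an exact tie does not count.
    return 2 * my_votes > total_votes


# DAG1 from the article: NYMB01, NYMB02 and DRMB01, with the WAN link cut.
ny_side = has_quorum(reachable_members=2, total_members=3)      # NYMB01 + NYMB02
boston_side = has_quorum(reachable_members=1, total_members=3)  # DRMB01 alone
```

In this model the NY side holds 2 of 3 votes and keeps Quorum, while DRMB01 holds 1 of 3 and does not, which is why it has to dismount MDB03.  With an even number of members, for example two servers on each side of the cut, the side that can still reach the witness is the one that ends up with the majority.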

So what does this mean if there really is a need for a datacenter failover?  At one site I work with, a pipe broke in the tenant space above them, causing a flood that threatened to destroy their datacenter.  If the datacenter had been destroyed, how would we activate DR?  We’ll go over that in Part 2 of this series.

In this article, we mainly discussed the theory and thought process behind DAGs, Datacenter Activation Coordination mode, and the concept of Quorum with regard to the cluster.  In the next article, we’ll jump in and do an actual datacenter failover.


3 Responses to How to Manage a Datacenter Failure or Disaster Recovery Scenario in Exchange 2010 – Part 1

  1. Safdar Raza says:

    very good article

  2. Alan says:

    Thanks for the great set of articles. One question for you, if you had the following configuration…

    2 sites (both containing users),
    A DAG spanning the sites,
    2 copies of each exchange database at each site
    The active databases for users working from site 1 based on site 1 servers
    The active databases for users working from site 2 based on site 2 servers
    A witness server located in site 1

    What would happen should the WAN link between the sites fail? Would the site containing the witness server decide that it now had quorum and make all of the databases active that were previously active in site 2 and would the site 2 servers take all of their databases offline?

    Thanks again

    Alan

  3. Rahul Sharma says:

    Really good article.!!

    @Alan, if you lose the WAN, the site that doesn’t have quorum will dismount its databases. So if you have 4 servers in Site A and 3 servers in Site B, Site A will remain up and Site B will go offline, as Site B has lost quorum (fewer than half of the members).
    I presume you have DAC enabled in this configuration.

    Cheers,
    Rahul
