Issue Adding RDM LUN to Exchange 2010 Server Using VMware vSphere and NetApp

**UPDATED on January 6, 2012**

Thanks to both user comments and NetApp themselves, we have determined that there is an easier way to add disks to members of a DAG without removing them from the DAG.  You can simply stop the Windows Cluster service on the DAG member (run the command “net stop ClusSvc” from the command line).  Before doing this, move any active mailbox databases on this DAG member to another DAG member, because stopping the cluster service will otherwise cause the databases to dismount and fail over on their own.  From there, you should be able to add the disks and start the cluster service again (“net start ClusSvc”), and your DAG member will automatically return to normal operation and resume participating in the DAG.

Thanks to all who sent this in!!

**Original Article**

I recently ran into an issue where I was unable to add an RDM LUN to a Windows guest running on VMware vSphere.  Here is my setup.

I had a Windows 2008 R2 guest that was running Exchange Server 2010 SP1.  The guest was a Mailbox server that was a member of a Database Availability Group (DAG).  I was attaching the LUNs as iSCSI RDMs hosted on a NetApp FAS 3140 running ONTAP 7.3.2.  The guest was running SnapDrive 6.3.

The guest had 12 iSCSI RDMs working properly for months, but the issue arose when I tried to add more.  I was able to select the volume, create the LUN, and set its size and mount location.  The problem appeared when SnapDrive asked where to store the RDM file for the VM.  See the screen below:

[Screenshot: SnapDrive prompt for where to store the RDM file]

The problem was that the console started freezing up and generally stopped responding.

[Screenshot: unresponsive SnapDrive console]

After several minutes, I eventually received an error stating “error in fetching number of vmfs datastores”.

[Screenshot: error dialog]

I tried all the basics: reinstalling SnapDrive, upgrading to SnapDrive 6.3 PP1, rebooting the host, and stopping and starting the service.

It turns out there is a bug in SnapDrive that causes the error above when the guest is a member of a Windows cluster.  Since all DAG members use Windows Clustering, this applied to me.  The resolution was easy.

I moved all the active databases off of the server in question.  Then, in the Exchange Management Console, I went to Organization Configuration->Mailbox and selected the Database Availability Groups tab.  Right-click your DAG and select Manage Database Availability Group Membership:

[Screenshot: Manage Database Availability Group Membership dialog]

Right-click the server in question, select Delete, and then click the Manage button.  This will remove the server from the DAG.

[This will not cause any issue with the existing databases as we’ll see below]

Now go back into SnapDrive and add your LUNs; everything should work now.

After you’re done, add the server back to the Database Availability Group in almost the same way you removed it: this time select Add, then select the previously removed server and add it back.

Next, for each mailbox database that the server has a copy of, run this command in the Exchange Management Shell (here MB01 is the database and NYDAGNODE1 is the server being re-added):

Add-MailboxDatabaseCopy -Identity MB01 -MailboxServer NYDAGNODE1

Or, in the EMC, go to Organization Configuration->Mailbox, right-click each mailbox database, and select Add Mailbox Database Copy.  Then select your server.

Since the server still has copies of the Mailbox Databases, it will start to resynchronize with the DAG, and bring itself up to date.  That way you won’t need to reseed your entire DB which can take some time.

Hope someone finds this useful!


31 Responses to Issue Adding RDM LUN to Exchange 2010 Server Using VMWare vSphere and NetApp

  1. Kiran says:

    Hi,

    Very interesting article, thank you very much. I would like to know if you ever had to connect to a snapshot or restore the DB. In that case SnapDrive would fail to connect to the snapshot, as it would not be able to write the mappings onto the VMFS. Please let me know your input.

    many thanks

  2. ponzekap2 says:

    Kiran,

    That’s a good point. Recovering with SnapManager for Exchange, in a case where you return to a previous snapshot, is handled by replacing the LUN on the volume, so that should work without issue. Connecting to the snapshot would most likely cause the same issue as above. In our case, we have dedicated backup/recovery servers that we use, so we wouldn’t be subject to the issue. Let me know if it does happen to you though.

  3. What am I doing wrong?

    How are you able to use SnapDrive inside your Windows 2008 R2 guest?

    Is it because I am using FC with NPIV with a Windows 2008 VM guest?

    Need iSCSI?

    Cisco UCS, vSphere 4, Brocade, NetApp 3020.

    • ponzekap2 says:

      You can use SnapDrive with RDMs over FCP. You need to attach the LUN in the guest through SnapDrive, though. You don’t present it to the VM like a typical RDM in VMware; attach it from inside the guest. It will ask for the ESX WWPN.

  4. Urban Shocker says:

    I must be doing something wrong or have something setup wrong as I don’t even see the RDMs in SnapDrive.

    What about using SnapManager for Exchange over FCP?

    • ponzekap2 says:

      Urban,

      Yes, using RDM LUNs allows you to leverage the SnapManager programs. How did you originally attach the RDM LUNs? Are they showing up as NetApp LUNs or VMware LUNs? You should be getting an error in the event log about being presented through an unsupported initiator. Make sure you’re on SnapDrive 6.3, and that you attach them using SnapDrive. If they are existing LUNs, disconnect them and reconnect using the SnapDrive console.

  5. I was successful in getting SnapDrive installed and configured on a Windows 2008 R2 guest. I’m able to map a current LUN to the guest and/or create a new LUN from the guest. I have not attempted to expand the LUN yet, but I have no reason to doubt that will work as well.

    Thanks for all the information and assistance. 🙂

    Where I’m getting hung up is migrating the guest to a new host (vMotion).

    During vMotion, I receive the message “Virtual disk ‘Hard Disk 2’ is a mapped direct-access LUN that is not accessible” from the other hosts. The SnapDrive 6.3 for Windows Install and Configuration guide has a NOTE saying that when you perform a VMotion operation, the RDM LUN validation might fail, and that you should perform an HBA rescan from the virtual infrastructure client and retry the operation. I’ve tried that, but I’m still seeing the same message during the guest migration.

    I verified SnapDrive has the proper settings for VirtualCenter or ESX Server Log On. It’s currently pointed to vcenter using the credentials (which is the full on admin account).

    I’ve tried connecting the LUN as shared vs. physical, but that does not allow me to select the appropriate iGroup (one containing WWPN of all ESX hosts). I have one thing left to try…and that is store the RDM vmdk (mapping file) on a specific DataStore vs. storing it with the VM. However, I don’t think that’s going to do it. I’ve also tried rebooting the guest and migrating while the server is down. Same error.

    There is no reference in the SnapDrive manual of NPIV, so that needs to be disabled…maybe that will do it.

    Any thoughts?

  6. Gary says:

    This is somewhat off topic but what size did you make the luns in relation to the volumes when using Snapmanager for Exchange?

    • ponzekap2 says:

      Hey Gary,

      With NetApp, unless you follow their specific formula, I usually start with the volume being 2.2 * the size of the LUN and then adjust for snapshot retention and growth after the mailbox moves. Currently on one of my DBs, the size of the LUN is 100 GB and the volume size is 300 GB, which gives us 30 days of snapshot retention. My TL volumes are 55 GB, with 25 GB TL LUNs, but I have a script set up to routinely purge all snapshots of the TL LUNs and SnapInfo LUNs, as I do not feel they are necessary.

  7. Gary says:

    The above issue apparently seems to be around. I’m seeing the same behavior on my server. Have you contacted NetApp about this? It seems odd that you’d have to jump through this hoop.

  8. Nathan says:

    Does anyone know if this bug still exists? I am running into the exact same issue: I can’t mount snapshots with SDW. The big problem is that this issue prevents SME backups from being verified because the snapshot cannot be mounted via SDW. Any updates would be appreciated. Thanks,

    • Gary says:

      Yes. It does still exist. I received this from Netapp support this morning:

      Looking into this I found a bug from November:
      http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=551189
      This is still being tested, so there’s not much info available. You might want to click the “WATCH this bug” button at the bottom of the page to be notified when we have any updates.
      In the meantime, I was able to find a workaround in our internal documentation that is much simpler than the one in the blog you mentioned:
      1) Disable the cluster service on all nodes in the cluster
      2) Create the LUNs in SnapDrive
      3) Re-enable the cluster service
      Hopefully that information will help you to resolve this issue. If so, please just send me a brief reply and I will archive the case. Please let me know if you need anything further from me regarding this issue.

      • Nathan says:

        Gary, thanks for the reply. I understand your workaround for adding a new LUN, but the problem I am having is actually when SME runs and tries to verify the backup: SDW fails because it can’t mount the snapshot copy to run the verification. So the backup job fails and doesn’t flush the logs. And if I run the job without verification, then I can’t restore Exchange data from an unverified snapshot backup. So that is the real issue I have because of this SDW issue. Thanks again,

      • Gary says:

        Had the same problem at first. Create your LUNs and disks using SnapDrive and they should verify fine. If you create them from System Manager or the web console you may get the VSS issue.

      • Nathan says:

        Yeah, already there. All LUNs for each DAG member were created with SDW before the DAG was created, with no issues. Now, with the DAG configured, SME backup verification fails, and it appears to be SDW not being able to mount a snapshot copy to verify with…? Thanks for the quick reply,

      • ponzekap2 says:

        What about doing remote verification on a machine not in the cluster? It doesn’t need Exchange installed, just the management tools and SnapDrive.

      • Nathan says:

        That is an option… I thought I needed Exchange installed on the verification server. If not, I can use the SMBR server – it has SDW and I can put the Exchange Mgmt tool on it… Still disappointing it is not working with DAG members. I’ll reply with results as soon as I am able to test. Thanks,

      • ponzekap2 says:

        VSS verifications all use eseutil.exe to verify backups, so all you really need is eseutil.exe on the remote server. Just make sure you are using the correct version for your Exchange version. Exchange 2010’s eseutil.exe is different from 2007’s.

  9. Nathan says:

    Sorry, forgot to say I am running SDW 6.3.1R1 with SME on a 2010 DAG

  10. Gary says:

    I’m also still curious on trying to map out these new LUNs and volumes in my environment. I’d like to keep no more than 250 mailboxes per databases for a total of 700 users. It’s hard to calculate the right sizes according to the Exchange calculator since Netapp is involved with deduplication, snapshotting, etc and I question whether it’s the right call. I’m tempted to just make a real big lun rather than being sorry later but I’m the kind of person who’d prefer to be precise based on a formula to calculate 20% growth, 32 snapshots (30 days plus 2 months), etc.

    • ponzekap2 says:

      Gary,

      Care should be taken when making ONE big LUN with SnapManager for Exchange. NetApp snapshots are based on the volume, so if you have 10 DBs on one big LUN, that’s all in one volume. Any SnapRestore will force ALL DBs to be reverted to that snapshot and then played forward using transaction logs. This means one DB could affect the other 9. Also, if you’re using LUNs, there is no point in using deduplication unless you thin provision the LUN on the NetApp side, and this is not recommended with Microsoft Exchange.

      For the sizing, not sure if you saw this formula from the SnapManager for Exchange admin guide. This is for DBs:

      DB Volume = [2*LUN SIZE] + [Number of Backups Stored * %data Change * Max Database Size]

      So for yours, if your users have 2 GB mailboxes, we can assume 3 GB per user (with the dumpster and such). Then it’s 250 * 3 GB = 750 GB per database. So for the LUN I’ll assume a 1 TB LUN.

      DB Volume = [2 * 1024 GB] + [32 * 20% * 750 GB]

      DB Volume = 2048 GB + 4800 GB = 6848 GB

      Obviously you can change this if the MB size per user is different, and then that would affect the LUN size and max DB size.
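To try other mailbox sizes or retention periods, the formula above can be scripted. This is only a minimal sketch in Python: the function and parameter names are made up for illustration, and it assumes the change rate is supplied as a fraction (0.20 for 20%). It is not taken from any NetApp tool or guide.

```python
def db_volume_size_gb(lun_size_gb, backups_stored, change_rate, max_db_size_gb):
    """Apply: DB Volume = [2 * LUN size] + [backups stored * % data change * max DB size].

    All sizes are in GB; change_rate is a fraction (0.20 means 20%).
    The function name and arguments are hypothetical, for illustration only.
    """
    lun_term = 2 * lun_size_gb
    snapshot_term = backups_stored * change_rate * max_db_size_gb
    return lun_term + snapshot_term


# Inputs from the worked example: 250 mailboxes at an assumed 3 GB each.
mailboxes = 250
gb_per_mailbox = 3
max_db_gb = mailboxes * gb_per_mailbox  # 750 GB database

volume_gb = db_volume_size_gb(
    lun_size_gb=1024,      # 1 TB LUN
    backups_stored=32,     # snapshots retained
    change_rate=0.20,      # 20% data change
    max_db_size_gb=max_db_gb,
)
print(f"DB volume: {volume_gb:.0f} GB")
```

With these inputs the two bracketed terms are 2048 GB and 4800 GB, and the function returns their sum. Changing `gb_per_mailbox` or `backups_stored` shows how quickly the snapshot term comes to dominate the volume size.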

      Good find on the clustering service, BTW. I’ll edit the article to let everyone know.

      • Gary says:

        No way! I’m looking at the SME 6.0.2R1 admin guide and that formula isn’t in there. I’ve been looking for that and scratching my head trying to figure this out. I will make my calculations accordingly now. Also, I’m not planning on putting all databases on 1 lun. I’m following best practices (well what I think is BP at this time) and putting each DB and set of logs on their own individual luns. Recommendations seem to change with the wind from all vendors so it’s hard keeping up sometimes. Thanks so much for your help!

  11. Gary says:

    How are you calculating the max database sizes available for your environment? For example we have aggregates that have data other than Exchange on them so trying to calculate the available iops that’s left over is throwing me.

  12. Gary says:

    I’m trying to incorporate MSFT formulas into this as well. That’s making it more difficult.

    http://technet.microsoft.com/en-us/library/ee832789.aspx#DCR

  13. Gary says:

    I wanted to pass along an important message I received yesterday from someone at NetApp who trains partners. He said that in 2003 and 2007 you did not want to thin provision, but since 2010 got rid of SIS you definitely want to thin provision (TP) as well as deduplicate. In my environment, since I have a DAG copy in a remote site, he even recommended that I have 1 volume in the main site with 2 LUNs in it, each containing a DB copy. He said deduplication will take care of all but a small amount of additional space. The benefits of this information for me are great, and it definitely makes the planning much easier. You guys may want to reconsider your design. Lastly, he said each database should be up to 2 TB to take advantage of the performance changes in 2010. If you split them up and you use DAGs, you are in a way wasting space. Since my entire environment is less than 1 TB, it makes this implementation a piece of cake. He recommended keeping 5-7 days’ worth of full snapshots in SME and 30 days of just log copies, and doing the same in the remote site.

  14. Nathan says:

    Gary, that is interesting info and good to hear/know! All the guides I have read up until now suggest not using Thin Provisioning and I had not heard the multi-LUN per Vol recommendation nor the DB sizing up to 2 TB… Wish I would have known that before this was all designed and implemented… At least I know moving forward.

    Also, I wanted to let you guys know that using a different server (in this case the SMBR server) to perform SME DB backup verifications is working. So this at least provides me a workaround for now. Thanks for the help!!

    I now have SME backups, verifications and restores working along with SMBR. I did open a case with NetApp to resolve my issue using SME for verification from a DAG member, so I’ll post any info I get from working the case.

    I realize it is probably a more desirable design to have verification off-loaded, but I still want an answer to why it won’t work in the DAG or when it is going to be fixed.

    Thanks again guys – happy holidays!

  15. Matt says:

    I’m running into the same issue trying to create new LUNs on an Exchange 2010 mailbox server that is already part of the DAG. I really don’t want to go through the trouble of removing the servers from the DAG. Is there a fix from NetApp?

    • ponzekap2 says:

      Hey Matt,

      There is, stop the clustering service on the server. See the updated article I released at the top of the original post.

  16. JJ says:

    Also, pause the node in Failover cluster manager or the cluster service may keep restarting.
