Channel: Tim McMichael

Exchange 2010: The mystery of the 9223372036854775766 copy queue…


Database Copy Self-Protection Mode

 

What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? In case you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require roughly 8 yottabytes of storage. Sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.

 

But fear not: you don’t need to start collecting all of the storage on the planet to plan for a CQL this large. It is, however, a real and valid value for a passive copy’s CQL, and in recent weeks I’ve worked with a few customers who have experienced it.

 

Background

 

During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLM\Cluster\ExchangeActiveManager\LastLog:

 

  • DatabaseGUID with a decimal value representing the last log generated by the active copy
  • DatabaseGUID_TIME with the system time of when that log file was generated

 

Here is an example of these entries in the cluster registry:

 

[Screenshot: DatabaseGUID and DatabaseGUID_TIME entries under HKLM\Cluster\ExchangeActiveManager\LastLog]

 

To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:

 

[PS] C:\Windows\system32>Get-MailboxDatabase dag-db0 | fl name,*guid*


Name : DAG-DB0
Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy), along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as the CQL is used by Active Manager to evaluate whether or not to mount a database in response to a failover.
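If you want to spot-check these values directly on a DAG member, the following is a minimal sketch. It assumes the value names follow the DatabaseGUID / DatabaseGUID_TIME pattern described above and that the shell is running on a cluster node; the exact value names and data types may differ in your environment.

# Read the LastLog entries for a database from the local cluster registry hive.
$dbGuid  = (Get-MailboxDatabase "DAG-DB0").Guid.ToString()
$lastLog = Get-ItemProperty -Path "HKLM:\Cluster\ExchangeActiveManager\LastLog"

$lastGeneration = $lastLog.$dbGuid            # last log generation (decimal)
$lastLogTime    = $lastLog."$($dbGuid)_TIME"  # when that log file was generated

"Last log generated for {0}: {1} at {2}" -f $dbGuid, $lastGeneration, $lastLogTime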

 

A Larger-than-Life CQL

 

In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.

 

We use the DatabaseGUID_TIME entry for this purpose. If the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy differ by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. This is done by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is equal to or less than the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.
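As an illustration of that check (this is not Exchange’s actual code), assuming the _TIME value has already been read and converted to a DateTime, the comparison amounts to something like this:

# Illustration only: the staleness test described above.
# $lastLogTime is assumed to be a DateTime built from the DatabaseGUID_TIME entry.
$driftMinutes = [math]::Abs(((Get-Date) - $lastLogTime).TotalMinutes)

if ($driftMinutes -gt 12) {
    # Self-protection: the passive copy reports the sentinel copy queue length.
    $copyQueueLength = 9223372036854775766
}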

 

Where Did the Time Go?

 

So why would a condition exist that causes the time discrepancy to be greater than 12 minutes in the first place? This can actually happen for a few reasons:

 

  • The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
  • The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though the node remains in cluster membership.
  • The Information Store service and Exchange Replication service could be stopped on the server hosting the active copy. (Remember that a copy that is “active” simply signifies the node that owns the copy, not the actual mount/dismount state of the database.)
  • A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.

 

What to Do When CQL Reaches 9223372036854775766?

 

To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copies to be activated were healthy prior to the copy going into self-protection mode.

 

Attempted move without overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-1           Failed                                      Dismounted  Dismounted
An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'DAG-DB0' on server 'DAG-2.domain.com' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server: DAG-2.domain.com]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase

 

Successful move with overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-2           Succeeded  922337203685472... 5/29/2012 ... Dismounted  Dismounted

 

At this point, the active copy has been moved, and the database can be mounted on the remote server (using Mount-Database if it does not mount automatically).

Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.
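For reference, you can see in PowerShell just how close that sentinel sits to the top of the signed 64-bit range:

[int64]::MaxValue
# 9223372036854775807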



Exchange 2010: Adding database copies to databases with a single copy results in copy status failed after seeding is completed…


In recent weeks, I’ve worked with customers who have experienced database copies that go into a failed state immediately after adding a copy.  In all cases, the following circumstances were noted:

 

  • The database previously only had a single copy.
  • The single copy database had a backup performed on it <or>
  • The single copy database utilized circular logging.

 

Let us explore the first scenario – single copy databases that have had a full backup performed. 

 

Using eseutil /k ENN (where ENN is the database’s log file prefix, such as E00), I examined the log files present in the log directory.  Here is a sample output:

Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000001.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000002.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000003.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000004.log - OK

Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000020.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000021.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000022.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000023.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000024.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000025.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000026.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000027.log - OK


Using Windows Server Backup, a full backup was performed of the single copy database.  When the backup completes successfully, the log stream is truncated, as noted by the log truncation event in the application log.

 

Log Name:      Application
Source:        ESE
Date:          7/1/2012 11:59:18 AM
Event ID:      224
Task Category: ShadowCopy
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.exchange.msft
Description:
Information Store (3080) Mailbox Database 0958497998: Deleting log files C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000001.log to C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000028.log.
 


For more information, click http://www.microsoft.com/contentredirect.asp.

 

Using eseutil /k ENN the log sequence after backup was verified.  Here is the sample output:

 

Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000029.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002A.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002B.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002C.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002D.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002E.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000002F.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000030.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000031.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000032.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000033.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000034.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000035.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000036.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000037.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000038.log - OK

 

As confirmed by the event and by verification of the log sequence, log truncation occurred successfully.

 

Using the Exchange Management Shell, a copy of the database was added to the other DAG member, and the shell confirmed that the database seeding was successful.
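The exact command is not shown in the original output, but it would have been along these lines (the identity and server name are taken from the copy status that follows; any additional parameters used originally are not shown):

Add-MailboxDatabaseCopy -Identity "Mailbox Database 0958497998" -MailboxServer MBX-2

The following is a sample copy status: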

 

[PS] C:\>Get-MailboxDatabaseCopyStatus "Mailbox Database 0958497998\MBX-2" | fl


RunspaceId                       : 14523823-d7fe-4e42-9558-9395f418e6d0
Identity                         : Mailbox Database 0958497998\MBX-2
Name                             : Mailbox Database 0958497998\MBX-2
DatabaseName                     : Mailbox Database 0958497998
Status                           : Failed
MailboxServer                    : MBX-2
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The required log file 32 for Mailbox Database 0958497998\MBX-2 is missing on the act
                                   ive copy. If you removed the log file, please replace it. If the log file is lost, t
                                   he database copy will need to be reseeded using Update-MailboxDatabaseCopy.

ErrorEventId                     : 2059
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 72
ReplayQueueLength                : 0
LatestAvailableLogTime           : 7/1/2012 12:09:32 PM
LastCopyNotificationedLogTime    : 7/1/2012 12:09:32 PM
LastCopiedLogTime                :
LastInspectedLogTime             :
LastReplayedLogTime              :
LastLogGenerated                 : 72
LastLogCopyNotified              : 72
LastLogCopied                    : 0
LastLogInspected                 : 0
LastLogReplayed                  : 71
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

The application log on the server on which the copy was added also has the following event:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          7/1/2012 12:09:35 PM
Event ID:      2059
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-2.exchange.msft
Description:
The required log file 32 for Mailbox Database 0958497998\MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.

 

Note:  The log generation displayed in this event is a decimal value.  Because log file names are in hexadecimal format, the value must be converted from decimal to hexadecimal in order to identify the log file.
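A quick way to do that conversion in the shell:

'{0:X}' -f 32                  # 20  (decimal 32 -> E0000000020.log)
[convert]::ToInt32("20", 16)   # 32  (hexadecimal back to decimal)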

 

If the seed is attempted again using Update-MailboxDatabaseCopy, the same error is displayed.  In this case, log file 32 (0x20) is necessary to satisfy the seeding request, and its absence prevents the seed from being successful.  Why is this log file necessary, and why is it missing from the active copy?  We’ll answer those questions a little later…

 

Let us explore the second scenario – a single copy database that is using circular logging.

 

In this case, there is a database with a single copy, and circular logging has been enabled on it.

 

Using eseutil /k I examined the log files that were present in the log file directory.  Here is a sample output:

 

Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000049.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004A.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004B.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004C.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004D.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004E.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000004F.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000050.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000051.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000052.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000053.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000054.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000055.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000056.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000057.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000058.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000059.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005A.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005B.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005C.log - OK

 

At this point, circular logging was disabled, and the database was dismounted and then mounted.  This is a requirement for databases with a single copy before adding an additional copy.
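The commands actually used are not shown in the original, but they would be along these lines (using the database name from this example):

Set-MailboxDatabase "Mailbox Database 0958497998" -CircularLoggingEnabled:$FALSE
Dismount-Database "Mailbox Database 0958497998"
Mount-Database "Mailbox Database 0958497998"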

 

Using eseutil /k I examined the log files that were present in the directory prior to adding the copy.  Here is a sample output:

 

Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005D.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005E.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000005F.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000060.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000061.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000062.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000063.log - OK
...
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000007C.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000007D.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000007E.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000007F.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000080.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000081.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000082.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000083.log - OK
Log file: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E0000000084.log - OK

Using the Exchange Management Shell, a copy of the database was added to the other DAG member, and the shell confirmed that the database seeding was successful. The following is a sample copy status:

 

[PS] C:\>Get-MailboxDatabaseCopyStatus "Mailbox Database 0958497998\MBX-2" | fl


RunspaceId                       : 14523823-d7fe-4e42-9558-9395f418e6d0
Identity                         : Mailbox Database 0958497998\MBX-2
Name                             : Mailbox Database 0958497998\MBX-2
DatabaseName                     : Mailbox Database 0958497998
Status                           : Failed
MailboxServer                    : MBX-2
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The required log file 73 for Mailbox Database 0958497998\MBX-2 is missing on the act
                                   ive copy. If you removed the log file, please replace it. If the log file is lost, t
                                   he database copy will need to be reseeded using Update-MailboxDatabaseCopy.

ErrorEventId                     : 2059
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 72
ReplayQueueLength                : 0
LatestAvailableLogTime           : 7/1/2012 2:12:32 PM
LastCopyNotificationedLogTime    : 7/1/2012 2:12:32 PM
LastCopiedLogTime                :
LastInspectedLogTime             :
LastReplayedLogTime              :
LastLogGenerated                 : 132
LastLogCopyNotified              : 131
LastLogCopied                    : 0
LastLogInspected                 : 0
LastLogReplayed                  : 131
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

The application log on the member where the copy was added also has the following event:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          7/1/2012 2:12:35 PM
Event ID:      2059
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-2.exchange.msft
Description:
The required log file 73 for Mailbox Database 0958497998\MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.

 

If the seed is attempted again using Update-MailboxDatabaseCopy, the same error is displayed. In this case, log file 73 (0x49) is necessary to satisfy the seeding request, and its absence prevents the seed from being successful.

 

Why does this issue occur?

When a database copy is created and subsequently seeded between nodes, the Replication services establish a global truncation point.  The global truncation point represents the log files that are safe to truncate across all copies of a given database within the DAG.  This process takes into account a feature introduced in Exchange 2010 SP1 known as checkpoint at log creation.  For an example of how checkpoint at log creation works in Exchange 2010, refer to http://blogs.technet.com/b/timmcmic/archive/2012/03/12/exchange-2010-log-truncation-and-checkpoint-at-log-creation-in-a-database-availability-group.aspx.

 

The Replication service establishes the global truncation point based on the checkpoint at log creation of the lowest order log file on the active copy.  Running eseutil /ml on the lowest order log file on the active copy shows the following header information:

 

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

      Base name: E00
      Log file: E00000001E6.log
      lGeneration: 486 (0x1E6)
      Checkpoint: (0x360,80,0)
      creation time: 08/05/2012 06:25:24
      prev gen time: 08/05/2012 06:25:24
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:07/02/2012 08:31:36 Rand:2868451 Computer:
      Env SystemPath: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\
      Env LogFilePath: C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\
      Env Log Sec size: 4096 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,    256,  29487)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): on
      Checkpoint at log creation time: (0x186,1,0)
      1 C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\Mailbox Database 0958497998.edb
                 dbtime: 964150 (0-964150)
                 objidLast: 4617
                 Signature: Create time:07/02/2012 08:31:36 Rand:2859643 Computer:
                 MaxDbSize: 0 pages
                 Last Attach: (0x16D,1,270)
                 Last Consistent: (0x16C,1,1F)

      Last Lgpos: (0x1e6,FF,0)

Number of database page references:  922

Integrity check passed for log file: E00000001E6.log


Operation completed successfully in 0.78 seconds.

 

In this case, the checkpoint at log creation when log file 0x1E6 was created was in log file 0x186.  Therefore, the global truncation point is established at 0x186 (390 in decimal).  The event establishing the global truncation point is present in the active copy’s Microsoft-Exchange-HighAvailability/TruncationDebug log:

 

Log Name:      Microsoft-Exchange-HighAvailability/TruncationDebug
Source:        Microsoft-Exchange-HighAvailability
Date:          8/5/2012 7:07:04 AM
Event ID:      224
Task Category: TruncationSource
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-1.exchange.msft
Description:
Global truncation point for database '46f2fd7f-4a8e-4c56-81ac-51707746dfe2' calculated '390' with the value from server '390'.


When the seeding of the database file is successful, the Replication service on the passive copy establishes a log shipping process.  At this time, the global truncation point is read to determine where log file copying must start.

 

37683 00000000 1268 Cluster.Replay LogTruncater RequestGlobalTruncationCoordination for db 46f2fd7f-4a8e-4c56-81ac-51707746dfe2 notified our lowest is 0x234, learned the global truncation is 0x186

 

The Replication service establishes that the first log file that must be copied is the checkpoint at log creation of the lowest log file found on the active copy, or, in simpler terms, the log file identified as the global truncation point.

 

37686 01E6E81B 1268 Cluster.Replay ReplicaInstance Log copying will start from generation 390

 

The Replication service then queries the source Replication service to determine the range of log files present on the active copy.

 

37758 009BAD9B 1268 Cluster.Replay LogCopyClient LogCopyClient:QueryLogRange: 0x1e6 .. 37a

 

At this time, it is determined that the necessary log files do not exist on the active copy, and the copy status is set to Failed even though the database seeding was successful.

 

37760 00000000 1268 Cluster.Replay ReplayApi FormatMessage( 3221489675 ) was successful. Message: The required log file 390 for Mailbox Database 0958497998\MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.

Where did this log file go?  If you review the backup events from the successful backup of the single copy database, the following event is noted:

 

Log Name:      Application
Source:        ESE
Date:          8/5/2012 6:37:32 AM
Event ID:      224
Task Category: ShadowCopy
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.exchange.msft
Description:
Information Store (3172) Mailbox Database 0958497998: Deleting log files C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E000000016E.log to C:\Program Files\Microsoft\Exchange Server\V14\Mailbox\Mailbox Database 0958497998\E00000001E5.log.
 

 

In this case, the log file necessary for a successful seed operation was removed as a result of a successful full backup.  Remember that when a database has a single copy, log truncation resulting from a full backup, an incremental backup, or circular logging is under the control of the Information Store service.  Whenever a database has multiple copies, the Replication service handles these tasks.  The Information Store service uses different rules for log truncation than the Replication service does; in particular, it does not account for the global truncation point when performing log truncation.

 

This behavior cannot be altered.

 

How does one work around this issue?

 

Unfortunately, once a database is converted from a single copy to multiple copies and a seed operation has completed, the target database copy cannot be used.  The process must be started over with a few modifications.

 

First, the database copy should be added to the remote node with the SeedingPostponed flag set to true.  This updates Active Directory with the additional database copy and attempts to inform the target Replication service that a copy was added.

 

Add-MailboxDatabaseCopy "DBName" -MailboxServer "TargetServer" -SeedingPostponed:$TRUE

 

Second, circular logging should be enabled on the database.  This does not require the database to be dismounted and mounted: because the database now has multiple copies, it uses continuous replication circular logging, and that flag can be set dynamically.

 

Set-MailboxDatabase "DBNAME" -CircularLoggingEnabled:$TRUE

 

It is very important at this step that sufficient time exists for the circular logging flag to replicate through Active Directory and for the target Replication service to acknowledge that circular logging is enabled.  If subsequent steps are attempted before all nodes have acknowledged the change, the missing log file failure condition will be encountered.

 

Third, the database must be seeded to the passive node.

 

Update-MailboxDatabaseCopy "DBName\Server" -DeleteExistingFiles:$TRUE

 

Lastly, the health of the copy can be verified with Get-MailboxDatabaseCopyStatus.  Here is an example of a successful seeding operation for the example database that experienced this issue and used this process:

 

[PS] C:\Windows\system32>Get-MailboxDatabaseCopyStatus

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
Mailbox Database 1608324257\MBX-2             Mounted         0         0                                  Healthy
Mailbox Database 0958497998\MBX-2             Healthy         0         0           8/5/2012 7:47:08 AM    Healthy

Why does this process work?

 

At the end of a seed operation where circular logging is enabled, the Replication service simply begins log copying at the first log file found.  It is not concerned with maintaining the same log state between nodes, as it is when circular logging is not enabled.  Therefore, enabling circular logging in this situation allows us to bypass the problem.

 

Once seeding is successful, circular logging can be dynamically disabled using the Set-MailboxDatabase cmdlet.

 

Set-MailboxDatabase "DBNAME" -CircularLoggingEnabled:$FALSE

 

After circular logging is disabled, a backup should be taken if desired.  While circular logging was in use during this process, log files were truncated from both copies, leaving previous backups unable to be used for roll-forward recovery.  The next backup taken becomes the recovery point for roll-forward recovery should a restoration be necessary.  Until that backup is taken, the database availability group should be sufficient to provide primary data protection.

Part 5: Datacenter Activation Coordination: How do I force automount consensus?


In order for any database availability group (DAG) that has Datacenter Activation Coordination (DAC) mode enabled to be able to mount any databases, the DAG itself must reach automount consensus. In the other posts in this series, I discuss and outline situations where automount consensus may not be achieved, resulting in dismounted databases.  A DAG in DAC mode that cannot achieve automount consensus will not automatically mount databases, and administrators are blocked from manually mounting them as well.

 

Let’s look at an example to illustrate this scenario.  In this example, I have a 4-member DAG made up of servers DAG-1, DAG-2, DAG-3, and DAG-4.

 

[Diagram: four-member DAG with DAG-1, DAG-2, DAG-3, and DAG-4]

 

DAG-4 experiences a hardware failure, goes down, and is scheduled for replacement.  Although the server is offline, it is still a member of the DAG and its underlying cluster.

 

[Diagram: DAG-4 offline but still a member of the DAG and cluster]

 

Next, a power failure occurs resulting in DAG-1, DAG-2, and DAG-3 going offline.

 

[Diagram: DAG-1, DAG-2, and DAG-3 offline after the power failure]

 

After power is restored, DAG-1, DAG-2, and DAG-3 are booted up.

 

[Diagram: DAG-1, DAG-2, and DAG-3 booted back up]

 

The Cluster service is successfully initialized and a cluster is formed.  You can verify that the nodes are operational by running the following commands:

 

Windows 2008:

PS C:\> cluster node
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
DAG-4                1 Down
DAG-3                2 Up
DAG-2                3 Up
DAG-1                4 Up

 

 

Windows 2008 R2:

PS C:\> Import-Module FailoverClusters

PS C:\> Get-ClusterNode | fl name,state


Name  : DAG-1
State : Up

Name  : DAG-2
State : Up

Name  : DAG-3
State : Up

Name  : DAG-4
State : Down

 

Using the Get-MailboxDatabase cmdlet, we can see that, even though the servers are up and the cluster is running, no databases are mounted:

 

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

 

Attempting to mount a database (with or without the -Force parameter) results in the following error:

 

[PS] C:\>Mount-Database "Mailbox Database 1252068500"
Couldn't mount the database that you specified. Specified database: Mailbox Database 1252068500; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: DAG-1.exchange.msft].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 1252068500:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : FD5FDFE8,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

So what needs to be done to achieve automount consensus and mount the databases?  It’s simple: the administrator can run Start-DatabaseAvailabilityGroup.  The cmdlet can be run with the -MailboxServer parameter to specify one of the Mailbox servers that is functional.  After the cmdlet is run, the specified server will have its DACP bit set to 1.  The other servers in the DAG will see that the DACP bit on the specified server is set to 1, which causes them to set their own DACP bits to 1.  Once all of the DAG members have DACP set to 1, automount consensus is reached, and the databases can be mounted. Here is an example:

 

[PS] C:\>Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-1

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True

 

Start-DatabaseAvailabilityGroup can also be used in scenarios where the remaining survivor of a DAG failure is a single member that is using the witness server boot time to achieve automount consensus.  In this case, the cmdlet forces the member’s DACP bit to 1 and updates the boot times in the registry for future restarts.

 

As a side note of caution, administrators should be careful when using Start-DatabaseAvailabilityGroup.  It is possible that a split brain condition could be created by using this cmdlet to start the wrong DAG members, or by using it at the wrong stage of a datacenter switchover. To ensure your datacenter switchovers go smoothly, I recommend using the recently released Datacenter Switchover Tool.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error caused a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================

Part 3: Datacenter Activation Coordination and the Single Node Cluster


In part 2 of this series, I provided a high-level overview of how the witness server for a database availability group (DAG) is used with Datacenter Activation Coordination (DAC) mode in the scenario when only a single member remains after a datacenter switchover.  The witness server, and in particular its boot time, is very important when the system determines whether or not automount consensus has been reached.

 

When a datacenter switchover is performed and a single DAG member remains in the recovered datacenter, the Restore-DatabaseAvailabilityGroup cmdlet configures the cluster with a Node and File Share Majority quorum.  The witness server in use is the alternate witness server configured for the DAG, or the one specified when running Restore-DatabaseAvailabilityGroup.  You can use Get-DatabaseAvailabilityGroup -Status | fl name,*witness* to verify the witness in use, as illustrated below.

 

C:\>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,*witness*
Name                      : DAG
WitnessServer             : dc-1.exchange.msft
WitnessDirectory          : C:\DAGFileShareWitnesses\DAG.exchange.msft
AlternateWitnessServer    : dc-2.exchange.msft
AlternateWitnessDirectory : C:\DAGFileShareWitnesses\DAG.exchange.msft
WitnessShareInUse         : Alternate

This configuration, although automatic, may seem strange at first.  In general, a Windows failover cluster only leverages the Node and File Share Majority quorum model when the cluster contains an even number of members.  Although this configuration is normal and expected in this case, it can raise some concerns when administrators use Failover Cluster Manager to verify the status of the remaining cluster.  Even more confusing is the warning on the cluster summary page that states that a failure of a node to access the defined file share witness may result in cluster instability.

 

[Screenshot: Failover Cluster Manager summary page showing the file share witness warning]

 

The main concern for DAGs is whether the Exchange environment will keep running if the witness server is lost (either through failure or a reboot).  Fortunately, in this configuration, when a single DAG member exists and the cluster is configured to use the Node and File Share Majority quorum model, the status of the witness server is ignored.  Thus, even if the witness server is lost for any reason, the Exchange environment will continue to function.  The decision to use this quorum model is partially based on the importance of the witness server in determining automount consensus in a single surviving member configuration.  By configuring the quorum model as Node and File Share Majority, we can leverage the cluster’s ability to monitor the witness server and to accurately report the status of this machine.

 

Thus, it is expected that after a datacenter switchover, DAGs with a single surviving member will leverage the Node and File Share Majority quorum model.
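If you want to confirm the quorum model in use on the surviving member, one way to do so on Windows Server 2008 R2 (assuming the failover clustering PowerShell module is available) is:

Import-Module FailoverClusters

Get-ClusterQuorum | fl Cluster,QuorumType,QuorumResource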

 


Part 4: Datacenter Activation Coordination and the Prevention of Split Brain


In Part 1 of this blog post series, I discussed the rules for mounting databases when Datacenter Activation Coordination (DAC) mode is enabled on a Database Availability Group (DAG).  In this post, I want to look at the specific scenarios that apply to the prevention of split brain after a Datacenter Switchover has been performed.

 

In the example below, we have a 4-member DAG that has two members in Datacenter-A and two in Datacenter-B.  There is a single domain controller installed in each datacenter, as well.

 

[Diagram: four-member DAG spanning Datacenter-A and Datacenter-B, with a domain controller in each datacenter]

 

Imagine that an outage has occurred in Datacenter-A and the decision to perform a datacenter switchover has been made.  Currently, the two remaining Exchange servers are online in Datacenter-B and the cluster is in a lost quorum state.  Using the integrated commands, the administrator starts the switchover process by running Stop-DatabaseAvailabilityGroup for the DAG members in Datacenter-A, as illustrated below.

 

Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite:Datacenter-A -ConfigurationOnly:$TRUE

 

The results of this command can be verified with Get-DatabaseAvailabilityGroup -Identity DAGNAME | fl name,StartedMailboxServers,StoppedMailboxServers.  As expected, servers in Datacenter-B remain on the started servers list, while servers in Datacenter-A are on the stopped servers list.

 

The administrator then stops the Cluster service on the surviving DAG members in Datacenter-B in preparation for the Restore-DatabaseAvailabilityGroup command.

 

Stop-Service CLUSSVC <or> net stop CLUSSVC

The final step is to run Restore-DatabaseAvailabilityGroup.  This task forces the Cluster service online on the remaining DAG members, evicts the DAG members on the stopped servers list, and configures the appropriate quorum model.

 

Restore-DatabaseAvailabilityGroup -Identity DAGNAME -ActiveDirectorySite:Datacenter-B

 

At this point, the administrator might perform additional procedures for database activation, restoration of mail flow, and restoration of client access.

 

After several hours, the DAG members in Datacenter-A come back online but without network connectivity between Datacenter-A and Datacenter-B.  The cluster services on the DAG members in Datacenter-A will start and a cluster will be formed.  This is normal and expected, as two of the four members exist in Datacenter-A along with the witness server, thereby providing the 3 votes necessary to establish quorum.

 

You can verify the membership of the cluster as seen from a member in Datacenter-A:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Name  : mbx-4
State : Down

You can also verify the membership of the cluster as seen from a member in Datacenter-B:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-3
State : Up

Name  : mbx-4
State : Up

These outputs are both accurate when considering the steps that were taken during the datacenter switchover process.  We know that the Restore-DatabaseAvailabilityGroup command successfully forced the remaining members online in Datacenter-B and evicted the members from Datacenter-A.  We also know that the members in Datacenter-A have not been able to establish communications with the members in Datacenter-B, and they are therefore unaware that any cluster membership changes have occurred.

 

The Stop-DatabaseAvailabilityGroup cmdlet updated the stopped servers list on a domain controller in Datacenter-B, but a domain controller in Datacenter-A was not accessible at the time that command was issued.  Using the Get-DatabaseAvailabilityGroup cmdlet, we can verify the started and stopped servers lists on the domain controller in Datacenter-A.

 

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers

Name                  : DAG
StartedMailboxServers : {MBX-2.exchange.msft, MBX-1.exchange.msft, MBX-3.exchange.msft, MBX-4.exchange.msft}
StoppedMailboxServers : {}

 

Comparing the two lists, note that according to a domain controller in Datacenter-A, all servers are started and no servers are stopped. According to a domain controller in Datacenter-B, two of the servers are started and two of the servers are stopped.
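One way to make that comparison explicit is to point the cmdlet at a domain controller in each site. The domain controller names below are placeholders; substitute your own:

Get-DatabaseAvailabilityGroup -Identity DAG -DomainController DC-A.exchange.msft | fl name,startedmailboxservers,stoppedmailboxservers

Get-DatabaseAvailabilityGroup -Identity DAG -DomainController DC-B.exchange.msft | fl name,startedmailboxservers,stoppedmailboxservers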

 

Even though the members in Datacenter-A established quorum, we can verify with Get-MailboxDatabase -Status | fl name,mounted that they did not mount their databases:

 

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

Name    : Mailbox Database 1370762657
Mounted :

WARNING: Exchange can't connect to the Information Store service on server MBX-4.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server.
Name    : Mailbox Database 1511135053
Mounted :

WARNING: Exchange can't connect to the Information Store service on server MBX-3.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server.

 

So why didn’t the databases mount in Datacenter-A, even though the members had quorum?

 

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

 

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0.  In the example of a lost datacenter, the bit is set to 0 when the servers power on and the Replication service initializes.  In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

 

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

 

If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active databases copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.

 

In order for the DACP bit to be set to 1 (mount database allowed) the starting DAG member must also be a member of the DAG’s cluster, and the cluster must have quorum.
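Taken together, the decision can be sketched as pseudocode. This is only an illustration of the rules above, not Exchange’s actual implementation:

# Illustrative pseudocode of the DAC automount rules described above.
function Test-DacpCanMount {
    param(
        [bool]     $IsClusterMember,
        [bool]     $ClusterHasQuorum,
        [string[]] $MembersAdvertisingDacp1 = @(),   # reachable members with DACP = 1
        [string[]] $ReachableMembers        = @(),   # members this node can contact
        [string[]] $StartedMailboxServers   = @()    # the DAG's started servers list
    )

    # The node must be a member of the cluster, and the cluster must have quorum.
    if (-not ($IsClusterMember -and $ClusterHasQuorum)) { return $false }

    # Rule 1: any reachable member already advertises a DACP bit of 1.
    if ($MembersAdvertisingDacp1.Count -gt 0) { return $true }

    # Rule 2: every server on the started servers list can be contacted.
    $unreachable = $StartedMailboxServers | Where-Object { $ReachableMembers -notcontains $_ }
    return (@($unreachable).Count -eq 0)
}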

 

In this example, MBX-1 can contact MBX-2 but no other members of the DAG.  MBX-2 does not have its DACP bit set to 1, and MBX-1 cannot contact all servers on the started servers list: because Active Directory has not replicated the updated started servers list from Datacenter-B, all nodes in the DAG still appear on the started servers list.

 

By enforcing the logic of contacting another member with a DACP bit set to 1 or contacting all servers on the started servers list, a split brain condition is prevented even when a quorum of nodes exists and the Cluster service is functioning.

 


Part 6: Datacenter Activation Coordination – Who has a say?


I recently worked with a customer that had a three-member database availability group (DAG) that was extended to two sites in a site resilience configuration. During scheduled maintenance in the primary datacenter, the customer encountered an interesting situation. In this case, the customer had two DAG members deployed in their primary datacenter and the third member deployed in a remote datacenter. In addition, Datacenter Activation Coordination (DAC) mode was enabled for their DAG.

 

There was a need to shut down the servers in the primary datacenter.  After the maintenance tasks were completed, the servers in the primary datacenter were powered back on.  It was then noted that all of the databases on the servers in the primary datacenter were dismounted.  This was verified with Get-MailboxDatabase -Status | fl name,mounted:

 

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

 

So, the administrator issued a mount command, but an error was returned:

 

[PS] C:\>Mount-Database "Mailbox Database 1370762657"
Couldn't mount the database that you specified. Specified database: Mailbox Database 1370762657; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: MBX-1.exchange.msft].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 1370762657:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : FE7E9C2B,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

 

The error indicates that the DAG members must have quorum and automount consensus in order to mount databases.  Because the DAG had DAC mode enabled, in order for automount consensus to be reached:

 

  • The node must be a member of a cluster.
  • The cluster must have quorum.
  • The node must be able to contact another member with a DACP bit set to 1 <or> it must be able to contact all other servers on the started servers list.

 

When the DAG members in the primary datacenter were shut down, the remaining DAG member went into a lost quorum state.  Therefore, the DACP bit of the third member changed to 0 in response to the Cluster service state change.  When the servers in the primary datacenter restarted, they were unable to contact another DAG member with a DACP bit set to 1. Reviewing the properties of the DAG, we saw that all three DAG members were on the started Mailbox servers list:

 

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers


Name                  : DAG
StartedMailboxServers : {MBX-3.exchange.msft, MBX-2.exchange.msft, MBX-1.exchange.msft}
StoppedMailboxServers : {}

 

It was possible for the servers in the primary datacenter to contact the Microsoft Exchange Replication service on all servers on the started Mailbox servers list.  So why, then, was automount consensus not reached?

 

When reviewing the status of servers in the cluster, we noted that the server in the remote datacenter was marked as down:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Why does MBX-3 report a status of Down?  Traditionally, when a lost quorum condition is encountered, we expect the Cluster service on the servers where quorum was lost to terminate.  Looking at the properties of the Cluster service, we see that the default recovery action is to restart the service when it terminates.

 

[Screenshot: Cluster service properties showing the recovery action of restarting the service after it terminates]

 

In reviewing the system log on MBX-3 for events that occurred at the time MBX-1 and MBX-2 were shut down, we saw the following events:

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 10:03:04 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 10:03:04 AM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

 

We saw that the Cluster service acknowledged that there were no longer enough servers to maintain quorum and that it entered the stopped state gracefully – it did not terminate.  When MBX-1 and MBX-2 were gracefully shut down in this example, they communicated with all servers in the cluster, announcing that they would be leaving.  In other words, the servers in the primary datacenter did not just unexpectedly disappear, as would be the case with a network failure or other catastrophic failure.

 

Because MBX-3 was informed that the other servers were leaving, it determined that not enough votes would remain to satisfy quorum, and it gracefully stopped its Cluster service rather than terminating it.  When MBX-1 and MBX-2 were brought back online, they formed a cluster using their own votes (2 of 3 votes, enough for quorum) and then began the process of determining automount consensus.  Because MBX-3 was still a member of the cluster but did not have its Cluster service started, it had no response to the DACP bit inquiry.  As a result, the condition that all servers on the DAG's started Mailbox servers list must be contacted when no server advertises a DACP bit of 1 was not met.

 

To resolve this, the administrator simply needs to start the Cluster service on MBX-3.  In most cases this will result in databases mounting automatically, because it allows the criteria for automount consensus to be satisfied.
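One straightforward way to do this is from one of the other DAG members by using the failover clustering cmdlets (a sketch; running net start clussvc locally on MBX-3 accomplishes the same thing):

Import-Module FailoverClusters
# Start the Cluster service on the node that is still marked as Down
Start-ClusterNode -Name mbx-3

Here is a sample showing databases automatically mounted after starting the Cluster service on MBX-3.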

 

PS C:\> Get-ClusterNode | fl name,state


Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Up


[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True

To illustrate the difference, here is an example where MBX-1 and MBX-2 were powered off instead of being gracefully shut down.  The events on MBX-3 show that the servers left unexpectedly and the Cluster service was terminated due to a lost quorum condition.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:16:01 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the running state.

Based on these events, the Cluster service was terminated and the Service Control Manager issued a restart.  When MBX-1 and MBX-2 come back up, MBX-3 successfully joins the cluster.  Thus, in this scenario, each DAG member can contact all servers on the started Mailbox servers list and receive a response, automount consensus can be reached, and databases mount automatically.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error caused a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================


Exchange 2010: StopDagServerMaintenance.ps1 resets server and database suspension states…


In Exchange 2010, when a database availability group (DAG) member needs service, it can be placed into maintenance mode. For this purpose, Exchange 2010 includes two scripts: StartDagServerMaintenance.ps1 and StopDagServerMaintenance.ps1.  For a summary of what these scripts do, see this post.

 

Within a DAG, it is not uncommon to have one or more databases or servers blocked from automatic activation by the system. Some customers configure entire servers to be blocked from activation, some block individual copies, and some do a combination of both, based on their business requirements. Administrators using the maintenance mode scripts will find their configured activation blocks reset to the unblocked defaults.  This behavior is not a problem with the scripts; in fact, the scripts are working as designed.

 

Here is an example of a database copy that has had activation suspended:

 

[PS] C:\Windows\system32>Get-MailboxDatabaseCopyStatus DAG-DB0\MBX-2 | fl name,activationSuspended


Name                             : DAG-DB0\MBX-2
ActivationSuspended              : True


Here is an example of a server that has activation blocked:

 


[PS] C:\Windows\system32>Get-MailboxServer MBX-2 | fl name,databasecopyautoactivationpolicy


Name                             : MBX-2
DatabaseCopyAutoActivationPolicy : Blocked

 

When the administrator executes the StopDagServerMaintenance.ps1 script, these states are reset back to their defaults.  Here is an example of the states after running StopDagServerMaintenance.ps1:

 


[PS] C:\Program Files\Microsoft\Exchange Server\V14\Scripts>Get-MailboxDatabaseCopyStatus DAG-DB0\MBX-2 | fl name,activationSuspended


Name                             : DAG-DB0\MBX-2
ActivationSuspended              : False

[PS] C:\Program Files\Microsoft\Exchange Server\V14\Scripts>Get-MailboxServer MBX-2 | fl name,databasecopyautoactivationpolicy


Name                             : MBX-2
DatabaseCopyAutoActivationPolicy : Unrestricted

 

While the maintenance scripts' behavior is by design, it can lead to undesirable scenarios, such as lagged database copies being activated. Of course, to eliminate these issues, an administrator can record database and server settings before maintenance and reconfigure any reset settings afterward as needed.  To help with this, here is a link to a sample script I created that records database and server activation settings into CSV files, which can then be used with the maintenance scripts to adjust the states automatically.

 

http://blogs.technet.com/b/exchange/archive/2013/04/10/preserving-activation-blocks-after-performing-dag-member-maintenance.aspx

 

At runtime, and on the server on which this script is run, two CSV files are created along with a transcript that contains the results of the command executed. The first CSV file contains all database copies assigned to the server and their activation suspension status. For example:

 

#TYPE Selected.Microsoft.Exchange.Management.SystemConfigurationTasks.DatabaseCopyStatusEntry
"Name","ActivationSuspended"
"DAG-DB0\DAG-3","True"
"DAG-DB1\DAG-3","True"


The second CSV file contains the database copy auto activation policy of the server. For example:

 

#TYPE Selected.Microsoft.Exchange.Data.Directory.Management.MailboxServer
"Name","DatabaseCopyAutoActivationPolicy"
"DAG-3","Blocked"

To use the script, the administrator must relax the execution policy on the server to allow unsigned scripts to run.  This can be accomplished by using PowerShell:

 

Set-ExecutionPolicy Unrestricted

 

Once the execution policy has been adjusted, the maintenance procedure is started using the following command:

 

.\maintenanceWrapper.ps1 -Server <SERVERNAME> -Action START

 

It is this command that is responsible for creating the CSV files containing the original database states and invoking the StartDagServerMaintenance.ps1 script.

 

When the maintenance work is completed, the maintenance procedure can be stopped by using the following command:

 

.\maintenanceWrapper.ps1 -Server <SERVERNAME> -Action STOP

It is this command that is responsible for calling the StopDagServerMaintenance.ps1 script and then resetting the database and server activation states from the states recorded in the CSV files.


Give the script a try and see if it makes maintenance mode for activation-blocked servers and databases easier for you. I hope you find this useful, and I welcome any and all feedback.

 

*Special thanks to Scott Schnoll and Abram Jackson for reviewing this content and David Spooner for validating the script.

Exchange 2010 Database Availability Groups and Disk Sector Sizes


These days, some customers are deploying Exchange databases and log files on advanced format (4K) drives.  Although these drives support a physical sector size of 4096 bytes, many vendors emulate 512 byte sectors in order to maintain backwards compatibility with applications and operating systems.  This is known as 512 byte emulation (512e).  Windows Server 2008 and Windows Server 2008 R2 support native 512 byte drives and 512 byte emulated advanced format drives.  Windows Server 2012 supports drives of all sector sizes.  The sector size presented to applications and the operating system, and how applications respond to it, directly affects data integrity and performance.

 

For more information on sector sizes see the following links:

 

When deploying an Exchange 2010 Database Availability Group (DAG), the sector sizes of the volumes hosting the databases and log files must be the same across all nodes within the DAG.  This requirement is outlined at http://technet.microsoft.com/en-us/library/ee832792(v=exchg.141).aspx.

 

“Support requires that all copies of a database reside on the same physical disk type. For example, it is not a supported configuration to host one copy of a given database on a 512-byte sector disk and another copy of that same database on a 512e disk. Also be aware that 4-kilobyte (KB) sector disks are not supported for any version of Microsoft Exchange and 512e disks are not supported for any version of Exchange prior to Exchange Server 2010 SP1.”

 

Recently, we have noted that some customers have experienced issues with log file replication and replay as the result of sector size mismatch.  These issues occur when:

 

  • Storage drivers are upgraded resulting in the recognized sector size changing.
  • Storage firmware is upgraded resulting in the recognized sector size changing.
  • New storage is presented or existing storage is replaced with drives of a different sector size.

 

This mismatch can cause one or more database copies in a DAG to fail, as illustrated below. In my example environment, I have a three-member DAG with a single database that resides on a volume labeled Z on each member and is replicated between them.

 

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         1           3/19/2013 10:27:50 AM  Healthy
SectorTest\MBX-3                              Healthy         0         1           3/19/2013 10:27:50 AM  Healthy

 

If I use FSUTIL to query the Z volume on each DAG member, we can see that the volume currently reports 512 logical bytes per sector and 512 physical bytes per sector. Thus, the Z volume is currently seen by the operating system as a native 512 byte sector disk.

 

MBX-1:

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x18d0bc1dd0bbfed6
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fb842c
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        EF486117-9094-11E2-BF55-00155D006BA1

MBX-3:

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fad281
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0000
Mft Zone End   :                  0x00000000000cc820
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

But what happens if there is a change in the way storage is seen on MBX-3, so that the volume now reflects a 512e sector size?  This can happen when upgrading storage drivers, upgrading firmware, or presenting new storage that implements advanced format sectors.

 

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fad2e7
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

 

When reviewing the database copy status, notice that the copy assigned to MBX-3 has failed.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/19/2013 11:13:05 AM  Healthy
SectorTest\MBX-3                              Failed          0         8           3/19/2013 11:13:05 AM  Healthy

The full details of the copy status of MBX-3 can be reviewed to display the detailed error:

 

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\MBX-3 | fl

RunspaceId                       : 5f4bb58b-39fb-4e3e-b001-f8445890f80a
Identity                         : SectorTest\MBX-3
Name                             : SectorTest\MBX-3
DatabaseName                     : SectorTest
Status                           : Failed
MailboxServer                    : MBX-3
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The log copier was unable to continue processing for database 'SectorTest\MBX-3' bec
                                   ause an error occured on the target server: Continuous replication - block mode has
                                   been terminated. Error: the log file sector size does not match the current volume's
                                    sector size (-546) [HResult: 0x80131500]. The copier will automatically retry after
                                    a short delay.

ErrorEventId                     : 2152
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 0
ReplayQueueLength                : 7
LatestAvailableLogTime           : 3/19/2013 11:13:05 AM
LastCopyNotificationedLogTime    : 3/19/2013 11:13:05 AM
LastCopiedLogTime                : 3/19/2013 11:13:05 AM
LastInspectedLogTime             : 3/19/2013 11:13:05 AM
LastReplayedLogTime              : 3/19/2013 10:24:24 AM
LastLogGenerated                 : 53
LastLogCopyNotified              : 53
LastLogCopied                    : 53
LastLogInspected                 : 53
LastLogReplayed                  : 46
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

 

Using the ERR utility, we can verify the definition of error -546.

 

D:\Utilities\ERR>err -546
# for decimal -546 / hex 0xfffffdde
  JET_errLogSectorSizeMismatch                                   esent98.h
# /* the log file sector size does not match the current
# volume's sector size */
# 1 matches found for "-546"

 

In addition, the application event log may contain the following entries:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          3/19/2013 11:14:58 AM
Event ID:      2152
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The log copier was unable to continue processing for database 'SectorTest\MBX-3' because an error occured on the target server: Continuous replication - block mode has been terminated. Error: the log file sector size does not match the current volume's sector size (-546) [HResult: 0x80131500]. The copier will automatically retry after a short delay.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="MSExchangeRepl" />
    <EventID Qualifiers="49156">2152</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-03-19T18:14:58.000000000Z" />
    <EventRecordID>2502</EventRecordID>
    <Channel>Application</Channel>
    <Computer>MBX-3.exchange.msft</Computer>
    <Security />
  </System>
  <EventData>
    <Data>SectorTest\MBX-3</Data>
    <Data>Continuous replication - block mode has been terminated. Error: the log file sector size does not match the current volume's sector size (-546) [HResult: 0x80131500].</Data>
  </EventData>
</Event>

Why does this issue occur?  Each log file records, in its header, the sector size of the disk where the log file was created.  For example, this is the header of a log file on MBX-1, where the disk has a native 512 byte sector size:

 

Z:\SectorTest>eseutil /ml E0100000001.log

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

      Base name: E01
      Log file: E0100000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x38,FFFF,FFFF)
      creation time: 03/19/2013 09:40:14
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:03/19/2013 09:40:14 Rand:11019164 Computer:
      Env SystemPath: z:\SectorTest\
      Env LogFilePath: z:\SectorTest\
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,   2048,  44204)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,8,0)

      Last Lgpos: (0x1,A,0)

Number of database page references:  0

Integrity check passed for log file: E0100000001.log

Operation completed successfully in 0.62 seconds.

 

The sector size that is chosen is determined through one of two methods:

 

  • If the log stream is brand new, read the sector size from disk and utilize this sector size.
  • If the log stream already exists, use the sector size of the given log stream.

 

In theory, since the sector size of disks should not be changing across nodes and the sector size of all disks must match, this should not cause a problem.  In our example, and in some customer environments, these sector sizes are actually changing.  Since most of these databases already exist, the existing sector size of the log stream is utilized, which in turn causes a mismatch between DAG members.

 

When a mismatch occurs, the issue only prevents the successful use of block mode replication.  It does not affect file mode replication.  Block mode replication was introduced in Exchange 2010 Service Pack 1.  For more information on block mode replication, see http://technet.microsoft.com/en-us/library/ff625233(v=exchg.141).aspx.

 

Why does this only affect block mode replication?  When a log file is addressed, we reference locations within the log file by log file position (LGPOS).  The log file position is a combination of the log generation, the sector, and the offset within that sector.  For example, in the previous header dump you can see the "Last Lgpos" is (0x1,A,0) – this happens to be the last log file position within the log.  Let us say we were creating a block for block mode replication within log generation 0x1A at sector 8, offset 1 – this would be reflected as an LGPOS of (0x1a,8,1).  When this block is transmitted to a host with an advanced format disk, the log position has to be translated: eight 512 byte sectors occupy the same space as one 4096 byte sector, so on the advanced format disk this same position becomes (0x1a,1,1).  As you can see, it could create significant problems if incorrect positions within a log file were written to or read from.
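To make the translation concrete, here is a trivial sketch of the underlying byte arithmetic in PowerShell (purely illustrative; these variables are not part of any Exchange tool):

# LGPOS (0x1a,8,1) on a 512 byte sector log: byte offset within the log file
$byteOffset = 8 * 512 + 1                        # 4097
# The same byte offset expressed against 4096 byte sectors
$sector4k = [math]::Floor($byteOffset / 4096)    # 1
$offset4k = $byteOffset % 4096                   # 1
"512 byte LGPOS sector 8, offset 1 -> 4K LGPOS sector $sector4k, offset $offset4k"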

 

How do I go about correcting this condition?  First, ensure that the same sector sizes exist on all disks across all nodes that host Exchange data, and then reset the log stream.  The following steps show you how to do this with minimal downtime.

 

0)  Ensure that Exchange 2010 Service Pack 2 or later is installed on all DAG members (Exchange 2010 Service Pack 1 and earlier do not support 512e volumes).

 

1)  Disable block mode replication on all hosts.  This step requires restarting the Microsoft Exchange Replication service on each node.  This will temporarily cause all copies to fail on passive nodes when the service is restarted on the active node.  When the service is restarted on a passive node, only the passive copies on that node will enter a failed state.  Mounted databases and client connections are not impacted by this activity.  Block mode replication should remain disabled until all steps have been completed on all DAG members.  The registry change is described below, followed by an equivalent PowerShell sketch.

 

  • Launch Registry Editor.
  • Navigate to HKLM –> Software –> Microsoft –> ExchangeServer –> V14 –> Replay –> Parameters.
  • Right-click the Parameters key and select New –> DWORD Value.
  • Name the new DWORD DisableGranularReplication.
  • Set the value of the DWORD to 1.
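Equivalently, the value can be created from PowerShell on each node (a sketch that assumes the default location of the Replay parameters key shown above):

New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\V14\Replay\Parameters' -Name DisableGranularReplication -PropertyType DWord -Value 1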

 

image

 

2)  Restart the Microsoft Exchange Replication service on each member.

 

  • Using PowerShell, issue the following:  Stop-Service MSExchangeRepl
  • Using PowerShell, issue the following:  Start-Service MSExchangeRepl

 

PS C:\> Stop-Service MSExchangeREPL
PS C:\> Start-Service MSExchangeREPL

 

3)  Validate that all copies of databases across DAG members are healthy at this time by running Get-MailboxDatabaseCopyStatus.

 

  • Using the Exchange Management Shell execute get-mailboxdatabasecopystatus.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/19/2013 12:28:34 PM  Healthy
SectorTest\MBX-3                              Healthy         0         0           3/19/2013 12:28:34 PM  Healthy

 

4)  Apply the appropriate hotfix for Windows Server 2008 or Windows Server 2008 R2 and Advanced Format Disks.  Windows Server 2012 does not require a hotfix.

 

 

5)  Repeat the procedure that caused the disk sector size to change.  For example, if the issue arose as a result of upgrading drivers and firmware on one host, use your maintenance mode procedures to complete the driver and firmware upgrade on all hosts.  Note:  If your installation does not allow you to use the same sector sizes across all DAG members, then the implementation is not supported.

 

6)  Utilize FSUTIL to ensure that the sector sizes match across all hosts for the log and database volumes. 

 

MBX-1:

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x18d0bc1dd0bbfed6
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fac6e6
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        EF486117-9094-11E2-BF55-00155D006BA1

 

 

MBX-2

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0xfa6a794c6a790723
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fac86f
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        5F18A2FC-909E-11E2-8599-00155D006BA2

 

MBX-3

C:\>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fabfd6
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

 

At this point, the DAG should be stable, and replication should be occurring as expected between databases using file mode. In order to restore block mode replication and fully recognize the new disk sector sizes, the log stream must be reset.  Please note the following about resetting the log stream:

 

  • The log stream must be fully reset on all database copies.
  • All lagged database copies must be replayed to the current log.
  • If backups are utilized as a recovery method, this will introduce a gap in the log file sequence, preventing a full roll-forward recovery from the last backup point.

 

To reset the log file stream, the following steps can be used:

 

1)  Validate the existence of a replay queue using Get-MailboxDatabaseCopyStatus:

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy
SectorTest\MBX-3                              Healthy         0         138         3/19/2013 1:34:37 PM   Healthy

 

2)  Set the replay and truncation lag time values to 0 on all database copies.  This ensures that logs replay to current while allowing the databases to remain online.  In my example, MBX-3 is a lagged database copy. When the configuration change is detected, log replay will occur, allowing the lagged copy to eventually catch up.  Note that depending on the replay lag time, it could take several hours before you can proceed to the next steps.

 

[PS] C:\>Set-MailboxDatabaseCopy SectorTest\MBX-3 -ReplayLagTime 0.0:0:0 -TruncationLagTime 0.0:0:0
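If lag is configured on more than one copy, the same change can be applied to every copy of the database in one pass (a sketch using the SectorTest database from this example):

Get-MailboxDatabaseCopyStatus SectorTest\* | ForEach-Object { Set-MailboxDatabaseCopy -Identity $_.Name -ReplayLagTime 0.0:0:0 -TruncationLagTime 0.0:0:0 }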

 

3)  Validate that the replay queue has caught up and is near zero.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy
SectorTest\MBX-3                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy

4)  Dismount the database.  This activity will cause a client interruption, which will continue until the database can be mounted.

 

[PS] C:\>Dismount-Database SectorTest

Confirm
Are you sure you want to perform this action?
Dismounting database "SectorTest". This may result in reduced availability for mailboxes in the database.
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Dismounted      0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTest\MBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy

 

5)  On each DAG member hosting a database copy, open a command prompt and navigate to the log file directory.  Execute eseutil /r ENN to perform a soft recovery.  This step is necessary to ensure that all log files are played into all copies.

 

Z:\SectorTest>eseutil /r e01

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating RECOVERY mode...
    Logfile base name: e01
            Log files: <current directory>
         System files: <current directory>

Performing soft recovery...
                      Restore Status (% complete)

          0    10   20   30   40   50   60   70   80   90  100
          |----|----|----|----|----|----|----|----|----|----|
          ...................................................

Operation completed successfully in 0.203 seconds.

 

6)  On each DAG member hosting a database copy, open a command prompt and navigate to the database directory.  Execute eseutil /mh <EDB> against the database to dump the header.  You must validate that the following information matches on all database copies:

 

  • All copies of the database show in clean shutdown.
  • All copies of the database show the same last detach information.
  • All copies of the database show the same last consistent information.

 

Here is example output of a full /mh dump followed by a comparison of the data across our three sample copies.

 

Z:\SectorTest>eseutil /mh SectorTest.edb

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...
         Database: SectorTest.edb

DATABASE HEADER:
Checksum Information:
Expected Checksum: 0x010f4400
  Actual Checksum: 0x010f4400

Fields:
        File Type: Database
         Checksum: 0x10f4400
   Format ulMagic: 0x89abcdef
   Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,17
Engine ulVersion: 0x620,17
Created ulVersion: 0x620,17
     DB Signature: Create time:03/19/2013 09:40:15 Rand:11009066 Computer:
         cbDbPage: 32768
           dbtime: 601018 (0x92bba)
            State: Clean Shutdown
     Log Required: 0-0 (0x0-0x0)
    Log Committed: 0-0 (0x0-0x0)
   Log Recovering: 0 (0x0)
  GenMax Creation: 00/00/1900 00:00:00
         Shadowed: Yes
       Last Objid: 3350
     Scrub Dbtime: 0 (0x0)
       Scrub Date: 00/00/1900 00:00:00
     Repair Count: 0
      Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
  Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:11
      Last Attach: (0x111,9,86)  03/19/2013 13:42:29
      Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:11
             Dbid: 1
    Log Signature: Create time:03/19/2013 09:40:14 Rand:11019164 Computer:
       OS Version: (6.1.7601 SP 1 NLS ffffffff.ffffffff)

Previous Full Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Incremental Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Copy Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Differential Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Current Full Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Current Shadow copy backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

     cpgUpgrade55Format: 0
    cpgUpgradeFreePages: 0
cpgUpgradeSpaceMapPages: 0

       ECC Fix Success Count: none
   Old ECC Fix Success Count: none
         ECC Fix Error Count: none
     Old ECC Fix Error Count: none
    Bad Checksum Error Count: none
Old bad Checksum Error Count: none

  Last checksum finish Date: 03/19/2013 13:11:36
Current checksum start Date: 00/00/1900 00:00:00
      Current checksum page: 0

Operation completed successfully in 0.47 seconds.

MBX-1:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:11

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:11

 

MBX-2:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:12

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:12

 

MBX-3:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:13

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:13

 

In this case, the values match across all copies so further steps can be performed.  If the values do not match across copies for any reason, do not continue and please contact Microsoft support.

 

7)  Reset the log file generation for the database.  Note: Use Get-MailboxDatabaseCopyStatus to record database locations and status prior to performing this activity.

 

Locate the log file directory for each ACTIVE (DISMOUNTED) database. Remove all log files from this directory first. Failure to remove log files from the ACTIVE (DISMOUNTED) database may result in the Replication service recopying log files, a failure of this procedure, and subsequent need to reseed all database copies. Note:  If log files are located in the same location as the database and catalog data folder, take precautions to not remove the database or the catalog data folder.

In our example MBX-1 hosts the ACTIVE (DISMOUNTED) copy.

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Dismounted      0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTest\MBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy

Locate the log file directory for each PASSIVE database copy. Remove all log files from this directory. Failure to remove all log files could result in this procedure failing and the need to reseed this or all database copies. If log files are located in the same location as the database and catalog data folder, take precautions to not remove the database or the catalog data folder. An illustrative removal command is shown after the status output below.

 

In our example MBX-2 and MBX-3 host the passive database copies.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Dismounted      0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTest\MBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
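For illustration, here is one way to remove just the log files on a member (a sketch that assumes the Z:\SectorTest path and E01 log prefix used in this example; verify the path and prefix before running anything destructive):

# Removes only the E01 transaction logs; the .edb file and the CatalogData folder are left in place
Remove-Item -Path 'Z:\SectorTest\E01*.log'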

 

8)  Mount the database using Mount-Database <DBNAME>, and verify the mount.

 

[PS] C:\>Mount-Database SectorTest

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         1           3/25/2013 5:57:28 AM   Healthy
SectorTest\MBX-3                              Healthy         0         1           3/25/2013 5:57:28 AM   Healthy

9)  Suspend and resume all passive database copies.

 

Get-MailboxDatabaseCopyStatus DBNAME\* | Suspend-MailboxDatabaseCopy  (Note:  The error on suspending the active database copy is expected.)

 

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\* | Suspend-MailboxDatabaseCopy
The suspend operation can't proceed because database 'SectorTest' on Exchange Mailbox server 'MBX-1' is the active mailbox database copy.
    + CategoryInfo          : InvalidOperation: (SectorTest\MBX-1:DatabaseCopyIdParameter) [Suspend-MailboxDatabaseCopy], InvalidOperationException
    + FullyQualifiedErrorId : 5083D28B,Microsoft.Exchange.Management.SystemConfigurationTasks.SuspendDatabaseCopy
    + PSComputerName        : mbx-1.exchange.msft

 

Get-MailboxDatabaseCopyStatus DBNAME\* | Resume-MailboxDatabaseCopy  (Note:  The warning on resuming the active database copy is expected.)

 

[PS] C:\>Get-MailboxDatabaseCopyStatus SectorTest\* | Resume-MailboxDatabaseCopy
WARNING: The Resume operation won't have an effect on database replication because database 'SectorTest' hosted on server 'MBX-1' is the active mailbox database.

 

10)  Validate replication health.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Mounted         0         0                                  Healthy
SectorTest\MBX-2                              Healthy         0         0           3/19/2013 1:56:12 PM   Healthy
SectorTest\MBX-3                              Healthy         0         0           3/19/2013 1:56:12 PM   Healthy

 

11)  Using Set-MailboxDatabaseCopy, reconfigure any replay lag or truncation lag time on the database copy.  This example implements a 7 day replay lag time.

 

Set-MailboxDatabaseCopy -Identity SectorTest\MBX-3 -ReplayLagTime 7.0:0:0

 

12)  Repeat the previous steps for all databases in the DAG including those databases that have a single copy.  DO NOT proceed to the next step until all databases have been reset.

 

13)  Re-enable block mode replication. Using Registry Editor, navigate to HKLM –> Software –> Microsoft –> ExchangeServer –> V14 –> Replay –> Parameters, and then remove the DisableGranularReplication DWORD.
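As in step 1, this can also be done from PowerShell (a sketch that assumes the same key path used earlier):

Remove-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\V14\Replay\Parameters' -Name DisableGranularReplication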

 

14)  Restart the replication service on each DAG member.

 

Stop-Service MSExchangeREPL

Start-Service MSExchangeREPL

15)  Validate database health using Get-MailboxDatabaseCopyStatus.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
SectorTest\MBX-1                              Healthy         0         0           3/19/2013 2:25:56 PM   Healthy
SectorTest\MBX-2                              Mounted         0         0                                  Healthy
SectorTest\MBX-3                              Healthy         0         230         3/19/2013 2:25:56 PM   Healthy

16)  Dump the header of a log file and verify that the new sector size is reflected in the log file stream. To do this, open a command prompt and navigate to the log file directory for the database on the active node. Run eseutil /ml against any log within the directory, and verify that the sector size reflects 4096 and (matches).

 

Z:\SectorTest>eseutil /ml E0100000001.log

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

      Base name: E01
      Log file: E0100000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x17B,FFFF,FFFF)
      creation time: 03/19/2013 13:56:11
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:03/19/2013 13:56:11 Rand:2996669 Computer:
      Env SystemPath: z:\SectorTest\
      Env LogFilePath: z:\SectorTest\
      Env Log Sec size: 4096 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,    256,  44204)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,1,0)

      Last Lgpos: (0x1,2,0)

Number of database page references:  0

Integrity check passed for log file: E0100000001.log

Operation completed successfully in 0.250 seconds.

 

If the above steps have been completed successfully, and the log file sequence recognizes a 4096 sector size, then this issue has been resolved.

 

========================================

This guidance was validated in the following configurations:

 

  • Windows 2008 R2 Enterprise with Exchange 2010 Service Pack 2
  • Windows 2008 R2 Enterprise with Exchange 2010 Service Pack 3
  • Windows 2008 SP2 Enterprise with Exchange 2010 Service Pack 3
  • Windows 2012 Datacenter with Exchange 2010 Service Pack 3

========================================

Exchange 2010: Page Zeroing and VSS Based Backups


In Exchange Server 2010 Service Pack 1, page zeroing is enabled by default and there is no method to disable it. Page zeroing is a process that takes pages that exist in whitespace within the database and marks them with a pattern of zeros, making them forensically unrecoverable.  The page zeroing process runs as part of the background maintenance process.

 

Recently we have investigated cases where page zeroing has led to larger than expected VSS backup data sets. Specifically, these involve VSS-based products that perform a delta backup from a previous snapshot rather than transferring a full data set with each backup. These products might include, but are not limited to:

 

  • System Center Data Protection Manager
  • Windows Server Backup (when leveraging non-network storage)
  • Hardware-based VSS providers

 

In all cases investigated, there was an event that increased the amount of whitespace within the database. For example, multiple mailboxes were moved to another database, leaving a noticeable amount of whitespace behind in the source database.

 

Let’s look at an example:

 

The anchor backup is taken of a 500 GB database with 10 GB of associated log files, resulting in a transfer of 510 GB to the backup medium.  Over the course of the day, 6 GB of changes occur within the database (5 GB of actual user changes and 1 GB of page zeroing changes) along with 10 GB of associated logs.  This yields a delta transfer of 16 GB to the backup medium.  Over time, delta transfers float around 16 GB with standard usage patterns.  At some point the administrator migrates a group of users out of the database, which accounts for 30 GB of change.  The resulting backup to the medium is now 30 GB plus log files.  At this point, the increase in the delta transfer is expected because there was an event that caused an increase in database activity.  What is not expected is that from this point forward, the delta transfers to the backup medium continue to be greater than 30 GB.  In many cases this exceeds the expected snapshot size and the storage allocated on the backup server.

 

In these instances, the page zeroing process continuously zeros pages that are already zeroed, thereby causing a daily delta change rate that corresponds with whitespace.

 

The following actions can be taken to correct this condition:

 

1)  Adjust snapshot storage to accommodate the larger delta data sets.

2)  Migrate all mailboxes out of the mailbox database and remove the mailbox database.

3)  Allow the whitespace to be recycled as additional mailboxes are added to the database.

4)  Offline defragment the database.

 

Exchange 2010 Service Pack 3 Rollup Update 1 has a code change that corrects the page zeroing behavior.

Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover…


When running Restore-DatabaseAvailabilityGroup as part of the datacenter switchover process, servers in the secondary datacenter are forced online from a quorum and cluster perspective, and servers in the primary datacenter are evicted from the DAG’s cluster. When nodes in the primary datacenter come back online and network connectivity is restored, these restored nodes are not aware that any changes to cluster membership have occurred.  The cluster services on the nodes in the primary datacenter will attempt to join/form a cluster with the nodes running in the secondary datacenter.  When this occurs, the nodes in the secondary datacenter inform the nodes in the primary datacenter that they were evicted. 

 

After a datacenter switchover has occurred, unless the original datacenter is gone or otherwise unrecoverable, eventually services in the primary datacenter will be restored.  When services are restored, including full network connectivity, database availability group (DAG) administrators can begin the switchback process by using the Start-DatabaseAvailabilityGroup cmdlet.
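For reference, a typical invocation looks like the following (a sketch; the DAG in these examples is named DAG, and <PrimarySiteName> is a placeholder for the Active Directory site of the restored primary datacenter):

Start-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite <PrimarySiteName>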

 

Before performing a switchback, you can perform the following tasks to verify that it is safe to run Start-DatabaseAvailabilityGroup for servers in the primary datacenter.

 

The first task is to ensure that the following events are present in the system log of the servers on the StoppedMailboxServers list:

 

Log Name:      System
Source:        Service Control Manager
Date:          5/27/2012 1:13:35 PM
Event ID:      7040
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
Computer:      MBX-1.exchange.msft
Description:
The start type of the Cluster Service service was changed from auto start to disabled.


Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          5/27/2012 1:13:35 PM
Event ID:      4621
Task Category: Cluster Evict/Destroy Cleanup
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-1.exchange.msft
Description:
This node was successfully removed from the cluster.

Log Name:      System
Source:        Service Control Manager
Date:          5/27/2012 1:13:35 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.exchange.msft
Description:
The Cluster Service service entered the stopped state.


In this example, MBX-1 was informed of the eviction, had its cluster configuration cleaned up, and had its Cluster service startup type set to Disabled. The second task is to verify that the Cluster service startup type is set to Disabled. You can use the Services snap-in to verify this.

 

image

 

The third and last task is to verify that the cluster registry has been successfully cleaned up. This is an important step because any remnants of the cluster registry can lead the server to believe it is still in a cluster even though it has been evicted. You can use Registry Editor and navigate to HKEY_LOCAL_MACHINE (HKLM). If there is a hive called Cluster under the root of HKLM, then the cleanup did not complete successfully.

 

Here is an example of a node where a successful cleanup was performed:

 

image

 

Here is an example of a node where the Cluster service has not been successfully cleaned up:

 

image
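If you would rather check for the leftover hive from PowerShell instead of opening Registry Editor, a quick sketch:

# Returns False when the cleanup completed successfully; True means the Cluster hive is still present.
Test-Path 'HKLM:\Cluster'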

 

Anytime part of the cleanup process fails, it typically means that Start-DatabaseAvailabilityGroup will also fail. If any of these three tasks shows that cleanup did not complete successfully, it's relatively easy to fix: administrators can force the cleanup to occur by running a cluster command.

 

Windows 2008:

 

cluster node <NODENAME> /forcecleanup

 

Windows 2008 R2 / Windows 2012:

 

Import-Module FailoverClusters

Clear-ClusterNode <NODENAME> -Force

 

Some administrators proactively include this as a step in their datacenter switchover documentation when bringing resources back to the primary datacenter. This is not a bad idea. Proactively running this command, even on a node that was cleaned up successfully, has no ill effects and eliminates the need to perform the three tasks listed above.

 

Therefore, I recommend administrators either incorporate the three tasks or proactively run the cleanup command as a part of their datacenter switchover procedures.
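For completeness, once the checks pass (or the cleanup command has been run), the servers in the primary datacenter can be brought back into the DAG with Start-DatabaseAvailabilityGroup.  A minimal sketch, assuming a hypothetical primary Active Directory site named Exchange-A:

# Add the recovered site back to the DAG...
Start-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-A

# ...or target individual servers instead of an entire site.
Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-1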

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error caused a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================

Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG…


Sometimes, even when following a specific process, it takes only one mistake to send the entire process off course.  Recently I’ve worked with several customers who, while performing their datacenter switchover steps, found themselves unable to complete the process.  Let’s explore several examples of what happened…

====================================================================================

In the first example, we have a four member database availability group (DAG).  Two members are deployed in the primary datacenter along with the witness server, and the other two members are installed in a remote datacenter with an alternate witness server.  Each datacenter is an Active Directory site with a defined subnet.  In this example, AD site Exchange-A is the primary datacenter and AD site Exchange-B is the remote datacenter.  Here is an example network diagram:

 

image

 

 

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down.  This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter.  The administrator starts the datacenter switchover process with Stop-DatabaseAvailabilityGroup, as shown in this example:

 

Stop-DatabaseAvailabilityGroup –identity DAG –ActiveDirectorySite Exchange-B –ConfigurationOnly:$TRUE –confirm:$FALSE

WARNING: Active Directory couldn't be updated in Exchange-A site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.
An error caused a change in the current set of domain controllers.
    + CategoryInfo          : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 372697AD

Next, the cluster service is stopped on MBX-3 and MBX-4.

 

Stop-service clussvc

 

To complete the switchover, Restore-DatabaseAvailabilityGroup is used.

 

Restore-DatabaseAvailabilityGroup –identity DAG –ActiveDirectorySite:Exchange-B

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file
"C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_14-07-52.764_restore-databaseavailabilitygroup.log".
Unable to get the status of the cluster service on server 'MBX-2'. Error: 'Cannot open Service Control Manager on computer 'MBX-2'. This operation might require other privileges.'
    + CategoryInfo          : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], FailedToGetServiceStatusForNodeException
    + FullyQualifiedErrorId : A9B129A5,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

 

The command returns an error indicating that it cannot contact server MBX-2 in order to determine the status of the Cluster service.  Why is the task attempting to contact a server in the primary site that is down?  Using Get-DatabaseAvailabilityGroup to review the properties of the DAG shows us why:

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,startedmailboxservers,stoppedmailboxservers

Name                                   : DAG
Servers                                : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers                  : {MBX-4.exchange.msft, MBX-3.exchange.msft}

StartedMailboxServers                  : {MBX-1.exchange.msft, MBX-2.exchange.msft}

 

We can examine StoppedMailboxServers and note that MBX-3 and MBX-4 are on the stopped list when they should be on the started servers list.  This happened because in this instance the administrator stopped the wrong Active Directory site.  When using Stop-DatabaseAvailabilityGroup, the administrator should have specified site Exchange-A but accidentally specified Exchange-B.  This means the restore task is attempting to force the Cluster service on either MBX-1 or MBX-2 online and subsequently evict MBX-3 and MBX-4 from the cluster.

 

If this mistake is made, how do you fix it?  The first step that needs to be done is to correct the stopped and started servers list.  To do this, first stop the correct set of servers.

 

Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-A -ConfigurationOnly:$TRUE -Confirm:$FALSE


WARNING: Active Directory couldn't be updated in Exchange-A site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.
An error caused a change in the current set of domain controllers.
    + CategoryInfo          : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 372697AD

 

Next, use Get-DatabaseAvailabilityGroup to confirm that all four servers in the DAG now appear on the StoppedMailboxServers list.

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name                                   : DAG
Servers                                : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers                  : {MBX-2.exchange.msft, MBX-1.exchange.msft, MBX-4.exchange.msft, MBX-3.exchange.msft}
StartedMailboxServers                  : {}

 

The second step requires starting the servers in the remote datacenter.  Start-DatabaseAvailabilityGroup can be used to do this, as shown in the following example:

 

Start-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-B

WARNING: Active Directory couldn't be updated in Exchange-A site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.
An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API 'OpenByNames('MBX-3.exchange.msft', 'MBX-4.exchange.msft') failed for each server. Specific exceptions: 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-
3.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..', 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-4.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..'.' failed.. + CategoryInfo          : NotSpecified: (0:Int32) [Start-DatabaseAvailabilityGroup], AmClusterApiException
    + FullyQualifiedErrorId : BA1A902A,Microsoft.Exchange.Management.SystemConfigurationTasks.StartDatabaseAvailabilityGroup

An error caused a change in the current set of domain controllers.


    + CategoryInfo : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 372697AD

 

The failures that are displayed are expected.  The Cluster services on the nodes are not in a started state at this time.  Using Get-DatabaseAvailabilityGroup we note that the servers listed are correct for both the StartedMailboxServers and StoppedMailboxServers list.

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name                                   : DAG
Servers                                : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers                  : {MBX-2.exchange.msft, MBX-1.exchange.msft}
StartedMailboxServers                  : {MBX-3.exchange.msft, MBX-4.exchange.msft}

 

The third step is to ensure the Cluster service is stopped on each node, which can be accomplished by using Stop-Service.

 

Stop-Service ClusSvc

 

The last step is to use Restore-DatabaseAvailabilityGroup.  This cmdlet will complete the datacenter switchover process by forcing the Cluster service to start and by evicting the nodes on the StoppedMailboxServers list.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-B -Confirm:$FALSE

WARNING: The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server dc-2.exchange.msft.

 

This completes the datacenter switchover for the database availability group.  The procedure can now continue with database activation and changes required for client access.

 

====================================================================================

 

In the second example we have a four-member DAG. Two members are in the primary datacenter with the witness server, and two members are in a remote datacenter with an alternate witness server configured. Both datacenters are in the same Active Directory site. Here is an example network diagram:

 

image

 

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down. This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter. So the administrator starts the datacenter switchover process by issuing Stop-DatabaseAvailabilityGroup, as illustrated in the following example:

 

Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -ConfigurationOnly:$TRUE -Confirm:$FALSE
WARNING: Active Directory couldn't be updated in Exchange site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.

 

Next, the Cluster service is stopped on MBX-3 and MBX-4.

 

Stop-service ClusSvc

To complete the activation, Restore-DatabaseAvailabilityGroup is issued.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -Confirm:$FALSE

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_16-57-27.326_restore-databaseavailabilitygroup.log". Unable to form quorum for database availability group 'DAG'. Please try the operation again, or run the Restore-DatabaseAvailabilityGroup cmdlet and specify the site with servers known to be running.
    + CategoryInfo          : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], DagTaskQuorumNotAchievedException
    + FullyQualifiedErrorId : C7FE0CB9,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

 

The command returns an error indicating that a quorum cannot be formed because no servers are known to be running.  Why has this occurred?  Using Get-DatabaseAvailabilityGroup we can review the properties of the DAG:

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name                  : DAG
Servers               : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers : {MBX-1.exchange.msft, MBX-4.exchange.msft, MBX-3.exchange.msft, MBX-2.exchange.msft}
StartedMailboxServers : {}

 

Specifically we are interested in StoppedMailboxServers.  In this example, all four DAG members appear in the StoppedMailboxServers list.  Why is that?  In our scenario, all Exchange servers are in the same Active Directory site.  The administrator issued the Stop-DatabaseAvailabilityGroup command with the ActiveDirectorySite parameter when the MailboxServer parameter should have been used instead.  The MailboxServer parameter was needed so that the administrator could stop individual servers instead of all of the servers in the site.
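For reference, in a single-site topology the stop operation should target only the failed servers, roughly like this (a sketch using the server names from this example):

Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-1 -ConfigurationOnly:$TRUE -Confirm:$FALSE
Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-2 -ConfigurationOnly:$TRUE -Confirm:$FALSE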

 

If this mistake is made, you can recover from it fairly easily.  The first step is to fix the started and stopped mailbox server lists.  You can use Start-DatabaseAvailabilityGroup to correct this.

 

Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-3

An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API 'OpenByNames('MBX-3.exchange.msft') failed for each server. Specific exceptions: 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-3.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..'.' failed..
    + CategoryInfo          : NotSpecified: (0:Int32) [Start-DatabaseAvailabilityGroup], AmClusterApiException
    + FullyQualifiedErrorId : CE668F87,Microsoft.Exchange.Management.SystemConfigurationTasks.StartDatabaseAvailabilityGroup

 

Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-4

An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API 'OpenByNames('MBX-3.exchange.msft', 'MBX-4.exchange.msft') failed for each server. Specific exceptions: 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-
3.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..','An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-4.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..'.' failed..
    + CategoryInfo          : NotSpecified: (0:Int32) [Start-DatabaseAvailabilityGroup], AmClusterApiException
    + FullyQualifiedErrorId : BB89A63D,Microsoft.Exchange.Management.SystemConfigurationTasks.StartDatabaseAvailabilityGroup

 

The failures that are displayed are expected. The Cluster services on the DAG members are not started at this time. We can use Get-DatabaseAvailabilityGroup to verify that the StartedMailboxServers and StoppedMailboxServers lists are correct.

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name                  : DAG
Servers               : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers : {MBX-1.exchange.msft, MBX-2.exchange.msft}
StartedMailboxServers : {MBX-4.exchange.msft, MBX-3.exchange.msft}

 

The second step is to ensure that the Cluster service is stopped on MBX-3 and MBX-4.

 

Stop-Service ClusSvc

 

The last step is to run the Restore-DatabaseAvailabilityGroup command. This will complete the datacenter switchover process by forcing the Cluster service to start and by evicting the nodes on the stopped servers list.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -Confirm:$FALSE

WARNING: The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server dc-2.exchange.msft.

 

This completes the datacenter switchover for the database availability group. The procedure can now continue with database activation and changes required for client access.

 

====================================================================================

In the last example we have a four-member DAG. Two members are installed in a primary datacenter with the witness server, and two members are installed in a remote datacenter with an alternate witness server configured. Both datacenters are in the same Active Directory site. The same situation described in this example can occur when multiple Active Directory sites are used, but in my experience, this problem most commonly occurs with just a single Active Directory site.  Here is an example network diagram:

 

image

 

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down. This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter. So, the administrator starts the datacenter switchover process with Stop-DatabaseAvailabilityGroup:

 

Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -Confirm:$FALSE

 

Next, the cluster service is stopped on MBX-3 and MBX-4.

 

Stop-service ClusSvc

 

Finally, Restore-DatabaseAvailabilityGroup is issued.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -Confirm:$FALSE

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_16-57-27.326_restore-databaseavailabilitygroup.log". Unable to form quorum for database availability group 'DAG'. Please try the operation again, or run the Restore-DatabaseAvailabilityGroup cmdlet and specify the site with servers known to be running.
+ CategoryInfo : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], DagTaskQuorumNotAchievedException
+ FullyQualifiedErrorId : C7FE0CB9,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

 

As with the previous examples, the problem occurred because the administrator issued the Stop-DatabaseAvailabilityGroup command with the ActiveDirectorySite parameter and all servers were added to the stopped servers list. This is verified with Get-DatabaseAvailabilityGroup.

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name : DAG
Servers : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers : {MBX-1.exchange.msft, MBX-4.exchange.msft, MBX-3.exchange.msft, MBX-2.exchange.msft}
StartedMailboxServers : {}

 

The extent of the issue is realized when we attempt to correct the started and stopped mailbox server lists and proceed with the switchover process.  As with the previous examples, we use Start-DatabaseAvailabilityGroup with the MailboxServer parameter to start the individual servers in the remote datacenter.

 

Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-3

An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API 'OpenByNames('MBX-3.exchange.msft') failed for each server. Specific exceptions: 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-3.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..'.' failed..
+ CategoryInfo : NotSpecified: (0:Int32) [Start-DatabaseAvailabilityGroup], AmClusterApiException
+ FullyQualifiedErrorId : CE668F87,Microsoft.Exchange.Management.SystemConfigurationTasks.StartDatabaseAvailabilityGroup

 

Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-4

An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API 'OpenByNames('MBX-3.exchange.msft', 'MBX-4.exchange.msft') failed for each server. Specific exceptions: 'An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-3.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..','An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(MBX-4.exchange.msft) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed..'.' failed..
+ CategoryInfo : NotSpecified: (0:Int32) [Start-DatabaseAvailabilityGroup], AmClusterApiException
+ FullyQualifiedErrorId : BB89A63D,Microsoft.Exchange.Management.SystemConfigurationTasks.StartDatabaseAvailabilityGroup

 

The failures that are displayed are expected because the Cluster services on the DAG members are not in a started state. Using Get-DatabaseAvailabilityGroup, we note that the servers are correct on both the StartedMailboxServers and StoppedMailboxServers lists.

 

Get-DatabaseAvailabilityGroup -Identity DAG | fl name,servers,stoppedmailboxservers,startedmailboxservers

Name : DAG
Servers : {MBX-2, MBX-3, MBX-4, MBX-1}
StoppedMailboxServers : {MBX-1.exchange.msft, MBX-2.exchange.msft}
StartedMailboxServers : {MBX-4.exchange.msft, MBX-3.exchange.msft}

 

Using Stop-Service, we stop the Cluster service on each server.

 

Stop-Service ClusSvc

 

Finally, the last step should be to run Restore-DatabaseAvailabilityGroup to force quorum on the remaining servers and evict the servers on the stopped servers list.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange -Confirm:$FALSE

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_17-55-16.974_restore-databaseavailabilitygroup.log". Couldn't start the Cluster service on 'MBX-3'. Service state: Stopped. Try forcing the cluster to start without quorum by running "net start clussvc /fq" from a command prompt on that node. + CategoryInfo          : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], FailedToStartClusSvcException
    + FullyQualifiedErrorId : 6CD04940,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

 

As shown above, Restore-DatabaseAvailabilityGroup failed because it could not start the Cluster service on MBX-3 using force quorum.  The error suggests that the administrator should attempt to manually start the service with /forcequorum.

 

net start clussvc /fq

System error 1058 has occurred. The service cannot be started, either because it is disabled or because it has no enabled devices associated with it.

 

After attempting to manually start the Cluster service with /forceQuorum, the above error is displayed, which indicates that the Cluster service is disabled.

When reviewing Service Control Manager, we note that the Cluster service on the remaining members is set to Disabled.

 

image

 

When reviewing the system event log, we see the following event at or about the time the Stop-DatabaseAvailabilityGroup was issued.

 

Log Name:      System
Source:        Service Control Manager
Date:          8/12/2012 10:30:07 AM
Event ID:      7040
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The start type of the Cluster Service service was changed from auto start to disabled.

 

This is where the extent of the mistake is exposed. Stop-DatabaseAvailabilityGroup was not only run against servers that should not have been stopped, but it was also run without the ConfigurationOnly parameter. When the cmdlet is run without the ConfigurationOnly parameter, any accessible servers that are being stopped will have their Cluster service forcibly cleaned up. This in turn prevents Restore-DatabaseAvailabilityGroup from being successful.

 

In order to overcome this situation, the administrator must re-establish the cluster and then proceed with database activation.  The first step is to ensure that the Cluster service is completely cleaned up on the DAG members in the remote datacenter.

 

Windows 2008:

cluster node /forcecleanup

Attempting to clean up node '' ...
Clean up successfully completed.

 

Windows 2008 R2:

Import-Module FailoverClusters


Clear-ClusterNode -Force -Verbose -Confirm:$FALSE

VERBOSE: Performing operation "Clear-ClusterNode" on Target "MBX-3".
VERBOSE: Clearing cluster node MBX-3
.

 

The second step is to use Active Directory Users and Computers to locate the DAG’s CNO. Right-click the CNO and select Reset Account, and then right-click the CNO and select Disable Account. Allow sufficient time for the disabled account to replicate throughout Active Directory.

 

The third step is to manually create the cluster.  There are three methods to manually create the cluster.

 

Windows 2008 and Windows 2008 R2 utilizing Failover Cluster Manager:

Launch Failover Cluster Manager.

 

image

 

In the upper right corner select “Create a cluster…”

 

image

 

In the “Before you begin” dialog, select Next.

 

image

 

On the “Selected Server” dialog enter the server names of all servers in the remote datacenter.  In our example, we will add MBX-3 and MBX-4.  Select the Add button after each server name.  Select Next when completed.

 

image

 

On the “Validation Warning” select NO.  Select Next when completed.

 

image

 

On the “Access Point for Administering the Cluster” in the “Cluster Name:” field, enter the name of the DAG.  In our example we will use DAG (creative eh?).  In the networks dialog enter the IP address assigned to the DAG in the remote datacenter (if you are not sure you can use Get-DatabaseAvailabilityGroup | fl name,databaseavailabilitygroupipaddresses to list the IP addresses assigned to the DAG).  Select Next when complete.

 

image

 

On the “Confirmation” select Next.

 

image

 

At this time the Cluster service should be configured on both servers.  On the “Summary” select Finish.

 

image

 

The last step is to use the Exchange Management Shell and run the following command:

 

Set-DatabaseAvailabilityGroup –identity DAG

 

Running this command without specifying any additional values ensures that the DAG settings from Active Directory are applied to the new cluster.

 

Windows 2008 Command Line:

Cluster.exe DAGNAME /create /nodes:”NODE1 NODE2 NODE3” /ipaddress:”IP/Subnet”

C:\>cluster.exe DAG /create /nodes:"MBX-3 MBX-4" /ipAddress:"192.168.1.20/24"
  4% Initializing Cluster DAG.
  9% Validating cluster state on node MBX-3.
13% Searching the domain for computer object DAG
18% Verifying computer object DAG in the domain
22% Configuring computer object DAG as cluster name object
27% Validating installation of the Microsoft Failover Cluster Virtual Adapter on node MBX-3.
31% Validating installation of the Cluster Disk Driver on node MBX-3.
36% Configuring Cluster Service on node MBX-3.
40% Validating installation of the Microsoft Failover Cluster Virtual Adapter on node MBX-4.
45% Validating installation of the Cluster Disk Driver on node MBX-4.
50% Configuring Cluster Service on node MBX-4.
54% Starting Cluster Service on node MBX-3.
54% Starting Cluster Service on node MBX-4.
59% Forming cluster DAG.
63% Adding cluster common properties to DAG.
68% Creating resource types on cluster DAG.
72% Creating group 'Cluster Group'.
72% Creating group 'Available Storage'.
77% Creating IP Address resource 'Cluster IP Address'.
81% Creating Network Name resource 'DAG'.
86% Searching the domain for computer object DAG
90% Verifying computer object DAG in the domain
95% Configuring computer object DAG as cluster name object

100% Bringing resource group 'Cluster Group' online.

Windows 2008 R2 PowerShell:

Import-Module FailoverClusters

New-Cluster -Name DAGNAME -Node NODE1,NODE2,NODE3 -StaticAddress IPAddress -NoStorage

[PS] C:\>Import-Module FailoverClusters

[PS] C:\>New-Cluster -Name DAG -Node MBX-3,MBX-4 -StaticAddress 192.168.1.20 -NoStorage
Report file location: C:\Windows\cluster\Reports\Create Cluster Wizard DAG on 2013.08.12 At 12.09.55.mht

Name
----
DAG

At this time, the started and stopped mailbox server lists are accurate, and the Cluster service for the DAG has been re-established.  To ensure the configuration is correct, the administrator can run Set-DatabaseAvailabilityGroup, which ensures that the DAG configuration in Active Directory matches the cluster configuration.

This completes the datacenter switchover for the database availability group. The procedure can now continue with database activation and changes required for client access.

 

====================================================================================

 

This blog post covered three common scenarios in which administrators make mistakes when using Stop-DatabaseAvailabilityGroup. When used incorrectly, the cmdlet can have unintended results, and the steps outlined here can be used to recover from them.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error caused a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================

Part 9: Datacenter Activation Coordination: An error caused a change in the current set of domain controllers


As a part of a datacenter switchover process, administrators run Stop-DatabaseAvailabilityGroup to stop DAG members in the failed datacenter.  This cmdlet is responsible for updating the stoppedMailboxServers attribute of the DAG object within Active Directory.

 

When this command is run and multiple AD sites are involved, the command attempts to force AD replication between the sites so that all AD sites are aware of the stopped mailbox servers.  This allows us to bypass issues that can arise when non-default replication times are used on AD site replication connections. 

 

In many cases, not only are the Exchange servers in the primary datacenter failed, but so are the supporting domain controllers for that AD site.  There may also be scenarios where the remote site where the command is being executed has no network connectivity to domain controllers in the primary site.  When this occurs, Stop-DatabaseAvailabilityGroup fails with the following error:

 

[PS] C:\>Stop-DatabaseAvailabilityGroup -ActiveDirectorySite Exchange-B -ConfigurationOnly:$TRUE -Identity DAG

Confirm
Are you sure you want to perform this action?
Stopping Mailbox servers for Active Directory site "Exchange-B" in database availability group "DAG".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): a
WARNING: Active Directory couldn't be updated in Exchange-B site(s) affected by the change to 'DAG'. It won't be
completely usable until after Active Directory replication occurs.
An error caused a change in the current set of domain controllers.
    + CategoryInfo          : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 3647E7F3

 

If this error occurs as a result of issuing Stop-DatabaseAvailabilityGroup when known connectivity issues exist to domain controllers in the site hosting the Exchange servers that are being stopped, the error can be safely ignored.  When domain controllers come back up in the primary datacenter, normal Active Directory replication will populate this attribute on those domain controllers, and other safeguards exist in the product, so the solution does not depend on this attribute being updated in the failed site.
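If you want to see which domain controllers the current management shell session is bound to, or point the session at a known-good domain controller in the surviving site before running the switchover cmdlets, the following sketch can be used (dc-2.exchange.msft is a hypothetical domain controller in the surviving site; this does not suppress the warning above, it only controls which domain controller the shell prefers):

# Show the domain controllers the current session is using, then prefer one in the surviving site.
Get-AdServerSettings | Format-List
Set-AdServerSettings -PreferredServer dc-2.exchange.msft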

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error caused a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================

Exchange 2013: OutlookMailboxDeepTestProbe fails with Access Denied


I recently had an opportunity to work with a customer who was experiencing unexpected database failovers between nodes.  When looking at the ProbeResult log within the ActiveMonitoring crimson event channel, the following error event was noted (the full event text is at the end of this post):

 

Log Name:      Microsoft-Exchange-ActiveMonitoring/ProbeResult
Source:        Microsoft-Exchange-ActiveMonitoring
Date:          7/8/2013 12:08:31 PM
Event ID:      2
Task Category: Probe result
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SERVER.domain.com
Description:
Probe result (Name=OutlookMailboxDeepTestProbe/DATABASE)

 

Contained within the event is the following:

 

<Error>Error 0x5 (Access is denied) from ClientAsyncCallState.CheckCompletion: RpcAsyncCompleteCall
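If you need to pull these probe failures yourself, the ProbeResult channel can be queried with Get-WinEvent.  A minimal sketch (the probe name used in the filter is just an example):

# Pull recent error-level probe results and filter to the probe of interest.
Get-WinEvent -FilterHashtable @{ LogName = 'Microsoft-Exchange-ActiveMonitoring/ProbeResult'; Level = 2 } -MaxEvents 200 |
    Where-Object { $_.Message -like '*OutlookMailboxDeepTestProbe*' } |
    Format-Table TimeCreated, Id, Message -Wrap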

 

In the process of troubleshooting we determined that the LMCompatibilityLevel was set to 1.  It should be noted that the default for Windows 2003 is an LMCompatibilityLevel of 2.  When the LMCompatibilityLevel is set to 1, the authentication mechanism that the health service utilizes for these probes fails.  Due to the number of logon failures against the same database, the health service attempts to fail the database over to another server to correct the condition.  This does not correct the condition, since the issue is domain-level authentication and the LMCompatibilityLevel is consistent across all nodes.

 

This issue also impacted the OutlookSelfTestProbe and may result in the RPC Client Access service terminating and restarting.  Setting the LMCompatibilityLevel to 1 forces the enablement of the LanMan hash, which causes the winhttpAutoLogon setting to change to high, so no default credentials are sent with requests.

 

Information on LMCompatibilityLevel can be found here:  http://technet.microsoft.com/en-us/library/cc960646.aspx

 

To correct this condition, the LMCompatibilityLevel in the registry of the Exchange servers and all domain controllers was changed to 2, and the servers were rebooted.
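For reference, LMCompatibilityLevel lives under the Lsa registry key.  A sketch of checking and changing it with PowerShell is below; if the value is enforced through the "Network security: LAN Manager authentication level" Group Policy setting, change it there instead, and remember that a reboot is required:

# Check the current value (no value present means the operating system default applies).
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' -Name LmCompatibilityLevel -ErrorAction SilentlyContinue

# Set the value to 2 on the Exchange servers and domain controllers, then reboot.
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa' -Name LmCompatibilityLevel -Value 2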

 

================================================================================

Log Name:      Microsoft-Exchange-ActiveMonitoring/ProbeResult
Source:        Microsoft-Exchange-ActiveMonitoring
Date:          7/8/2013 12:08:31 PM
Event ID:      2
Task Category: Probe result
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SERVER.domain.com
Description:
Probe result (Name=OutlookMailboxDeepTestProbe/DATABASE)
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Exchange-ActiveMonitoring" Guid="{ECD64F52-A3BC-47B8-B681-A11B7A1C8770}" />
    <EventID>2</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>2</Task>
    <Opcode>0</Opcode>
    <Keywords>0x800000000000000</Keywords>
    <TimeCreated SystemTime="2013-07-08T19:08:31.121798100Z" />
    <EventRecordID>3319929</EventRecordID>
    <Correlation />
    <Execution ProcessID="49060" ThreadID="43444" />
    <Channel>Microsoft-Exchange-ActiveMonitoring/ProbeResult</Channel>
    <Computer>SERVER.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <EventXML xmlns:auto-ns2="
http://schemas.microsoft.com/win/2004/08/events" xmlns="myNs">
      <ResultId>3320940</ResultId>
      <ServiceName>Outlook.Protocol</ServiceName>
      <IsNotified>0</IsNotified>
      <ResultName>OutlookMailboxDeepTestProbe/DATABASE</ResultName>
      <WorkItemId>251</WorkItemId>
      <DeploymentId>0</DeploymentId>
      <MachineName>SERVER</MachineName>
      <Error>Error 0x5 (Access is denied) from ClientAsyncCallState.CheckCompletion: RpcAsyncCompleteCall
EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 2
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1710
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 1
EEInfo: prm[0]: Long val: 0 (0x00000000)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 14
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1398
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 2
EEInfo: prm[0]: Long val: 4 (0x00000004)
EEInfo: prm[1]: Long val: -1073606612 (0xC002102C)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 13
EEInfo: Status: 0xC002102C
EEInfo: Detection location: 1401
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 13
EEInfo: Status: 0x00000191
EEInfo: Detection location: 1417
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 1
EEInfo: prm[0]: Unicode string: Unauthorized

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.106
EEInfo: Generating component: 13
EEInfo: Status: 0x00000000
EEInfo: Detection location: 3041
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.106
EEInfo: Generating component: 13
EEInfo: Status: 0x00000000
EEInfo: Detection location: 3041
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0
</Error>
      <Exception>Microsoft.Exchange.Rpc.RpcException: Error 0x5 (Access is denied) from ClientAsyncCallState.CheckCompletion: RpcAsyncCompleteCall
EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 2
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1710
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 1
EEInfo: prm[0]: Long val: 0 (0x00000000)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 14
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1398
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 2
EEInfo: prm[0]: Long val: 4 (0x00000004)
EEInfo: prm[1]: Long val: -1073606612 (0xC002102C)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 13
EEInfo: Status: 0xC002102C
EEInfo: Detection location: 1401
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 13
EEInfo: Status: 0x00000191
EEInfo: Detection location: 1417
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 1
EEInfo: prm[0]: Unicode string: Unauthorized

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.106
EEInfo: Generating component: 13
EEInfo: Status: 0x00000000
EEInfo: Detection location: 3041
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.106
EEInfo: Generating component: 13
EEInfo: Status: 0x00000000
EEInfo: Detection location: 3041
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 0

   at Microsoft.Exchange.Rpc.ClientAsyncCallState.CheckCompletion()
   at Microsoft.Exchange.Rpc.ExchangeClient.ClientAsyncCallState_Connect.End(IntPtr&amp; contextHandle, TimeSpan&amp; pollsMax, Int32&amp; retryCount, TimeSpan&amp; retryDelay, String&amp; dnPrefix, String&amp; displayName, Int16[]&amp; DATABASEVersion, ArraySegment`1&amp; segmentExtendedAuxOut)
   at Microsoft.Exchange.Rpc.ExchangeClient.ExchangeAsyncRpcClient.EndConnect(ICancelableAsyncResult result, IntPtr&amp; contextHandle, TimeSpan&amp; pollsMax, Int32&amp; retryCount, TimeSpan&amp; retryDelay, String&amp; dnPrefix, String&amp; displayName, Int16[]&amp; DATABASEVersion, ArraySegment`1&amp; segmentExtendedAuxOut)
   at Microsoft.Exchange.RpcClientAccess.Monitoring.EmsmdbClient.ConnectCallContext.OnEnd(ICancelableAsyncResult asyncResult)
   at Microsoft.Exchange.RpcClientAccess.Monitoring.ClientCancelableCallContext`1.&lt;InternalEnd&gt;b__3(ICancelableAsyncResult r)
   at Microsoft.Exchange.RpcClientAccess.Monitoring.ClientCancelableCallContext`1.DeferExceptions[TArgIn](Action`1 guardedAction, TArgIn arg)</Exception>
      <RetryCount>0</RetryCount>
      <StateAttribute1>[null]</StateAttribute1>
      <StateAttribute2>SERVER.domain.com</StateAttribute2>
      <StateAttribute3>SERVER.domain.com</StateAttribute3>
      <StateAttribute4>HealthMailboxed1e4e23d3ba446aaccec3ae4e13c600</StateAttribute4>
      <StateAttribute5>{AC73C7EA-935A-4EBA-8B31-E9ECA8430D2A}</StateAttribute5>
      <StateAttribute6>0</StateAttribute6>
      <StateAttribute7>0</StateAttribute7>
      <StateAttribute8>0</StateAttribute8>
      <StateAttribute9>0</StateAttribute9>
      <StateAttribute10>0</StateAttribute10>
      <StateAttribute11>EMSMDB.Connect()</StateAttribute11>
      <StateAttribute12>[21] -EMSMDB.Connect(); </StateAttribute12>
      <StateAttribute13>[null]</StateAttribute13>
      <StateAttribute14>[null]</StateAttribute14>
      <StateAttribute15>[null]</StateAttribute15>
      <StateAttribute16>0</StateAttribute16>
      <StateAttribute17>0</StateAttribute17>
      <StateAttribute18>0</StateAttribute18>
      <StateAttribute19>0</StateAttribute19>
      <StateAttribute20>0</StateAttribute20>
      <StateAttribute21>[null]</StateAttribute21>
      <StateAttribute22>[null]</StateAttribute22>
      <StateAttribute23>[null]</StateAttribute23>
      <StateAttribute24>[null]</StateAttribute24>
      <StateAttribute25>[null]</StateAttribute25>
      <ResultType>4</ResultType>
      <ExecutionId>53740913</ExecutionId>
      <ExecutionStartTime>2013-07-08T19:08:31.1061979Z</ExecutionStartTime>
      <ExecutionEndTime>2013-07-08T19:08:31.1217981Z</ExecutionEndTime>
      <PoisonedCount>0</PoisonedCount>
      <ExtensionXml>[null]</ExtensionXml>
      <SampleValue>21.6446</SampleValue>
      <ExecutionContext>    Mailbox logon verification
        EMSMDB.Connect()
        Task produced output:
        - TaskStarted = 7/8/2013 12:08:31 PM
        - TaskFinished = 7/8/2013 12:08:31 PM
        - Exception = Microsoft.Exchange.Rpc.RpcException: Error 0x5 (Access is denied) from ClientAsyncCallState.CheckCompletion: RpcAsyncCompleteCall
EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 2
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1710
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 1
EEInfo: prm[0]: Long val: 0 (0x00000000)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
EEInfo: Generation Time: 2013-07-08 19:08:31.121
EEInfo: Generating component: 14
EEInfo: Status: 0x00000005
EEInfo: Detection location: 1398
EEInfo: Flags: 0
EEInfo: NumberOfParameters: 2
EEInfo: prm[0]: Long val: 4 (0x00000004)
EEInfo: prm[1]: Long val: -1073606612 (0xC002102C)

EEInfo: ComputerName: n/a
EEInfo: ProcessID: 49060
</ExecutionContext>
      <FailureContext>
      </FailureContext>
      <FailureCategory>-1</FailureCategory>
      <Version>65536</Version>
    </EventXML>
  </UserData>
</Event>

================================================================================


RPC Filtering and Exchange 2010 Database Availability Groups


Recently I’ve had the opportunity to work with customers who were having issues seeding databases using update-mailboxdatabasecopy in Exchange 2010.  When attempting to perform an update the following sample error was returned:

 

A source-side operation failed. Error An error occurred while performing the seed operation. Error: Failed to open a log truncation context to source server 'SOURCE-SERVER'. Hresult: 0xc7ff07d7. Error: Failed to open a log truncation context because the Microsoft Exchange Information Store service is not running.. [Database: MailboxDatabase2, Server: TARGET-SERVER]

*Note that the HResult may be different in the error even though the root of the issue is the same.

 

In each instance the server we were trying to run the update for was located across a WAN link or separated by firewall devices. 

 

In the reference cases I worked on, we found that the devices providing the WAN connectivity were performing RPC packet inspection.  For example, Threat Management Gateway has an RPC inspection agent, and Cisco devices have a setting to enable DCERPC filtering.  It would appear that certain RPCs that originate from Windows 2008 and Windows 2008 R2 do not conform to the expected format that these filters use.  When a non-conforming packet is identified, it is subsequently dropped.

 

We have also observed RPC filtering cause the following issues:

 

  • Continuous replication circular logging fails to trigger log truncation across nodes.
  • Log truncation does not occur in a DAG when a backup is successful on a member that has traffic between nodes subject to RPC filtering.
  • Backup header information for databases does not update on active database copies when a backup is successful on a member that has traffic between nodes subject to RPC filtering.

 

To correct the issue RPC filtering had to be disabled on both the source and target devices providing the WAN connectivity between sites.
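After the filtering is disabled, the seed can be retried and its progress checked.  A minimal sketch using the database and server names from the sample error above (add -DeleteExistingFiles only if partial files from the failed seed remain on the target):

Update-MailboxDatabaseCopy -Identity "MailboxDatabase2\TARGET-SERVER" -DeleteExistingFiles
Get-MailboxDatabaseCopyStatus -Identity "MailboxDatabase2\TARGET-SERVER" | Format-List Status, CopyQueueLength, ReplayQueueLength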

Exchange 2010: Remove-DatabaseAvailabilityGroupServer –ConfigurationOnly does not evict the member from the cluster.


Administrators may encounter conditions where DAG members cannot be gracefully removed from a database availability group.  For example, a member server may have encountered an unrecoverable failure or the server may need to be removed from the DAG in order to perform a server recovery.

 

In order to account for these and similar conditions, the remove-databaseavailabilitygroupserver –configurationOnly command exists.  This command, when utilized, simply removes the member from the database availability group’s Active Directory object.

 

Here is an example…

 

Using get-databaseavailabilitygroup –status | fl name,servers,operationalservers the membership of the DAG can be verified.  In this example the only operational server is MBX-1 since that is the only server currently running in the DAG.

 

[PS] C:\>Get-DatabaseAvailabilityGroup DAG -status | fl name,Servers,OperationalServers


Name               : DAG
Servers            : {MBX-1, MBX-2}
OperationalServers : {MBX-1}

 

Using the remove-databaseavailabilitygroupserver –configurationOnly command a DAG member can be removed.

 

[PS] C:\>Remove-DatabaseAvailabilityGroupServer -Identity DAG -MailboxServer MBX-2 -ConfigurationOnly

Confirm
Are you sure you want to perform this action?
Removing Mailbox server "MBX-2" from database availability group "DAG".
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"): a

 

The results of the command can be verified using get-databaseavailabilitygroup –status | fl name,servers,operationalservers:

 

[PS] C:\>Get-DatabaseAvailabilityGroup DAG -status | fl name,Servers,OperationalServers


Name : DAG
Servers : {MBX-1}
OperationalServers : {MBX-1}

 

When a server is removed from the DAG in this manner, it is not evicted from the corresponding cluster.  You can verify cluster membership using the built-in cluster commands.  Here is an example from this test:

 

( Windows 2008 / Windows 2008 R2 )

 

[PS] C:\>cluster.exe node
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
MBX-1                1 Up
MBX-2                2 Down

( Windows 2008 R2 )

[PS] C:\>Import-Module FailoverClusters

[PS] C:\>Get-ClusterNode

Name                                State
----                                -----
mbx-1                                  Up
mbx-2                                Down

 

In general this issue surfaces when administrators complete a server rebuild operation and note that the rebuilt node cannot be added back to the cluster because it already exists in the cluster.  Here is an example:

 

[PS] C:\>Add-DatabaseAvailabilityGroupServer –identity DAG –mailboxServer MBX-2


WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file
"C:\ExchangeSetupLogs\DagTasks\dagtask_2012-06-24_14-51-47.841_add-databaseavailabiltygroupserver.log".
A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Node mbx-2 is already joined to a cluster. [Server: MBX-1.domain.com]
    + CategoryInfo          : InvalidArgument: (:) [Add-DatabaseAvailabilityGroupServer], DagTaskOperationFailedException
    + FullyQualifiedErrorId : D05F37CD,Microsoft.Exchange.Management.SystemConfigurationTasks.AddDatabaseAvailabilityGroupServer

 

 

When using remove-databaseavailabilitygroupserver –configurationOnly, administrators must also remove the node from the cluster.  This can be accomplished through one of two methods:

 

( Windows 2008 / Windows 2008 R2 )

 

Administrators may utilize Failover Cluster Manager.  After connecting to the cluster servicing the database availability group, expand Nodes, right-click the node that was removed, select More Actions, and then select Evict.

 

image

 

( Windows 2008 R2 )


[PS] C:\>Import-Module FailoverClusters

[PS] C:\>Remove-ClusterNode MBX-2

Remove-ClusterNode
Are you sure you want to evict node mbx-2?
[Y] Yes  [N] No  [S] Suspend  [?] Help (default is "Y"): y
Remove-ClusterNode : The cluster node 'MBX-2' was evicted from the cluster, but was not fully cleaned up.  Please see the Failover Clustering application event log on node MBX-2 for more information.
    The RPC server is unavailable
At line:1 char:19
+ Remove-ClusterNode <<<<  MBX-2
    + CategoryInfo          : NotSpecified: (:) [Remove-ClusterNode], ClusterCmdletException
    + FullyQualifiedErrorId : Remove-ClusterNode,Microsoft.FailoverClusters.PowerShell.RemoveClusterNodeCommand

 

(Note:  The RPC error is expected, as the command attempts to clean up the local cluster configuration on the node but the node is not accessible.)

 

After cleaning up the cluster configuration the administrator can run set-databaseavailabilitygroup –identity <DAGNAME> to ensure the appropriate cluster configuration is utilized.
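Putting the tail end of the process together, here is a sketch of the order of operations after the eviction, using the names from this example:

# Re-apply the DAG settings from Active Directory to the cluster.
Set-DatabaseAvailabilityGroup -Identity DAG

# The rebuilt server can now be added back to the DAG without the "already joined to a cluster" error.
Add-DatabaseAvailabilityGroupServer -Identity DAG -MailboxServer MBX-2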

Exchange 2013 Cumulative Update 2 – Where are my databases?


In Exchange Server 2013 Cumulative Update 2 (CU2) we introduced support for a maximum of 100 mounted mailbox databases per server (active or passive) when an Enterprise Edition license is applied.  This was an increase from Exchange Server 2013 RTM and Exchange Server 2013 Cumulative Update 1 (CU1).

 

It has been recently discovered that when updating to CU2 from Exchange Server 2013 RTM or Exchange Server 2013 CU1, the database limit is not increased to 100. This results in the inability to mount more than 50 databases on a server running the Enterprise Edition of Exchange Server 2013. When installing Exchange Server 2013 CU2 directly, the database limit is applied correctly.

 

The Core Problem

 

In this example, I deployed a new install of Exchange Server 2013 CU1 on Windows Server 2012 (although the operating system doesn’t matter here). After Setup completed, I performed an LDP dump of the information store object and noted a database limit of 5.

 

Expanding base 'CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft'...
Getting 1 entries:
Dn: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
adminDisplayName: InformationStore;
cn: InformationStore;
distinguishedName: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamLogFileSize: 5120;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchMaxRestoreStorageGroups: 1;
msExchMaxStorageGroups: 5;
msExchMaxStoresPerGroup: 5;
msExchMaxStoresTotal: 5;

name: InformationStore;
objectCategory: CN=ms-Exch-Information-Store,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchInformationStore;
objectGUID: 52c2fc98-b8b4-4d33-b35b-ca00cb3fcfff;
showInAdvancedViewOnly: TRUE;
uSNChanged: 24300;
uSNCreated: 24300;
whenChanged: 10/27/2013 3:58:35 PM Pacific Daylight Time;
whenCreated: 10/27/2013 3:58:35 PM Pacific Daylight Time;

-----------

 

This is normal and expected because all unlicensed installations (e.g., Trial) are treated as Standard Edition, and Standard Edition can have a maximum of 5 mounted databases.
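If you prefer PowerShell over LDP for checking this value, the information store objects can be read directly from the configuration partition.  A sketch, assuming the ActiveDirectory module is available:

# Read msExchMaxStoresTotal for every information store object in the configuration partition.
Import-Module ActiveDirectory
$configNC = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -LDAPFilter '(objectClass=msExchInformationStore)' -SearchBase $configNC -Properties msExchMaxStoresTotal, msExchMaxStorageGroups |
    Format-List DistinguishedName, msExchMaxStoresTotal, msExchMaxStorageGroups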

 

Next, I entered an Enterprise Edition license key. This updated the value msExchMaxStoresTotal to a value of 50, which is the expected value for the Enterprise Edition of Exchange Server 2013 RTM and Exchange Server 2013 CU1.

 

Expanding base 'CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft'...
Getting 1 entries:
Dn: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
adminDisplayName: InformationStore;
cn: InformationStore;
distinguishedName: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamLogFileSize: 5120;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchMaxRestoreStorageGroups: 1;
msExchMaxStorageGroups: 100;
msExchMaxStoresPerGroup: 5;
msExchMaxStoresTotal: 50;
msExchMinAdminVersion: -2147453113;
msExchVersion: 4535486012416;
name: InformationStore;
objectCategory: CN=ms-Exch-Information-Store,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchInformationStore;
objectGUID: 52c2fc98-b8b4-4d33-b35b-ca00cb3fcfff;
showInAdvancedViewOnly: TRUE;
uSNChanged: 27284;
uSNCreated: 24300;
whenChanged: 10/28/2013 4:44:30 AM Pacific Daylight Time;
whenCreated: 10/27/2013 3:58:35 PM Pacific Daylight Time;

-----------
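You can also confirm the licensing state from the Exchange Management Shell. This is a minimal sketch using the example server name TEST-MBX-0; the edition and trial-related properties are returned by Get-ExchangeServer:

# Confirm that the server is recognized as Enterprise Edition and is no longer a trial.
Get-ExchangeServer -Identity TEST-MBX-0 | Format-List Name, Edition, IsExchangeTrialEdition, IsExpiredExchangeTrialEdition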

 

Next, I upgraded the server to Exchange Server 2013 CU2. Once Setup completed, I reviewed the value of msExchMaxStoresTotal and found that it had not been updated to 100 as expected.

 

Expanding base 'CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft'...
Getting 1 entries:
Dn: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
adminDisplayName: InformationStore;
cn: InformationStore;
distinguishedName: CN=InformationStore,CN=TEST-MBX-0,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=EXCHANGE,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamLogFileSize: 5120;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchMaxRestoreStorageGroups: 1;
msExchMaxStorageGroups: 100;
msExchMaxStoresPerGroup: 5;
msExchMaxStoresTotal: 50;
msExchMinAdminVersion: -2147453113;
msExchVersion: 4535486012416;
name: InformationStore;
objectCategory: CN=ms-Exch-Information-Store,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchInformationStore;
objectGUID: 52c2fc98-b8b4-4d33-b35b-ca00cb3fcfff;
showInAdvancedViewOnly: TRUE;
uSNChanged: 27284;
uSNCreated: 24300;
whenChanged: 10/28/2013 4:44:30 AM Pacific Daylight Time;
whenCreated: 10/27/2013 3:58:35 PM Pacific Daylight Time;

-----------

 

In this state, when the administrator tries to mount more than 50 databases (any combination of active or passive) on this server, they will receive an error message:

 

[PS] C:\>Mount-Database DB51
Couldn't mount the database that you specified. Specified database: DB51; Error code: An Active Manager operation failed. Error: The database action failed. Error: Operation failed with message: MapiExceptionTooManyMountedDatabases: Unable to mount database. (hr=0x8004060e, ec=-2147219954)

Diagnostic context:
    Lid: 65256
    Lid: 10722   StoreEc: 0x8004060E
    Lid: 1494    ---- Remote Context Beg ----
    Lid: 37952   dwParam: 0x144489A
    Lid: 39576   StoreEc: 0x977
    Lid: 35200   dwParam: 0x39F8
    Lid: 58864   StoreEc: 0x8004060E
    Lid: 43248   StoreEc: 0x8004060E
    Lid: 48432   StoreEc: 0x8004060E
    Lid: 54336   dwParam: 0x14448C8
    Lid: 1750    ---- Remote Context End ----
    Lid: 1047    StoreEc: 0x8004060E [Database: DB51, Server: TEST-MBX-0.exchange.msft].
    + CategoryInfo          : InvalidOperation: (DB51:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : [Server=TEST-MBX-0,RequestId=797928f4-fe83-4e1f-8545-e01a72aaf79d,TimeStamp=10/28/2013 7
   :53:32 PM] 5FE3BF1B,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase
    + PSComputerName        : test-mbx-0.exchange.msft

 

In addition, three events will be logged to the Application event log:

 

Log Name:      Application
Source:        MSExchangeIS
Date:          10/28/2013 1:00:27 PM
Event ID:      40003
Task Category: High Availability
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Test-MBX-0.exchange.msft
Description:
Exceeded the max number of 50 databases on this server.

Log Name:      Application
Source:        MSExchangeRepl
Date:          10/28/2013 1:00:27 PM
Event ID:      3154
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Test-MBX-0.exchange.msft
Description:
Active Manager failed to mount database DB51 on server Test-MBX-0.exchange.msft. Error: An Active Manager operation failed. Error: The database action failed. Error: Operation failed with message: MapiExceptionTooManyMountedDatabases: Unable to mount database. (hr=0x8004060e, ec=-2147219954)
Diagnostic context:
    Lid: 65256 
    Lid: 10722   StoreEc: 0x8004060E
    Lid: 1494    ---- Remote Context Beg ----
    Lid: 37952   dwParam: 0x14A9D25
    Lid: 39576   StoreEc: 0x977    
    Lid: 58864   StoreEc: 0x8004060E
    Lid: 43248   StoreEc: 0x8004060E
    Lid: 48432   StoreEc: 0x8004060E
    Lid: 54336   dwParam: 0x14A9D54
    Lid: 1750    ---- Remote Context End ----
    Lid: 1047    StoreEc: 0x8004060E

Log Name:      Application
Source:        ExchangeStoreDB
Date:          10/28/2013 1:00:27 PM
Event ID:      226
Task Category: Database recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Test-MBX-0.exchange.msft
Description:
At '10/28/2013 1:00:27 PM' the copy of database 'DB51' on this server didn't mount because the number of mailbox database copies on this server exceeds the supported limit. The error returned by failover was "There is only one copy of this mailbox database (DB51). Automatic recovery is not available.".

 

Similar errors and events will occur when attempting to activate a passive database copy on a server with 50 mounted databases or when adding a passive database copy to a server that already has 50 database copies assigned.

 

How to Correct this Condition

 

Fortunately, fixing this problem is easy: re-enter your Enterprise Edition product key on the server and then restart the Microsoft Exchange Information Store service.
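For example, from the Exchange Management Shell (a minimal sketch; TEST-MBX-0 is the example server name and the product key shown is a placeholder for your actual Enterprise Edition key):

# Re-enter the Enterprise Edition product key, then restart the Information Store.
# Note: restarting MSExchangeIS dismounts any databases hosted on this server, so plan accordingly.
Set-ExchangeServer -Identity TEST-MBX-0 -ProductKey XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
Restart-Service MSExchangeIS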

 

Moving Forward

 

For the immediate future, we expect this problem to exist when upgrading from Exchange Server 2013 RTM or CU1 to any future cumulative update. However, once the server has been re-licensed on CU2 or newer, the condition remains resolved for subsequent upgrades.

Office365: POP and IMAP clients receive OWA links for calendar invitations

$
0
0

As you may know, Office 365 supports a number of different client protocols, including POP and IMAP, and a wide variety of POP and IMAP clients. By default, POP and IMAP clients are configured to use Outlook Web App (OWA) for handling calendar invitations. When these clients receive a meeting request, the body of the invitation contains a link; clicking the link opens the user's mailbox in OWA so the user can accept or decline the request.

 

If you are using a POP or IMAP client that is capable of handling iCalendar (iCal) messages, you may want to change your configuration settings so that the client receives the calendar invitation itself and you get a better experience.

 

You can configure – on a per-mailbox basis – how POP and IMAP clients receive calendar appointments.  Get-CASMailbox can be used to view your current settings.

 

Get-CASMailbox -Identity <NAME> | fl name,*pop*,*imap*

Name                                    : administrator
ExternalPopSettings                     :
InternalPopSettings                     :
PopEnabled                              : True
PopUseProtocolDefaults                  : True
PopMessagesRetrievalMimeFormat          : BestBodyFormat
PopEnableExactRFC822Size                : False
PopSuppressReadReceipt                  : False
PopForceICalForCalendarRetrievalOption  : False
ExternalImapSettings                    :
InternalImapSettings                    :
ImapEnabled                             : True
ImapUseProtocolDefaults                 : True
ImapMessagesRetrievalMimeFormat         : BestBodyFormat
ImapEnableExactRFC822Size               : False
ImapSuppressReadReceipt                 : False
ImapForceICalForCalendarRetrievalOption : False

In the output above, note the two attributes that need to be set to True to enable iCal support. The *ForceICalForCalendarRetrievalOption attributes specify that calendar appointments should be delivered to the client in iCal format.

 

Also note the *UseProtocolDefaults attributes. These must be set to False in order for any changes to the *ForceICalForCalendarRetrievalOption attributes to take effect. You can use Set-CASMailbox to change these settings:

 

Set-CASMailbox -Identity <NAME> -PopUseProtocolDefaults:$FALSE -ImapUseProtocolDefaults:$FALSE -PopForceICalForCalendarRetrievalOption:$TRUE -ImapForceICalForCalendarRetrievalOption:$TRUE

 

Running Get-CASMailbox again confirms the updated values:

Name                                    : administrator
ExternalPopSettings                     :
InternalPopSettings                     :
PopEnabled                              : True
PopUseProtocolDefaults                  : False
PopMessagesRetrievalMimeFormat          : BestBodyFormat
PopEnableExactRFC822Size                : False
PopSuppressReadReceipt                  : False
PopForceICalForCalendarRetrievalOption  : True
ExternalImapSettings                    :
InternalImapSettings                    :
ImapEnabled                             : True
ImapUseProtocolDefaults                 : False
ImapMessagesRetrievalMimeFormat         : BestBodyFormat
ImapEnableExactRFC822Size               : False
ImapSuppressReadReceipt                 : False
ImapForceICalForCalendarRetrievalOption : True

The settings described here are per-mailbox settings.  There is no global setting to change the default for the entire tenant. There is also no method to adjust the settings per-client.
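Because there is no tenant-wide setting, applying this change broadly means looping over the mailboxes themselves. The following is a minimal sketch that assumes a connected Exchange Online PowerShell session and that you really do want the change on every user mailbox:

# Enable iCal retrieval for POP and IMAP clients on all user mailboxes.
Get-Mailbox -ResultSize Unlimited -RecipientTypeDetails UserMailbox |
    Set-CASMailbox -PopUseProtocolDefaults:$false -ImapUseProtocolDefaults:$false -PopForceICalForCalendarRetrievalOption:$true -ImapForceICalForCalendarRetrievalOption:$true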

Exchange 2013: Health Manager service may not reliably start after server boot.

$
0
0

In Exchange 2013, we introduced managed availability. The managed availability process runs within the Microsoft Exchange Health Manager service (MSExchangeHMHost.exe).

 

After a server boot, it has been observed that the Microsoft Exchange Health Manager service sometimes fails to start automatically. Attempts to start the service manually are successful. When reviewing the System log, the following events are present:

 

Log Name:      System
Source:        Service Control Manager
Date:          11/13/2013 9:23:47 AM
Event ID:      7009
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-1.domain.com
Description:
A timeout was reached (30000 milliseconds) while waiting for the Microsoft Exchange Health Manager service to connect.

Log Name:      System
Source:        Service Control Manager
Date:          11/13/2013 9:23:47 AM
Event ID:      7000
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-1.domain.com
Description:
The Microsoft Exchange Health Manager service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.

 

To correct the issue, change the startup type of the service from Automatic to Automatic (Delayed Start), as shown below.

 

image
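The same change can also be made from an elevated prompt. This is a minimal sketch, assuming the default service name MSExchangeHM for the Microsoft Exchange Health Manager service:

# Set the Health Manager service to Automatic (Delayed Start), then verify the configuration.
sc.exe config MSExchangeHM start= delayed-auto
sc.exe qc MSExchangeHM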

 

After setting the startup type of the service to Automatic (Delayed Start), the following event is noted in the System log on boot:

 

Log Name:      System
Source:        Service Control Manager
Date:          11/13/2013 10:13:26 AM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.domain.com
Description:
The Microsoft Exchange Health Manager service entered the running state.

 

This event indicates the Microsoft Exchange Health Manager service started successfully.
