Dell MD Array Impending Physical Disk Failure (High Data Availability Risk)

Dell.MDStorageArray.ABBXMLEvent24 (Rule)

Knowledge Base article:

Summary

Impending Physical Disk Failure (High Data Availability Risk)

The causes and resolutions below are taken from the Recovery Guru in Dell Modular Disk Storage Manager. Launch Dell Modular Disk Storage Manager to diagnose and fix the failure as follows:

Causes

A physical disk is reporting internal errors that could cause the physical disk to fail. The Recovery Guru Details area provides specific information you will need as you follow the recovery steps.

Caution: Risk of Data Loss. This problem needs to be resolved immediately. Data loss will occur if the indicated physical disk fails before you follow these recovery steps.

Caution: Electrostatic discharge can damage sensitive components. Always use proper antistatic protection when handling components. Touching the components without using a proper ground may damage the equipment.

Important Notes

Use the current status and RAID level of the affected virtual disks to choose a recovery procedure:

  • Optimal RAID 0: go to 'Recovering RAID 0'.

  • Degraded RAID 1, 5, 6, or 10: go to 'Recovering Degraded Virtual Disks'.

  • RAID 1, 5, 6, or 10 with a hot spare physical disk currently being reconstructed: go to 'Recovering with a Reconstructing Hot Spare'.
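The selection rule above can be sketched as a small lookup. This is only an illustration: the function name, its parameters, and the input normalisation are assumptions, and the choice to check the hot spare case first is also an assumption, made because that row of the table does not name a virtual disk status. Combinations the table does not list return None.

```python
from typing import Optional

# RAID levels that the table groups together as redundant levels.
REDUNDANT_LEVELS = {"1", "5", "6", "10"}


def select_procedure(raid_level: str, status: str,
                     hot_spare_reconstructing: bool = False) -> Optional[str]:
    """Return the section title to follow, or None if the table has no match.

    Hypothetical helper: the section titles come from the article, the
    function itself does not exist in any Dell tool.
    """
    # Accept "RAID 5", "raid5" or plain "5".
    level = raid_level.upper().replace("RAID", "").strip()
    state = status.strip().lower()

    if level == "0" and state == "optimal":
        return "Recovering RAID 0"
    if level in REDUNDANT_LEVELS:
        # Assumption: a reconstructing hot spare takes precedence over status.
        if hot_spare_reconstructing:
            return "Recovering with a Reconstructing Hot Spare"
        if state == "degraded":
            return "Recovering Degraded Virtual Disks"
    return None


choice = select_procedure("RAID 5", "Degraded")
```

Returning None rather than guessing keeps the sketch honest about cases the table leaves open, such as an Optimal RAID 5 virtual disk with no hot spare activity.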

Recovering RAID 0

Use the following procedure if the affected virtual disks are RAID 0.

Resolutions

1. Stop all I/O to the affected virtual disks.

2. Back up all data on the affected virtual disks (step 5 will destroy all data on the affected virtual disks).

   Note: To the operating system (OS), a failed virtual disk looks exactly like a failed non-RAID physical disk. Refer to the OS documentation for any special requirements concerning failed physical disks and perform them where necessary.

3. If any of the affected virtual disks are also source or target virtual disks in a copy operation that is either Pending or In Progress, perform the following steps to stop the copy operation:

   a. Go to the Copy Manager by selecting the Copy Services > Virtual Disk Copy > Manage Copies... menu option in the Array Management Window (AMW).

   b. Highlight each copy pair that contains an affected virtual disk.

   c. Select the Copy > Stop menu option.

   d. Go to step 4.

   If none of the affected virtual disks is a source or target virtual disk in a copy operation that is either Pending or In Progress, go to step 4.

4. If you have snapshot (legacy) virtual disks associated with the affected virtual disks, these snapshot virtual disks will no longer be valid once you fail the physical disk in step 5. Perform any necessary operations (such as backup) on the snapshot virtual disks and then delete them.

5. Caution: Risk of Data Loss. The data on the affected virtual disks will be lost once you perform this step. Be sure you have backed up your data before performing this step.

   Highlight the affected physical disk on the Hardware tab in the AMW and select the Hardware > Physical Disk > Advanced > Fail... menu option. The affected virtual disks become Failed.

6. Remove the failed physical disk (its fault indicator light should be on).

7. Wait 30 seconds, then insert the new physical disk. Its fault indicator light may be lit for a short time (one minute or less).

   Note: Wait until the new physical disk is ready (its fault indicator light must be off) before attempting to initialize the virtual disks in step 8.

8. Highlight the disk group associated with the replaced physical disk on the Storage and Copy Services tab in the AMW and select the Storage > Disk Group > Advanced > Initialize... menu option.

  • The virtual disks in the disk group are initialized, one at a time.

  • To monitor initialization progress for a virtual disk, on the Storage and Copy Services tab in the AMW, highlight the virtual disk and view the progress in the Properties pane. Once the operation has completed, the progress bar is no longer displayed in the Properties pane.

  • When initialization is completed, all virtual disks in the disk group are Optimal.

   Note: Save this procedure by selecting Save As. Once you perform step 9 and the failure is fixed, you will no longer be able to access the information in steps 10 through 13 from the Recovery Guru.

9. Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your Technical Support Representative. Otherwise, go to step 10.

10. Add the affected virtual disks back to the operating system. You may need to reboot the system to see the re-initialized virtual disks.

   Note: Do not start I/O to these virtual disks until after you restore from backup.

11. Restore the data for the affected virtual disks from backup.

12. If desired, re-create any snapshots that you deleted in step 4.

13. If desired, re-create any copies you stopped by highlighting the copy pairs in the Copy Manager and selecting the Copy > Re-Copy menu option.
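Several steps in this procedure are "wait until" conditions, for example waiting for the fault indicator light of the new physical disk to turn off before initializing the disk group. The procedure itself is carried out by hand in the AMW, and this article does not describe any programmatic way to read an indicator light, so the following is only a generic polling sketch. The helper name wait_until and the stand-in condition are hypothetical.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll condition() until it returns True or timeout_s has elapsed.

    Returns True if the condition was met, False on timeout. The sleep and
    clock parameters exist only so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in for "the fault indicator light is off": in this made-up sequence
# the light is reported as off on the third poll.
readings = iter([False, False, True])
ready = wait_until(lambda: next(readings), timeout_s=60,
                   interval_s=0, sleep=lambda seconds: None)
```

A timeout matters here because the procedure expects the light to go out within about a minute. A check that never succeeds is a signal to stop and investigate rather than to proceed to initialization.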

Recovering Degraded Virtual Disks

Use the following procedure if the affected virtual disks are degraded RAID 1, 5, 6, or 10. This procedure applies to both disk groups and disk pools. You will need two replacement physical disks for this procedure.

Caution: Risk of Data Loss. An impending physical disk failure means that the affected physical disk is likely to fail. If it fails while you are replacing the physical disk that has already failed on this disk pool or disk group (see steps 3 and 4 below), you will lose all data on the affected virtual disks.

1. Although it is not required, you should stop all I/O to the affected virtual disks to reduce the possibility of data loss.

2. Although it is not required, you should back up all data on the affected virtual disks.

3. Remove the failed physical disk. The fault indicator light for the physical disk should be on.

   Note: The Service Action Allowed status in the Details area is always NO for this problem because the component is not yet failed. In this situation, it is acceptable to remove the component even though the Service Action Allowed is NO.

4. Wait 30 seconds, then insert the new physical disk.

  • Data reconstruction should begin on the new physical disk (the activity indicator lights of the other physical disks in the disk pool or disk group will start blinking).

  • The icon for each virtual disk in the affected disk pool or disk group changes to Optimal after the virtual disk is reconstructed.

  • To monitor the progress of reconstruction on the affected virtual disks or change the reconstruction rate, on the Storage and Copy Services tab in the AMW, highlight the virtual disk and view the progress in the Properties pane. Once the operation has completed, the progress bar is no longer displayed in the Properties pane.

  • If the physical disk with the Impending Physical Disk Failure status fails while data is being reconstructed on the replaced physical disk, all affected virtual disks will fail, and all data will be lost. If this occurs, click the Recheck button to rerun the Recovery Guru and fix the errors reported.

5. Wait until all affected virtual disks have returned to an Optimal status. Resume I/O to the affected virtual disks, if you stopped it in step 1.

6. Highlight the physical disk with the Impending Physical Disk Failure status on the Hardware tab in the AMW and select the Hardware > Physical Disk > Advanced > Fail... menu option. The virtual disks in the disk pool or disk group return to a Degraded state.

7. Remove the failed physical disk (its fault indicator light should be on).

   Note: The Service Action (removal) Allowed (SAA) status in the Details area is always NO for this problem because the component is not yet failed. In this situation, it is acceptable to remove the component even though the SAA is NO.

8. Wait 30 seconds, then insert the new physical disk.

9. Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your Technical Support Representative.

Recovering with a Reconstructing Hot Spare

Use the following procedure if the affected virtual disks are RAID 1, 5, 6, or 10 and a hot spare physical disk is currently being reconstructed.

Caution: Risk of Data Loss. An Impending Physical Disk Failure means that the affected physical disk is likely to fail. If it fails while the hot spare is reconstructing, you will lose all data on the affected virtual disks. For this reason, you should stop all I/O to the affected virtual disks and back up all data on the affected virtual disks before replacing the physical disks.

1. Although it is not required, you should stop all I/O to the affected virtual disks to reduce the possibility of data loss.

2. Although it is not required, you should back up all data on the affected virtual disks.

3. Wait for the hot spare physical disk to finish reconstructing.

  • All virtual disk icons will be Optimal when reconstruction is completed.

  • If the physical disk with the Impending Physical Disk Failure status fails while the hot spare is being reconstructed, the affected virtual disks will fail, and all data will be lost. If this occurs, rerun the Recovery Guru and fix the errors reported, starting with the physical disk with the Impending Physical Disk Failure.

4. Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your Technical Support Representative.

Element properties:

Target           Microsoft.SystemCenter.ManagementServer
Category         Alert
Enabled          True
Alert Generate   True
Alert Severity   Warning
Alert Priority   Normal
Remotable        True
Alert Message    Dell MD Array Impending Physical Disk Failure (High Data Availability Risk)
                 {0}
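The Alert Severity and Alert Priority shown here as Warning and Normal appear in the rule source further down as the numeric values 1 and 1. The sketch below decodes those numbers. Only the two pairs 1 = Warning and 1 = Normal are confirmed by this article; the remaining entries follow the usual Operations Manager convention as far as I know and should be treated as assumptions. The function name is hypothetical.

```python
# Numeric codes as used by the alert-generating write action.
# Confirmed by this article: severity 1 -> Warning, priority 1 -> Normal.
# The other entries are assumptions based on the usual convention.
SEVERITY = {0: "Information", 1: "Warning", 2: "Critical"}
PRIORITY = {0: "Low", 1: "Normal", 2: "High"}


def describe_alert(severity: int, priority: int) -> str:
    """Render the numeric codes the way the properties table displays them."""
    severity_name = SEVERITY.get(severity, "Unknown")
    priority_name = PRIORITY.get(priority, "Unknown")
    return f"Alert Severity={severity_name}, Alert Priority={priority_name}"


summary = describe_alert(severity=1, priority=1)
```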

Member Modules:

ID          Module Type   TypeId                                                  RunAs
DS          DataSource    Microsoft.Windows.ScriptGenerated.EventProvider         Default
Alert       WriteAction   System.Health.GenerateAlert                             Default
WriteToDW   WriteAction   Microsoft.SystemCenter.DataWarehouse.PublishEventData   Default

Source Code:

<Rule ID="Dell.MDStorageArray.ABBXMLEvent24" Enabled="onEssentialMonitoring" Target="SystemCenter!Microsoft.SystemCenter.ManagementServer" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>Alert</Category>
  <DataSources>
    <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.ScriptGenerated.EventProvider">
      <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
      <ScriptName>RBODEventGenerator</ScriptName>
      <EventNumber>24</EventNumber>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="Alert" TypeID="SystemHealth!System.Health.GenerateAlert">
      <Priority>1</Priority>
      <Severity>1</Severity>
      <AlertMessageId>$MPElement[Name="Dell.MDStorageArray.ABBXMLEvent24.StringResource"]$</AlertMessageId>
      <AlertParameters>
        <AlertParameter1>$Data/EventDescription$</AlertParameter1>
      </AlertParameters>
      <Suppression>
        <SuppressionValue>$Data/EventDisplayNumber$</SuppressionValue>
        <SuppressionValue>$Data/Channel$</SuppressionValue>
        <SuppressionValue>$Data/PublisherName$</SuppressionValue>
        <SuppressionValue>$Data/LoggingComputer$</SuppressionValue>
        <SuppressionValue>$Data/EventCategory$</SuppressionValue>
        <SuppressionValue>$Data/EventLevel$</SuppressionValue>
        <SuppressionValue>$Data/UserName$</SuppressionValue>
        <SuppressionValue>$Data/EventNumber$</SuppressionValue>
        <SuppressionValue>$Data/EventDescription$</SuppressionValue>
      </Suppression>
      <Custom1/>
      <Custom2/>
      <Custom3/>
      <Custom4/>
      <Custom5/>
      <Custom6/>
      <Custom7/>
      <Custom8/>
      <Custom9/>
      <Custom10/>
    </WriteAction>
    <WriteAction ID="WriteToDW" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishEventData"/>
  </WriteActions>
</Rule>
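To make the structure of the rule above easier to inspect, the following sketch parses an abbreviated copy of it with Python's standard xml.etree.ElementTree module and pulls out the event number, the alert codes, and the suppression fields. It also builds a suppression key from those fields, on the understanding that events with identical suppression values are treated as repeats of one alert rather than as new alerts. The function names and the sample event dictionaries are hypothetical, and the XML is shortened (ComputerName, AlertMessageId, AlertParameters, and the Custom fields are left out).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the rule from this article.
RULE_XML = """
<Rule ID="Dell.MDStorageArray.ABBXMLEvent24" Enabled="onEssentialMonitoring"
      Target="SystemCenter!Microsoft.SystemCenter.ManagementServer">
  <DataSources>
    <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.ScriptGenerated.EventProvider">
      <ScriptName>RBODEventGenerator</ScriptName>
      <EventNumber>24</EventNumber>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="Alert" TypeID="SystemHealth!System.Health.GenerateAlert">
      <Priority>1</Priority>
      <Severity>1</Severity>
      <Suppression>
        <SuppressionValue>$Data/EventDisplayNumber$</SuppressionValue>
        <SuppressionValue>$Data/Channel$</SuppressionValue>
        <SuppressionValue>$Data/PublisherName$</SuppressionValue>
        <SuppressionValue>$Data/LoggingComputer$</SuppressionValue>
        <SuppressionValue>$Data/EventCategory$</SuppressionValue>
        <SuppressionValue>$Data/EventLevel$</SuppressionValue>
        <SuppressionValue>$Data/UserName$</SuppressionValue>
        <SuppressionValue>$Data/EventNumber$</SuppressionValue>
        <SuppressionValue>$Data/EventDescription$</SuppressionValue>
      </Suppression>
    </WriteAction>
  </WriteActions>
</Rule>
"""


def summarize_rule(xml_text: str) -> dict:
    """Extract the fields of the rule that determine when and how it alerts."""
    rule = ET.fromstring(xml_text.strip())
    alert = rule.find("./WriteActions/WriteAction[@ID='Alert']")
    # "$Data/EventNumber$" -> "EventNumber"
    fields = [value.text.strip("$").split("/", 1)[1]
              for value in alert.iter("SuppressionValue")]
    return {
        "rule_id": rule.get("ID"),
        "event_number": rule.findtext("./DataSources/DataSource/EventNumber"),
        "severity": int(alert.findtext("Severity")),
        "priority": int(alert.findtext("Priority")),
        "suppression_fields": fields,
    }


def suppression_key(event: dict, fields: list) -> tuple:
    """Tuple of the event's values for the suppression fields, in rule order."""
    return tuple(event.get(name) for name in fields)


info = summarize_rule(RULE_XML)

# Two made-up events that differ only in their description.
first = {"EventNumber": "24", "EventDescription": "physical disk A"}
second = dict(first, EventDescription="physical disk B")
same_alert = (suppression_key(first, info["suppression_fields"])
              == suppression_key(second, info["suppression_fields"]))
```

Because EventDescription is one of the suppression values, two events with different descriptions produce different keys in this sketch, so same_alert is False for the sample events.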