When a disk fails, MapR raises the node-level alarm
NODE_ALARM_DISK_FAILURE on the node with the failed disk (or disks). At the same time, other disks in the same storage pool as the failed disk are taken offline. You can open the MapR Control System (MCS) and click Cluster > Dashboard to see a cluster heatmap of each node and a list of alarms.
By hovering your mouse over a node in the heatmap, you can get more information about the reason for the failure. By clicking on a node, you can display node-specific information, including an alarm summary.
When you see a disk failure alarm, examine the log file at
/opt/mapr/logs/faileddisk.log and check the Failure Reason field.
Examining the Cause of Failure
In the faileddisk.log file, you will see information on the cause of failure. In the sample log output below, the failure reason is I/O error. Notice that the log file also provides instructions for removing disks and adding them back to MapR-FS.
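As a quick check, you can pull the Failure Reason field out of the log with grep. The excerpt below is hypothetical; real faileddisk.log entries contain additional fields and remediation instructions, so adjust the pattern to match what your log actually reports.

```shell
# Hypothetical faileddisk.log excerpt (illustrative only).
cat > /tmp/faileddisk.log.sample <<'EOF'
Disk           : /dev/sdd
Failure Reason : I/O error
EOF

# Pull out the failure reason; on a live node, point grep at
# /opt/mapr/logs/faileddisk.log instead of the sample file.
grep 'Failure Reason' /tmp/faileddisk.log.sample
```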
Recovering from Failures
Most software failures can be remedied by running the
fsck utility, which scans the storage pool that the disk belongs to and reports errors. For hardware failures, remove the failed disk and replace it according to the procedure in Removing and Replacing Disks.
The following table lists types of failures and recommended courses of action:
| Failure Reason | Recommended Course of Action |
| --- | --- |
| I/O time out | Increase the disk I/O timeout value. |
| No such device | Check whether the disk has been renamed. If so, re-add the disk by running the disk add command. |
| I/O error | Test the drive for possible causes. |
| CRC error | Run the fsck utility on the storage pool that contains the disk. |
| Slow disk | Test the drive for possible causes. |
| GUID of disk mismatches the recorded GUID | It is possible that disk names have changed. After a node restart, the operating system can reassign the drive labels. |
| Unknown error | Contact MapR support. |
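The table above can be sketched as a small triage helper. The reason strings and messages here are illustrative, not an official mapping; match them against what your faileddisk.log actually reports.

```shell
# Map a failure reason string to the recommended course of action
# from the table above (strings are illustrative).
triage() {
  case "$1" in
    "I/O time out")          echo "Increase the disk I/O timeout." ;;
    "No such device")        echo "Check for a renamed disk; re-add it with 'disk add'." ;;
    "I/O error"|"Slow disk") echo "Test the drive for possible causes." ;;
    "CRC error")             echo "Run fsck on the affected storage pool." ;;
    *)                       echo "Contact MapR support." ;;
  esac
}

triage "CRC error"
```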
Addressing Data Alarms
When a disk fails, data on that disk becomes unavailable. As a result, you will probably see one of these two data alarms along with a Disk Failure alarm:
- Data Unavailable (VOLUME_ALARM_DATA_UNAVAILABLE) - if there was only one copy of data and it was on the failed disk, or if data was replicated more than once but all disks with that data failed
- Data Under Replicated (VOLUME_ALARM_DATA_UNDER_REPLICATED) - if data on the failed disk is replicated elsewhere, but the minimum replication factor is not met as a result of the failed disk
If you see a Data Unavailable volume alarm in the cluster, follow these steps to run the
/opt/mapr/server/fsck utility on all the offline storage pools. On each node in the cluster that has raised a disk failure alarm:
Run the following command to identify which storage pools are offline:
[user@host] /opt/mapr/server/mrconfig sp list | grep Offline
For each storage pool reported by the previous command, run the following command, where <sp> specifies the name of an offline storage pool:
[user@host] /opt/mapr/server/fsck -n <sp> -r
When you run it with the -r option, the fsck utility identifies corrupt blocks and removes them. If there are no corrupt blocks, fsck clears the error condition so you can bring the storage pool back online.
Verify that all Data Unavailable volume alarms are cleared. If Data Unavailable volume alarms persist, contact MapR support or post on answers.mapr.com.
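The steps above can be sketched as a loop over the offline storage pools. The mrconfig output below is hypothetical, and the field layout varies between versions, so verify the parsing against your own sp list output before reusing it.

```shell
# Hypothetical 'mrconfig sp list' output with one offline pool.
splist='SP 0: name SP1, Online, size 118111 MB, free 111222 MB, path /dev/sdb
SP 1: name SP2, Offline, size 118111 MB, free 0 MB, path /dev/sdd'

# Extract the names of offline storage pools: the text after "name ",
# up to the next comma, on lines containing "Offline".
offline=$(printf '%s\n' "$splist" | grep Offline | sed 's/.*name \([^,]*\),.*/\1/')
echo "$offline"

# Print the fsck command that would be run for each offline pool;
# drop the 'echo' to actually run it on a MapR node.
for sp in $offline; do
  echo /opt/mapr/server/fsck -n "$sp" -r
done
```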
If there are any Data Under Replicated volume alarms in the cluster, MapR can repair the problem by automatically replicating data and putting it on another disk. After you allow a reasonable amount of time for re-replication, verify that the under-replication alarms are cleared.
Running the /opt/mapr/server/fsck utility with the -r option produces different results depending on the scenario. The fsck utility does not interpret the scenario, nor does it have a safe mode.
- If a disk is offline because of an imbalanced b-tree, using fsck -r may result in data loss from bad containers, and in permanent data loss if additional replicas are unavailable.
- If a disk is offline because of an I/O error, using fsck -r produces indeterminate results. A disk that is throwing I/O errors is questionable in terms of data content and reliability. For example, an operation that completed on the disk but was never acknowledged may have left partial data behind. Using fsck -r retains any partial data.
- If a disk is offline because of slow I/O, using fsck -r does not produce data loss.
The most conservative usage is to first run fsck without the -r option (verification mode) and check the output. If the output is clean, then run fsck with the -r option.
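The conservative pattern above can be sketched as a small wrapper. The fsck path is taken as a parameter so the control flow can be exercised without a MapR install, and the sketch assumes fsck returns a nonzero exit status when verification finds problems; confirm that behavior for your MapR version before relying on it.

```shell
# Verify a storage pool first; run the repair pass only if the
# verification pass reports no problems.
sp_check_and_repair() {
  sp="$1"
  fsck_bin="${2:-/opt/mapr/server/fsck}"   # parameterized for testing
  if "$fsck_bin" -n "$sp"; then            # verification mode (no -r)
    "$fsck_bin" -n "$sp" -r                # repair mode
  else
    echo "verification failed for $sp; inspect output before using -r" >&2
    return 1
  fi
}
```

Called as sp_check_and_repair SP2, this runs the verification pass and, only if it succeeds, the repair pass.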
Removing and Replacing Disks
If a disk fails due to a hardware problem, you will need to remove the disk. You can replace it, and then add that disk back to MapR-FS along with the other disks that were automatically removed at the same time.
To remove and replace disks using the MapR command-line interface:
- On the node with failed disks, determine which disk to replace by examining the Disk entries in the faileddisk.log file. In the sample log file above, the failed disk is /dev/sdd.
- Use the disk remove command to remove the disk. Run the following command, substituting the hostname or IP address for <host> and a list of failed disks for <disks>. The disk removal process can take several minutes to complete.
- Examine the screen output in response to this command. Note the additional disks removed when /dev/sdd is removed: other disks, such as /dev/sdf, are part of the same storage pool and are therefore removed along with the failed disk.
- Confirm that the removed disks no longer appear in the node's disk list.
- Replace the failed disks on the node or nodes, following correct procedures for your hardware.
- Remove the failed disk log file from the /opt/mapr/logs directory.
- Use the disk add command to add the replacement disk (or disks) along with the other disks from the same storage pool or pools.
Note that when the failed disk has been replaced, it should be added back to MapR-FS along with the other disks from the same storage pool that were previously removed. If you add only the replacement disk to MapR-FS, this will result in a non-optimal storage pool layout which can lead to degraded performance.
Once the disks are added to MapR-FS, the cluster allocates properly-sized storage pools automatically. For example, if you add ten disks, MapR allocates two storage pools of three disks each and two storage pools of two disks each.
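The CLI round trip described above can be sketched as follows, using the sample disks from this section (/dev/sdd failed, /dev/sdf from the same storage pool). The run helper just prints each command as a dry run; drop it to execute for real. <host> is a placeholder for the node's hostname or IP, and the disk list is illustrative.

```shell
run() { echo "+ $*"; }     # dry-run helper: print instead of execute

host="<host>"              # hostname or IP of the node with the failed disk
disks="/dev/sdd,/dev/sdf"  # the failed disk plus its storage-pool mates

# 1. Remove the failed disk; disks from the same pool go with it.
run maprcli disk remove -host "$host" -disks "$disks"

# 2. Physically replace the drive, then clear the old failure log.
run rm /opt/mapr/logs/faileddisk.log

# 3. Re-add ALL disks from the pool, not just the replacement, so the
#    cluster can lay out properly sized storage pools.
run maprcli disk add -host "$host" -disks "$disks"
```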
To remove and replace disks using the MapR Control System:
- Identify the failed disk or disks:
- Remove the failed disk or disks from MapR-FS:
- In the MapR-FS and Available Disks pane, select the checkboxes beside the failed disks.
- Click the Remove Disk(s) from MapR-FS button.
- In the dialog box that opens, click OK.
- Wait several minutes while the removal process completes. After you remove the disks, the offline disks from the same storage pools are marked as available (not in use by MapR).
- From a command-line terminal, remove the failed disk log file from the /opt/mapr/logs directory.
- Replace the failed disks on the node or nodes according to the correct hardware procedure.
- Add the replacement and available disks to MapR-FS:
- In the Navigation pane, expand the Cluster group and click the Nodes view.
- Click the name of the node on which you replaced the disks.
- In the MapR-FS and Available Disks pane, click the checkboxes beside the disks you want to add to the storage pool.
- Click the Add Disk(s) to MapR-FS button.
- When the confirmation dialog box appears, click OK.
- Note that the display shows MapR-FS under the File System column, which indicates that the disks were successfully added.