RAID Consistency Check
RAID controllers provide a variety of options for testing the consistency of a RAID set. The objective of such tests is the early detection of parity and block errors.
Generally, all of the associated blocks on the affected hard disks will be read. If individual read errors (bad blocks) occur during this testing process (and sufficient redundant data is available), these blocks can be re-written with the correct data (such as by means of 3ware’s Dynamic Sector Repair feature). When re-writing the data from the affected data block on the hard disk, the hard disk’s firmware will replace the erroneous sector with another free sector (reserve sector). Additional information about this can be found on Wikipedia at http://en.wikipedia.org/wiki/Bad_sector
3ware RAID Controller
3ware provides the verify feature for testing RAID arrays. While this feature is running, the performance of the RAID array is reduced. The effect on performance can be configured by varying the Background Task Rate setting.
- 3ware recommends running a verify process at least weekly.
- If a RAID unit has not yet been initialized, starting the verify feature automatically runs the initialization instead of the test. After the initialization has completed, the verify feature must be started again. The initialization status of a RAID unit can be determined by issuing the tw_cli /c0/u0 show all command from the command line interface (CLI).
- Additional, detailed information can be found in the 3ware RAID Controller User Guide, in the sections About Initialization and About Verification.
Example:
[root@testserver ~]# tw_cli /c0/u0 start verify
Sending start verify message to /c0/u0 ... Done.

[root@testserver ~]# tw_cli /c0/u0 show all
/c0/u0 status = VERIFYING
/c0/u0 is not rebuilding, its current state is VERIFYING
/c0/u0 is verifying with percent completion = 1
/c0/u0 is initialized.
/c0/u0 volume(s) = 1
/c0/u0 name =
/c0/u0 serial number = 5ND29QM781616400022D
/c0/u0 Storsave Policy = protect
/c0/u0 Command Queuing Policy = off

Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
-----------------------------------------------------------------------
u0       RAID-1    VERIFYING      1      -     -       232.82    488259584
u0-0     DISK      OK             -      p1    -       232.82    488259584
u0-1     DISK      OK             -      p0    -       232.82    488259584

Parameter index does not exist
[root@testserver ~]#
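For monitoring scripts, the status lines printed by tw_cli /c0/u0 show all can be picked apart with a few regular expressions. The following is a minimal sketch that works on text copied from the example output; the function name parse_unit_status is hypothetical, and the line layout is assumed to match the example, which may differ between tw_cli versions.

```python
import re

# Status lines copied from the example output; other tw_cli versions
# may word or arrange them differently (assumption).
SAMPLE = """\
/c0/u0 status = VERIFYING
/c0/u0 is not rebuilding, its current state is VERIFYING
/c0/u0 is verifying with percent completion = 1
/c0/u0 is initialized.
"""


def parse_unit_status(text):
    """Return (status, percent_complete, initialized) from 'show all' text.

    Hypothetical helper: it only inspects text, it does not call tw_cli.
    """
    status = re.search(r"^\S+ status = (\S+)", text, re.MULTILINE)
    percent = re.search(r"percent completion = (\d+)", text)
    initialized = re.search(r"^\S+ is initialized\.", text, re.MULTILINE)
    return (
        status.group(1) if status else None,
        int(percent.group(1)) if percent else None,
        initialized is not None,
    )


print(parse_unit_status(SAMPLE))  # ('VERIFYING', 1, True)
```

If the unit is reported as not initialized, the verify has to be started again once the initialization has finished, as noted in the list above.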
Areca RAID Controller
The Consistency Check feature tests the consistency of Areca RAID arrays. The feature can check RAID3, RAID5 and RAID6 level RAID sets. In doing so, all associated data blocks will be read, parity will be calculated, the stored parity read, and finally the calculated and stored parity values will be compared. Naturally, the performance of the RAID set will be affected by this process.
Areca also recommends performing this type of check at least weekly.
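The read, recompute, and compare cycle described for the Consistency Check can be illustrated with plain XOR parity, as used by RAID 5. This is only a sketch of the principle, not Areca's firmware logic: the function names are hypothetical, the blocks are tiny byte strings, and the second parity used by RAID 6 is not covered.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, stored_parity):
    """Recompute the parity from the data blocks and compare it with the stored one."""
    return xor_parity(data_blocks) == stored_parity


# Two illustrative data blocks of one stripe and their parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
parity = xor_parity(data)

print(stripe_is_consistent(data, parity))               # True
print(stripe_is_consistent(data, b"\x00\x00\x00\x00"))  # False

# With XOR parity, an unreadable block can be rebuilt from the remaining
# blocks and the parity, which is what makes re-writing a bad block possible.
print(xor_parity([data[1], parity]) == data[0])         # True
```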
The user manual for the Areca RAID controller contains detailed information in the Consistency Check section, as does the CLI user guide.
- User Manual ARC-11XX/ARC-12XX
- List of additional user manuals for additional Areca controllers with links
- CLI User Guide
Example:
[root@testserver ~]# cli64 vsf check vol=1
Consistency checks that are in progress can also be interrupted (for example, during performance bottlenecks).
[root@testserver ~]# cli64 vsf stopcheck
Practical Experience
RAID tests were performed on an Intel SR2500 server (BIOS: 94, BMC: 64, FRUSDER: 47) with an Areca ARC-1210-4x SATA controller (BIOS: v1.21, Firmware: 1.46) including a battery backup module, and three hard disks (ST3200826AS). A 20-gigabyte RAID 5 volume was checked. The process took five minutes and twenty-five seconds, which corresponds to a test throughput of roughly sixty-three megabytes per second.
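The throughput figure can be reproduced from the volume size and the duration. The short calculation below assumes that "20 gigabytes" means 20 × 1024 megabytes; with decimal units (20,000 megabytes) the result would be about 61.5 megabytes per second instead. The function name is hypothetical.

```python
def throughput_mib_per_s(size_gib, minutes, seconds):
    """Average throughput for checking size_gib in the given time (binary units assumed)."""
    elapsed_seconds = minutes * 60 + seconds
    return size_gib * 1024 / elapsed_seconds


rate = throughput_mib_per_s(20, 5, 25)  # 20480 MiB in 325 seconds
print(f"{rate:.1f} MiB/s")              # 63.0 MiB/s
```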
Adaptec RAID Controller
With the Adaptec controller, the verify feature tests the disk media.
According to the Adaptec CLI Users Guide v 5.20, the following logical drive tasks are available:
- verify_fix (Verify with fix) - verifies the disk media and repairs the disk if bad data is found
- verify - verifies the disk media
Example:
[root@testserver root]# ./arcconf TASK
Usage: TASK START <Controller#> LOGICALDRIVE <LogicalDrive#> <task> [noprompt]
Usage: TASK STOP <Controller#> LOGICALDRIVE <LogicalDrive#>
Usage: TASK START <Controller#> DEVICE <Channel# ID#> <task> [noprompt]
Usage: TASK STOP <Controller#> DEVICE <Channel# ID#>
======================================================
Performs a task on a logical or physical device
Task : Task to be started or performed.
LogicalDrive# : logical device ID on which task is to be performed
Logical Tasks : verify_fix (Verify with fix) verify clear
Channel# ID# : The Channel and ID of the physical device on which task is to be performed. Optionally ALL indicates all ready drives for initialize task only (ex. ARCCONF TASK START 1 DEVICE ALL INITIALIZE).
Physical Tasks : verify_fix verify clear initialize secureerase

[root@testserver root]# ./arcconf TASK START 1 LOGICALDRIVE 0 verify
Controllers found: 1
Verify of a Logical Device is a long process.
Are you sure you want to continue?
Press y, then ENTER to continue or press ENTER to abort: y
Command completed successfully.

[root@testserver root]# ./arcconf GETSTATUS 1
Controllers found: 1
Logical device Task:
Logical device : 0
Task ID : 101
Current operation : Verify
Status : In Progress
Priority : High
Percentage complete : 1
Command completed successfully.
[root@testserver root]#
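To follow the progress of a verify task from a script, the "name : value" lines of the arcconf GETSTATUS output can be collected into a dictionary. The sketch below works on text copied from the example; the helper name parse_task_status is hypothetical, and the exact spacing of real arcconf output is an assumption (the parser only relies on a " : " separator).

```python
# Text copied from the GETSTATUS example; real output may be indented
# or padded differently (assumption).
SAMPLE = """\
Controllers found: 1
Logical device Task:
Logical device : 0
Task ID : 101
Current operation : Verify
Status : In Progress
Priority : High
Percentage complete : 1
Command completed successfully.
"""


def parse_task_status(text):
    """Collect 'name : value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        name, separator, value = line.partition(" : ")
        if separator:  # skip lines without the ' : ' separator
            fields[name.strip()] = value.strip()
    return fields


task = parse_task_status(SAMPLE)
print(task["Current operation"], task["Status"], task["Percentage complete"])
```

Lines such as "Controllers found: 1" are skipped on purpose because they do not use the " : " separator with surrounding spaces.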
The following document contains additional, detailed information:
Service and Maintenance of Adaptec RAID Solutions
Hard Disks without RAID Controller
Tests for block errors can be performed on hard disks that are managed directly by the main board (without a RAID controller) with the help of SMART tools. For example, the smartmontools can be used under Linux. Wikipedia provides a comparison of additional SMART Tools.
The following example is the result of a test using smartmontools under Linux.
[root@testserver ~]# smartctl -t long /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 115 minutes for test to complete.
Test will complete after Fri Apr 11 12:34:29 2008

Use smartctl -X to abort test.
[root@testserver ~]#
After the test has been performed, the result can be requested.
[root@testserver ~]# smartctl -l selftest /dev/sda
smartctl version 5.38 [i386-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       112         -
[root@testserver ~]#
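When the self-test log is evaluated automatically, the interesting parts of each entry are the status text and the LBA of the first error. The following sketch parses one log line copied from the example; the function name parse_selftest_entry is hypothetical. Because the description and status columns cannot be separated reliably without knowing the column widths, they are kept together as one text field.

```python
# Log entry copied from the example output above.
SAMPLE_LINE = "# 1 Extended offline Completed without error 00% 112 -"


def parse_selftest_entry(line):
    """Parse one '# n ...' line of 'smartctl -l selftest' (hypothetical helper).

    Returns None for lines that are not log entries, such as the header.
    """
    line = line.strip()
    if not line.startswith("#"):
        return None
    tokens = line[1:].split()
    if len(tokens) < 5:
        return None
    # The last three columns are fixed; everything between the entry
    # number and those columns is description plus status.
    remaining, lifetime, first_error_lba = tokens[-3:]
    return {
        "num": int(tokens[0]),
        "text": " ".join(tokens[1:-3]),
        "remaining_percent": int(remaining.rstrip("%")),
        "lifetime_hours": int(lifetime),
        "first_error_lba": None if first_error_lba == "-" else int(first_error_lba),
    }


entry = parse_selftest_entry(SAMPLE_LINE)
print(entry["text"])             # Extended offline Completed without error
print(entry["first_error_lba"])  # None
```

An entry with a number in the last column instead of "-" indicates the first block at which the test hit an error.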
If there are errors, correction does not require much effort. Additional information about this can be found at the Smartmontools Project: Bad Block HowTo (see also Analysis of a hard disk with bad blocks using smartctl).
Note: With some SATA hard disks, executing smartctl may also require one of the following flags: -d ata or -d sat (regarding this, see also http://smartmontools.sourceforge.net/#testinghelp).
Author: Werner Fischer
Werner Fischer, working in the Knowledge Transfer team at Thomas-Krenn, completed his studies of Computer and Media Security at FH Hagenberg in Austria. He is a regular speaker at many conferences like LinuxTag, OSMC, OSDC, LinuxCon, and author for various IT magazines. In his spare time he enjoys playing the piano and training for a good result at the annual Linz marathon relay.