Linux Software RAID

From Thomas-Krenn-Wiki

Linux Software RAID (often called mdraid or MD/RAID) makes it possible to use RAID without a hardware RAID controller. The storage media (hard disks, SSDs and so forth) are simply connected to the computer as individual drives, for example via the SATA ports on the motherboard.

In contrast with software RAID, hardware RAID controllers generally have a built-in cache (often 512 MB or 1 GB), which can be protected by a BBU or ZMCP. With both hardware and software RAID arrays, the write caches of the hard disks themselves should be deactivated in order to avoid data loss during power failures. The exception are SSDs with integrated capacitors, which write the contents of the cache to flash memory during a power failure (such as the Intel SSD 320 Series).

Functional Approach

Example: A Linux software RAID array with two RAID 1 devices (one for the root file system, the other for swap).
Linux software RAID supports the following RAID levels:[1]
  • RAID 0
  • RAID 1
  • RAID 4
  • RAID 5
  • RAID 6[2]
  • RAID 10

RAID Superblock

Linux software RAID stores all of the necessary information about a RAID array in a superblock. Where this information is located on the member devices depends on the metadata version.

Superblock Metadata Version 0.90

The version 0.90 superblock is 4,096 bytes long and is located in a 64 KiB-aligned block at the end of the device. Depending on the device size, the superblock therefore starts at the earliest 128 KiB and at the latest 64 KiB before the end of the device. To calculate its address, round the device size down to a multiple of 64 KiB and subtract 64 KiB from the result.[3]
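This calculation can be sketched with shell arithmetic. The device size below is taken from the 120 GB disk used in the example further down; the variable names are purely illustrative:

```shell
# Superblock address for version 0.90 metadata (a sketch):
# round the device size down to a 64 KiB boundary, then subtract 64 KiB.
DEV_SIZE=120034123776                              # device size in bytes
SB_OFFSET=$(( DEV_SIZE / 65536 * 65536 - 65536 ))  # 64 KiB = 65536 bytes
echo "superblock starts at byte $SB_OFFSET"
```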

Version 0.90 metadata limitations:

  • A maximum of 28 devices in one array
  • Each device may be at most 2 TiB in size
  • No support for bad block management[4]

Superblock Metadata Version 1.*

The position of the superblock depends on the version of the metadata:[5]

  • Version 1.0: The superblock is located at the end of the device.
  • Version 1.1: The superblock is located at the beginning of the device.
  • Version 1.2: The superblock is 4 KiB after the beginning of the device.
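For version 1.2, the 4 KiB offset matches the Super Offset of 8 sectors that mdadm -E reports further down in this article (a quick cross-check, assuming the usual 512-byte sectors):

```shell
# Version 1.2: superblock 4 KiB after the beginning of the device.
# With 512-byte sectors this corresponds to a super offset of 8 sectors.
SUPER_OFFSET_SECTORS=8
echo "$(( SUPER_OFFSET_SECTORS * 512 )) bytes"
```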

Creating a RAID Array

The following example shows the creation of a RAID 1 array on a Fedora 15 live system.

Preparing Partitions

The software RAID array spans /dev/sda1 and /dev/sdb1. Both partitions have the partition type Linux raid autodetect (fd):

[root@localhost ~]# fdisk -l /dev/sda

Disk /dev/sda: 120.0 GB, 120034123776 bytes
139 heads, 49 sectors/track, 34421 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2d0f2eb3

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    20973567    10485760   fd  Linux raid autodetect
[root@localhost ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
139 heads, 49 sectors/track, 34421 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe69ef1f5

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    20973567    10485760   fd  Linux raid autodetect
[root@localhost ~]#
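As a side note, fdisk reports partition sizes in 1 KiB blocks, so the 10485760 blocks shown above correspond to exactly 10 GiB:

```shell
# The fdisk "Blocks" column is in 1 KiB units:
BLOCKS=10485760
echo "$(( BLOCKS / 1024 / 1024 )) GiB"   # 10485760 KiB = 10 GiB
```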

Creating a RAID 1

[root@localhost ~]# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@localhost ~]#

The progress of the initialization can be monitored through the proc file system or with mdadm:

[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      10484664 blocks super 1.2 [2/2] [UU]
      [========>............]  resync = 42.3% (4440832/10484664) finish=0.4min speed=201856K/sec
      
unused devices: <none>
[root@localhost ~]#
[root@localhost ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Jul 26 07:49:50 2011
     Raid Level : raid1
     Array Size : 10484664 (10.00 GiB 10.74 GB)
  Used Dev Size : 10484664 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Jul 26 07:50:23 2011
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 62% complete

           Name : localhost.localdomain:0  (local to host localhost.localdomain)
           UUID : 3a8605c3:bf0bc5b3:823c9212:7b935117
         Events : 11

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
[root@localhost ~]# 
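For scripted monitoring, the current sync speed can be pulled out of /proc/mdstat with a small awk filter. The sketch below is fed a copy of the resync line from the output above so that it works without a live array; on a real system, replace the echo with cat /proc/mdstat:

```shell
# Extract the speed=... field from a resync status line (sketch):
echo '  [========>............]  resync = 42.3% (4440832/10484664) finish=0.4min speed=201856K/sec' |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^speed=/) print $i }'
```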

Testing the Alignment

Version 1.2 metadata is used in this example. The metadata is therefore located near the beginning of the device, with the actual data behind it, aligned to a 1 MiB boundary (Data Offset: 2048 sectors of 512 bytes each):

[root@localhost ~]# mdadm -E /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3a8605c3:bf0bc5b3:823c9212:7b935117
           Name : localhost.localdomain:0  (local to host localhost.localdomain)
  Creation Time : Tue Jul 26 07:49:50 2011
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 20969472 (10.00 GiB 10.74 GB)
     Array Size : 20969328 (10.00 GiB 10.74 GB)
  Used Dev Size : 20969328 (10.00 GiB 10.74 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 10384215:18a75991:4f09b97b:1960b8cd

    Update Time : Tue Jul 26 07:50:43 2011
       Checksum : ea435554 - correct
         Events : 18


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
[root@localhost ~]# 
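The alignment can also be verified with a little arithmetic on the numbers from the mdadm -E output above, converting the data offset from sectors to bytes and testing it against the 1 MiB boundary:

```shell
# Data offset from the output above: 2048 sectors of 512 bytes.
DATA_OFFSET_BYTES=$(( 2048 * 512 ))
echo "$DATA_OFFSET_BYTES bytes"
# A multiple of 1 MiB (1048576 bytes) means the data is 1 MiB aligned:
[ $(( DATA_OFFSET_BYTES % 1048576 )) -eq 0 ] && echo "1 MiB aligned"
```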

The size of the data offset depends on the mdadm version:

  • Note: mdadm's current development version allows the size of the data offset to be specified manually (for --create and --grow, but not for --add): Add --data-offset flag for Create and Grow
  • since mdadm-3.2.5: 128 MiB Data Offset (262144 sectors), if possible: super1: fix choice of data_offset. (14.05.2012): While it is nice to set a high data_offset to leave plenty of head room it is much more important to leave enough space to allow of the data of the array. So after we check that sb->size is still available, only reduce the 'reserved', don't increase it. This fixes a bug where --adding a spare fails because it does not have enough space in it.
  • since mdadm-3.2.4: 128 MiB Data Offset (262144 sectors) super1: leave more space in front of data by default. (04.04.2012): The kernel is growing the ability to avoid the need for a backup file during reshape by being able to change the data offset. For this to be useful we need plenty of free space before the data so the data offset can be reduced. So for v1.1 and v1.2 metadata make the default data_offset much larger. Aim for 128Meg, but keep a power of 2 and don't use more than 0.1% of each device. Don't change v1.0 as that is used when the data_offset is required to be zero.
  • since mdadm-3.1.2: 1 MiB Data Offset (2048 sectors) super1: encourage data alignment on 1Meg boundary (03.03.2010): For 1.1 and 1.2 metadata where data_offset is not zero, it is important to align the data_offset to underlying block size. We don't currently have access to the particular device in avail_size so just try to force to a 1Meg boundary. Also default 1.x metadata to 1.2 as documented. (see also Re: Mixing mdadm versions)

Adjusting the Sync Rate

A RAID volume can be used immediately after creation, even while it is still synchronizing. However, using it reduces the synchronization rate.

In this example, a RAID 1 array spans two SSDs directly (without partitions, on /dev/sda and /dev/sdb). Synchronization starts at roughly 200 MB/s and drops to 2.5 MB/s as soon as data is written to the file system on the RAID 1 array:

[root@localhost ~]# dd if=/dev/urandom of=/mnt/testfile-1-1G bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 115.365 s, 9.3 MB/s
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [============>........]  resync = 63.3% (74208384/117219728) finish=279.5min speed=2564K/sec
      
unused devices: <none>
[root@localhost ~]#

The synchronization can be accelerated by manually increasing the sync rate:[6]

[root@localhost ~]# cat /proc/sys/dev/raid/speed_limit_max
200000
[root@localhost ~]# cat /proc/sys/dev/raid/speed_limit_min 
1000
[root@localhost ~]# echo 100000 > /proc/sys/dev/raid/speed_limit_min 
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [============>........]  resync = 64.2% (75326528/117219728) finish=41.9min speed=16623K/sec
      
unused devices: <none>
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [=============>.......]  resync = 66.3% (77803456/117219728) finish=7.4min speed=88551K/sec
      
unused devices: <none>
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [=============>.......]  resync = 66.4% (77938688/117219728) finish=6.4min speed=101045K/sec
      
unused devices: <none>
[root@localhost ~]#
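To make the raised minimum persistent across reboots, the value can also be set via sysctl: /proc/sys/dev/raid/speed_limit_min corresponds to the sysctl key dev.raid.speed_limit_min. This is a sketch; the configuration file name is arbitrary:

```shell
# Persist the higher minimum sync rate (sketch; file name is arbitrary):
cat > /etc/sysctl.d/90-raid-sync.conf <<'EOF'
dev.raid.speed_limit_min = 100000
EOF
sysctl -p /etc/sysctl.d/90-raid-sync.conf
```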

Deleting a RAID Array

If a RAID volume is no longer required, it can be deactivated using the following command:

[root@localhost ~]# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
[root@localhost ~]#

The superblocks on the individual devices (in this case, /dev/sda1 and /dev/sdb1 from the example above) can then be deleted with the following commands. After that, the partitions can be reused for new RAID arrays.

[root@localhost ~]# mdadm --zero-superblock /dev/sda1
[root@localhost ~]# mdadm --zero-superblock /dev/sdb1

Roadmap

Neil Brown published a roadmap for MD/RAID in 2011 on his blog.

Support for the ATA TRIM feature for SSDs (discard support in Linux software RAID) is discussed periodically. As of the end of June 2011, however, this feature was still at the end of the list of planned features.

References

  1. mdadm (en.wikipedia.org)
  2. ALERT: md/raid6 data corruption risk. (lkml.org, Neil Brown, 18.08.2014)
  3. RAID superblock formats - The version-0.90 Superblock Format (Linux Raid Wiki)
  4. does 3.1 offer (2): Storage and File Systems: Software RAID and Device Mapper (heise Open Kernel Log)
  5. RAID superblock formats - Sub-versions of the version-1 superblock (Linux Raid Wiki)
  6. SSDs vs. md/sync_speed_(min|max) (Lutz Vieweg, linux-raid mailing list, 18.07.2011)

Additional Information

Related articles

Ext4 Filesystem
Linux Page Cache Basics