Linux Software RAID

Linux Software RAID (often also referred to as mdraid or MD/RAID) provides RAID functionality without a hardware RAID controller. The storage devices used (hard disks, SSDs, ...) are simply attached to the computer as individual drives, for example directly to the SATA ports of the mainboard.

Unlike software RAID, hardware RAID controllers usually have a built-in cache (often 512 MB or 1 GB) that can be protected by a BBU or ZMCP (see Differences between Hardware RAID and Linux Software RAID).

How it works

Linux Software RAID supports RAID levels 0, 1, 4, 5, 6 and 10.^[1]

SSD RAIDs

Up to and including kernel 3.11, Linux Software RAID uses only a single thread for RAID5/RAID6 calculations. This can limit the performance of SSD RAIDs. In January 2013 Fusion-io submitted the first patches, which had not yet been reviewed at that time.^[3] Multithreading support for RAID5 was merged into the Linux kernel with version 3.12.

RAID Superblock

Linux Software RAID stores all necessary information about a RAID array in a superblock. Its location depends on the metadata version.

Superblock metadata version 0.90

The version-0.90 superblock is 4,096 bytes in size and lies in a 64 KiB aligned block at the end of a device. Depending on the device size, the superblock starts at the earliest 128 KiB and at the latest 64 KiB before the end of the device. To calculate the address of the superblock, the device size is rounded down to a multiple of 64 KiB and 64 KiB is then subtracted from the result.^[4]
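The address calculation described above can be sketched in a few lines of Python. This is only an illustration of the rounding rule; the function name and the example sizes are hypothetical and do not come from mdadm or the kernel.

```python
ALIGN = 64 * 1024  # the 0.90 superblock lies in a 64 KiB aligned block


def superblock_offset_090(device_size: int) -> int:
    """Return the byte offset of a version-0.90 superblock.

    device_size is the size of the component device in bytes.
    """
    if device_size < ALIGN:
        raise ValueError("device is smaller than one 64 KiB block")
    # Round the device size down to a multiple of 64 KiB ...
    rounded = device_size - (device_size % ALIGN)
    # ... and step back one 64 KiB block from there.
    return rounded - ALIGN


if __name__ == "__main__":
    # Hypothetical sizes: exactly 10 GiB, and 10 GiB plus 65535 bytes.
    for size in (10 * 1024**3, 10 * 1024**3 + 65535):
        offset = superblock_offset_090(size)
        print(size, offset, "bytes before end:", size - offset)
```

For a device whose size is an exact multiple of 64 KiB the superblock starts 64 KiB before the end; for a device that is one byte short of the next multiple it starts just under 128 KiB before the end, which matches the range stated above.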

Limitations of metadata version 0.90:

a maximum of 28 devices per array
each device can be at most 2 TiB in size
no support for bad block management^[5]

Superblock metadata version 1.*

The position of the superblock depends on the metadata version:^[6]

Version 1.0: The superblock is located at the end of the device.
Version 1.1: The superblock is located at the beginning of the device.
Version 1.2: The superblock is located 4 KiB after the beginning of the device.

Creating a RAID

The following example shows how to create a RAID 1. A Fedora 15 live system is used in the example.

Disabling the hard disk cache

As with hardware RAID, it is advisable with software RAID to disable the write cache of hard disks in order to avoid data loss in the event of a power failure. The exception is SSDs with integrated capacitors that still write the cache contents to flash memory when power fails (e.g. Intel DC S3510 Series SSDs).

Preparing the partitions

The software RAID is built on /dev/sda1 and /dev/sdb1. These partitions are of type Linux raid autodetect (fd):

[root@localhost ~]# fdisk -l /dev/sda

Disk /dev/sda: 120.0 GB, 120034123776 bytes
139 heads, 49 sectors/track, 34421 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2d0f2eb3

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    20973567    10485760   fd  Linux raid autodetect
[root@localhost ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
139 heads, 49 sectors/track, 34421 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe69ef1f5

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    20973567    10485760   fd  Linux raid autodetect
[root@localhost ~]#

Creating the RAID 1

[root@localhost ~]# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@localhost ~]#

The progress of the initialization can be queried via the proc file system or with mdadm:

[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      10484664 blocks super 1.2 [2/2] [UU]
      [========>............]  resync = 42.3% (4440832/10484664) finish=0.4min speed=201856K/sec
      
unused devices: <none>
[root@localhost ~]#

[root@localhost ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Jul 26 07:49:50 2011
     Raid Level : raid1
     Array Size : 10484664 (10.00 GiB 10.74 GB)
  Used Dev Size : 10484664 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Jul 26 07:50:23 2011
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 62% complete

           Name : localhost.localdomain:0  (local to host localhost.localdomain)
           UUID : 3a8605c3:bf0bc5b3:823c9212:7b935117
         Events : 11

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
[root@localhost ~]#
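For scripted monitoring, the progress line of /proc/mdstat can be picked apart with a regular expression. The following is a minimal sketch that assumes the line has exactly the layout shown in the output above; the function name and the dictionary keys are hypothetical, and other kernel versions may format the line differently.

```python
import re

# Matches a progress line such as:
#   [========>............]  resync = 42.3% (4440832/10484664) finish=0.4min speed=201856K/sec
PROGRESS_RE = re.compile(
    r"(?P<action>\w+)\s*=\s*(?P<percent>\d+(?:\.\d+)?)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>\d+(?:\.\d+)?)min\s+speed=(?P<speed>\d+)K/sec"
)

SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      10484664 blocks super 1.2 [2/2] [UU]
      [========>............]  resync = 42.3% (4440832/10484664) finish=0.4min speed=201856K/sec

unused devices: <none>
"""


def parse_progress(mdstat_text: str):
    """Return the first progress line as a dict, or None if no sync is running."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "percent": float(match.group("percent")),
        "done_blocks": int(match.group("done")),
        "total_blocks": int(match.group("total")),
        "finish_min": float(match.group("finish")),
        "speed_k_per_sec": int(match.group("speed")),
    }


if __name__ == "__main__":
    print(parse_progress(SAMPLE))
```

On a live system the same function could be fed with the contents of /proc/mdstat instead of the sample text.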

Checking the alignment

The example uses metadata version 1.2. The metadata is therefore located close to the beginning of the devices, while the actual data behind it is aligned to 1 MiB (Data Offset : 2048 sectors, one sector being 512 bytes):

[root@localhost ~]# mdadm -E /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3a8605c3:bf0bc5b3:823c9212:7b935117
           Name : localhost.localdomain:0  (local to host localhost.localdomain)
  Creation Time : Tue Jul 26 07:49:50 2011
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 20969472 (10.00 GiB 10.74 GB)
     Array Size : 20969328 (10.00 GiB 10.74 GB)
  Used Dev Size : 20969328 (10.00 GiB 10.74 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 10384215:18a75991:4f09b97b:1960b8cd

    Update Time : Tue Jul 26 07:50:43 2011
       Checksum : ea435554 - correct
         Events : 18


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
[root@localhost ~]#
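The arithmetic behind the two offsets in this output can be verified with a short Python sketch. It assumes 512-byte sectors, as stated above; the helper names are hypothetical.

```python
SECTOR_SIZE = 512          # bytes per sector, as stated in the text
MIB = 1024 * 1024


def sectors_to_bytes(sectors: int) -> int:
    """Convert a sector count as printed by mdadm -E into bytes."""
    return sectors * SECTOR_SIZE


def is_aligned(offset_sectors: int, boundary_bytes: int = MIB) -> bool:
    """Check whether an offset given in sectors falls on the given byte boundary."""
    return sectors_to_bytes(offset_sectors) % boundary_bytes == 0


if __name__ == "__main__":
    data_offset = 2048   # "Data Offset : 2048 sectors"
    super_offset = 8     # "Super Offset : 8 sectors"
    print("data offset :", sectors_to_bytes(data_offset), "bytes, 1 MiB aligned:", is_aligned(data_offset))
    print("super offset:", sectors_to_bytes(super_offset), "bytes")
```

2048 sectors correspond to 1,048,576 bytes, i.e. exactly 1 MiB, and the super offset of 8 sectors corresponds to 4,096 bytes, which agrees with the 4 KiB position of the version 1.2 superblock.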

The size of the data offset depends on the mdadm version:

mdadm 3.2.5 and later: 128 MiB data offset (262144 sectors), where possible. Commit "super1: fix choice of data_offset." (14.05.2012): While it is nice to set a high data_offset to leave plenty of head room it is much more important to leave enough space to allow of the data of the array. So after we check that sb->size is still available, only reduce the 'reserved', don't increase it. This fixes a bug where --adding a spare fails because it does not have enough space in it.
mdadm 3.2.4 and later: 128 MiB data offset (262144 sectors). Commit "super1: leave more space in front of data by default." (04.04.2012): The kernel is growing the ability to avoid the need for a backup file during reshape by being able to change the data offset. For this to be useful we need plenty of free space before the data so the data offset can be reduced. So for v1.1 and v1.2 metadata make the default data_offset much larger. Aim for 128Meg, but keep a power of 2 and don't use more than 0.1% of each device. Don't change v1.0 as that is used when the data_offset is required to be zero.
mdadm 3.1.2 and later: 1 MiB data offset (2048 sectors). Commit "super1: encourage data alignment on 1Meg boundary" (03.03.2010): For 1.1 and 1.2 metadata where data_offset is not zero, it is important to align the data_offset to underlying block size. We don't currently have access to the particular device in avail_size so just try to force to a 1Meg boundary. Also default 1.x metadata to 1.2 as documented. (see also Re: Mixing mdadm versions)

Note: at the time of writing, the development version of mdadm allows the data offset to be specified manually (for --create and --grow, but not yet for --add): Add --data-offset flag for Create and Grow

Note: problems can occur when RAID arrays are created with different mdadm versions. The Linux Raid Wiki describes such a case:

If upon running the above with the --size parameter you get, as one of the authors of this page did, an error such as: "mdadm: /dev/sdb1 is smaller than given size. xxxK < yyyK + metadata", you may have stumbled upon a problem where the array was initially created with an earlier version of mdadm that reserved less device space. The solution seems to be to find an earlier version of mdadm to run with the creation command above (in this author's case, mdadm from Debian "squeeze" worked while mdadm from Debian "wheezy" refused to recreate the array of the required size).^[7]

The cause is a difference in array size, because earlier mdadm versions reserved less space.
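The rule of thumb quoted from the mdadm 3.2.4 commit message ("Aim for 128Meg, but keep a power of 2 and don't use more than 0.1% of each device") can be illustrated with a small Python sketch. This is a reading of the commit message only, not mdadm's actual code; the function name is hypothetical, and the 1 MiB lower bound is an assumption borrowed from the 1 MiB alignment introduced with mdadm 3.1.2.

```python
MIB = 1024 * 1024


def data_offset_sketch(device_bytes: int,
                       target_bytes: int = 128 * MIB,
                       floor_bytes: int = 1 * MIB) -> int:
    """Illustrative default data offset in bytes for v1.1/v1.2 metadata.

    Starts at 128 MiB and halves (staying a power of two) until the
    offset uses no more than 0.1% of the device. The 1 MiB floor is an
    assumption of this sketch, not something the commit message states.
    """
    limit = device_bytes // 1000          # 0.1% of the device
    offset = target_bytes
    while offset > limit and offset > floor_bytes:
        offset //= 2
    return offset


if __name__ == "__main__":
    # Hypothetical device sizes: 10 GiB and 1 TB.
    for size in (10 * 1024**3, 1000**4):
        print(size, "->", data_offset_sketch(size) // MIB, "MiB")
```

Under these assumptions only devices of roughly 128 GiB and more would receive the full 128 MiB offset, which is consistent with the "where possible" qualification in the list above. The values that a given mdadm version really chooses should be checked with mdadm -E.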

Adjusting the sync rate

A RAID volume can be used immediately after creation, even while synchronization is still running. However, this reduces the sync rate.

In this example, with a RAID 1 directly across two SSDs (no partitions on /dev/sda and /dev/sdb), synchronization starts at about 200 MB/s and drops to 2.5 MB/s as soon as data is written to the file system on the RAID 1:

[root@localhost ~]# dd if=/dev/urandom of=/mnt/testfile-1-1G bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 115.365 s, 9.3 MB/s
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [============>........]  resync = 63.3% (74208384/117219728) finish=279.5min speed=2564K/sec
      
unused devices: <none>
[root@localhost ~]#

Raising the sync rate manually, here by increasing speed_limit_min, speeds up the synchronization again:^[8]

[root@localhost ~]# cat /proc/sys/dev/raid/speed_limit_max
200000
[root@localhost ~]# cat /proc/sys/dev/raid/speed_limit_min 
1000
[root@localhost ~]# echo 100000 > /proc/sys/dev/raid/speed_limit_min 
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [============>........]  resync = 64.2% (75326528/117219728) finish=41.9min speed=16623K/sec
      
unused devices: <none>
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [=============>.......]  resync = 66.3% (77803456/117219728) finish=7.4min speed=88551K/sec
      
unused devices: <none>
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb[1] sda[0]
      117219728 blocks super 1.2 [2/2] [UU]
      [=============>.......]  resync = 66.4% (77938688/117219728) finish=6.4min speed=101045K/sec
      
unused devices: <none>
[root@localhost ~]#
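The finish estimates in the output above can be reproduced approximately from the block counters and the current speed. The following Python sketch assumes that the block counts and the speed in /proc/mdstat use the same unit (K); the function name is hypothetical.

```python
def estimate_finish_minutes(total_blocks: int, done_blocks: int, speed_k_per_sec: float) -> float:
    """Estimate the remaining sync time in minutes.

    Assumes total_blocks, done_blocks and speed_k_per_sec share the same
    unit (K), as the /proc/mdstat output suggests.
    """
    if speed_k_per_sec <= 0:
        raise ValueError("speed must be positive")
    remaining = total_blocks - done_blocks
    return remaining / speed_k_per_sec / 60


if __name__ == "__main__":
    # Values from the mdstat line reporting 64.2% and speed=16623K/sec.
    print(round(estimate_finish_minutes(117219728, 75326528, 16623), 1))
    # Same remaining work at the new speed_limit_min of 100000 K/sec.
    print(round(estimate_finish_minutes(117219728, 75326528, 100000), 1))
```

For the line reporting 64.2%, the estimate comes out close to the finish=41.9min shown by the kernel. At a sustained rate equal to the raised speed_limit_min of 100000, the same remaining work would take about seven minutes.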

Deleting a RAID

If a RAID volume is no longer needed, it can be deactivated with the following command:

[root@localhost ~]# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
[root@localhost ~]#

The superblock of each device (here /dev/sda1 and /dev/sdb1 from the example above) is erased with the following commands. Afterwards the partitions can be used for new RAID arrays.

[root@localhost ~]# mdadm --zero-superblock /dev/sda1
[root@localhost ~]# mdadm --zero-superblock /dev/sdb1

ATA Trim

In August 2012 Shaohua Li submitted a patch set that adds ATA TRIM support (see ATA Trim with Linux Software RAID). ATA Trim with Linux Software RAID is therefore possible as of Linux kernel 3.7.

As of Linux kernel 3.17 (or 3.10.57 for the 3.10 longterm kernel), ATA Trim is disabled by default for RAID4/5/6, because some SSDs report that an ATA Trim reliably erases data and that subsequent reads return zeros (discard_zeroes_data), but do not actually behave that way. A module parameter still allows Trim to be used with SSDs that work correctly.^[9] As of Linux kernel 3.19 there is a whitelist of SSDs known to work reliably (e.g. all Intel SSDs except the 520 series).^[10] When using such SSDs, the administrator may consider setting the module parameter. The patch, however, explicitly points out that this whitelist is based on tests and not on product specifications: This patch whitelists SSDs from a few of the main vendors. None of the whitelists are based on written guarantees. They are purely based on empirical evidence collected from internal and external users that have tested or qualified these drives in RAID deployments.
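Whether the module parameter is currently set can be checked from a script. The sketch below assumes that the raid456 module is loaded and exposes its parameter under the usual /sys/module/<module>/parameters/ layout, and that boolean parameters read back as Y or N; the function name is hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the parameter named in the kernel commit message.
PARAM_PATH = Path("/sys/module/raid456/parameters/devices_handle_discard_safely")


def raid456_discard_enabled(param_path=PARAM_PATH):
    """Return True/False for the parameter value, or None if it cannot be read."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, kernel without the parameter, or no permission.
        return None
    return value == "Y"


if __name__ == "__main__":
    state = raid456_discard_enabled()
    if state is None:
        print("parameter not available")
    else:
        print("discard on raid4/5/6 enabled:", state)
```

The function only reads the value. Setting the parameter remains a deliberate decision of the administrator, since the whitelist is based on empirical tests rather than on written guarantees.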

In July 2015, problems between the RAID and SATA drivers of the Linux kernel became known that can lead to data loss when ATA Trim is used with RAID0 or RAID10.^[11] A patch that adjusts the bio_split function fixes the problem.^[12] The patch will be included in Linux kernel 4.2.^[13]

References

[1] mdadm (en.wikipedia.org)
[2] ALERT: md/raid6 data corruption risk. (lkml.org, Neil Brown, 18.08.2014)
[3] RAID is more than parity and mirrors, threads for RAID5/RAID6 at approx. minute 25 (talk by Neil Brown at LCA 2013)
[4] RAID superblock formats - The version-0.90 Superblock Format (Linux Raid Wiki)
[5] Was 3.1 bringt (2): Storage und Dateisysteme: Software-RAID und Device Mapper (heise Open Kernel Log)
[6] RAID superblock formats - Sub-versions of the version-1 superblock (Linux Raid Wiki)
[7] Linux Raid Wiki: RAID Recovery (raid.wiki.kernel.org)
[8] SSDs vs. md/sync_speed_(min|max) (Lutz Vieweg, linux-raid mailing list, 18.07.2011)
[9] one last-minute update for md/raid5: As we cannot trust 'discard_zeroes_data', ignore it by default and so disallow DISCARD on all raid4/5/6 arrays. As many devices are trustworthy, and as there are benefits to using DISCARD, add a module parameter to over-ride this caution and cause DISCARD to work if discard_zeroes_data is set. If a site want to enable DISCARD on some arrays but not on others they should select DISCARD support at the filesystem level, and set the raid456 module parameter. raid456.devices_handle_discard_safely=Y
[10] libata: Whitelist SSDs that are known to properly return zeroes after TRIM (git.kernel.org)
[11] Linux: RAID und SSD-TRIM können zu Datenverlust führen (www.admin-magazin.de, 24.07.2015)
[12] block: Do a full clone when splitting discard bios (git.kernel.org, 23.07.2015)
[13] (GIT PULL) Block fixes for 4.2-rc3 (Linux Kernel Mailing List)

Further information

The Software-RAID HOWTO
Linux Raid Wiki (raid.wiki.kernel.org)
RAID Setup (raid.wiki.kernel.org)
Festplattenpuzzles - Tipps und Tricks rund um Linux-Software-RAID (c't Magazin 06/2013, page 184)
Workshop - Software-RAID unter Linux einrichten (tecchannel.de, 17.04.2011)
Quick HOWTO : Ch26 : Linux Software RAID (linuxhomenetworking.com)
linux-raid Mailing List
Ubuntu server installation with software RAID

Author: Werner Fischer

Werner Fischer works in the Product Management team at Thomas-Krenn. He evaluates the latest technologies and shares his knowledge in technical articles, at conferences and in the Thomas-Krenn Wiki. In 2005, one year after completing his degree in Computer and Media Security at FH Hagenberg, he joined the Bavarian server manufacturer. A fan of public transport, he likes to travel by bus and train and enjoys his morning walk to the office.
