ZFS dRAID Basics and Configuration

dRAID is a distributed-spare RAID implementation for ZFS. A dRAID vdev is composed of RAIDZ groups and is supported in Proxmox VE from version 7.3 onwards. The special feature of dRAID is its distributed hot spares.

This article provides an overview of the dRAID technology and instructions on how to set up a vdev based on dRAID on Proxmox 7.3.

Functionality

dRAID aims to distribute both parity and spare capacity across all disks in the array. The characteristics of such an array are explained below.

Key Features

Simple representation of a slice: the groups are RAIDZ-1, -2 or -3 arrays with static stripe length.

A dRAID consists of

  1. RAIDZ groups
  2. (distributed) spare(s)

These are arranged in so-called slices (see below). The RAIDZ groups are specified through the definition of the dRAID - a dRAID1 uses RAIDZ-1 groups, a dRAID3 uses RAIDZ-3 groups.

When defining a dRAID, the parameters data, parity and spares therefore always come into play. Optionally, the total number of attached disks (called children in dRAID) can also be specified.

Possible values are - analogous to RAIDZ levels - 1, 2 or 3 parities. The number of spares can be freely determined, the default value is 0. Data components are freely selectable up to a maximum. This maximum is determined by the total number of children minus the desired parities and spares within the array.

Here are some examples of possible definitions:

Children (total disks)   Definition           Parity   Data   Spares
 4                       dRAID:2d:1s:4c       1        2      1
11                       dRAID2:7d:2s:11c     2        7      2
26                       dRAID3:8d:2s:26c     3        8      2
50                       dRAID2:7d:3s:50c     2        7      3

The stripe length of a dRAID is fixed, in contrast to a conventional RAIDZ. The stripe width is the product of the disk sector size and the number of data portions, e.g. a disk sector size of 4k and 7 data portions (see example 2 above) result in a stripe width of 7 * 4k = 28k.

Layout

A dRAID is made up of several interacting parts[1]:

child: Identifies a single disk (volume) within a dRAID array.
row: A 16 MB chunk that has the same offset on all volumes.
group: Identifies the RAIDZ groups within the dRAID. They contain data and parity. If necessary, groups can be distributed across rows.
slice: The frame for groups and rows within a dRAID array. A slice is complete as soon as the groups fill its rows completely.
device map: Determines the distribution of data, parity and spares among the children in the array. It is divided into rows and columns.

The device map is defined for each slice. Therefore it is useful to clarify the structure of a slice:

[Diagram: structure of a slice]

Each row within the device map represents a child in the dRAID. In each slice the map is randomly permuted, i.e. the children are rearranged. This ensures that data, parity and spares are randomly distributed among all children and avoids hot spots within the RAID set. The spare slots are static within a slice, but they are distributed over the individual slices.

When including groups and rows, the whole picture of a dRAID comes together.

Let's consider the example of a dRAID2:4d:2s:11c. This definition means that the array is formed by 11 children. Each group contains 4 data and 2 parity portions, which corresponds to a RAIDZ-2. Additionally, 2 spares are to be provided. This leaves 9 children per row for data and parity, which is not a multiple of the group width of 6, so groups must be distributed over several rows in this dRAID array. The best way to illustrate this is through a diagram:

[Diagram: dRAID2:4d:2s:11c with groups distributed across rows]

As long as the sum of data, parity and spares does not exceed the number of children, a dRAID can always be formed.

The number of groups within a slice depends on when a row is complete and without remainder. The necessary groups within an array can be determined with the help of a short calculation.

Calculating groups per slice

The number of columns in a group is given by the sum of data and parity. The difference between children and spares determines how many columns per row must be filled by groups. The least common multiple of these two values gives the total number of group columns after which the rows are filled completely and without remainder; dividing it by the group width yields the number of groups:

n := groups per slice, d := data, p := parity, c := children, s := spares
n * (d + p) = lcm(d + p, c - s)

In our example this results in:

n * (d + p) = lcm(d + p, c - s)
n *    6    = lcm(6, 9)
n *    6    = 18
n           = 3

So 3 groups are necessary.
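
The same calculation applied to the dRAID3:8d:2s:26c from the table above looks as follows:

n * (d + p) = lcm(d + p, c - s)
n *   11    = lcm(11, 24)
n *   11    = 264
n           = 24

So 24 groups are needed per slice, spread over 264 / 24 = 11 rows.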

Comparison with other RAID types

The following table compares dRAID with other RAID types. It should be noted that the diagrammatic representation of dRAID is highly simplified and the device map is not taken into account.

Designation       File system / volume manager   Minimum (data + parity)   Stripe length   Parity      Allocation (parity)   Allocation (hot spare)
RAID-5            -                              3                         Static          1           Distributed           Static (single device)
RAIDZ-1 [2, 3]    OpenZFS                        3                         Dynamic         1 [2, 3]    Distributed           Static (single device)
dRAID-1 [2, 3]    OpenZFS (as of 2.1.0)          3                         Static          1 [2, 3]    Distributed           Distributed

Setup

This section covers the setup of a dRAID. It works for all distributions that provide at least OpenZFS 2.1.0.

It is advisable to create a dRAID using the SCSI IDs of the disks, which makes it easier to replace disks later. A configuration using SCSI IDs is easier via the GUI, but it is also possible via the terminal.
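
To get an overview beforehand, the stable device IDs can be listed in the terminal; this is only a sketch, the actual names depend on the controllers and disks used:

# ls -l /dev/disk/by-id/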

Terminal

The setup via the terminal is very simple and requires only one command.

The setup succeeds with[2]

# zpool create <options> <pool> draid[<parity>][:<data>d][:<children>c][:<spares>s] <vdevs...>

Explanation:

Parameter            Description
zpool                Utility for administering ZFS storage pools.
create               Subcommand to create a zpool.
<options>            The regular options for a zpool can be set here.
<pool>               Name of the zpool to be created.
draid[<parity>]      Keyword defining the dRAID vdev. Possible parities are 1 (default), 2 and 3.
[:<data>d]           Number of data devices per group. A lower value increases IOPS, improves the compression ratio and the resilver time, but also reduces the usable capacity (default 8).
[:<children>c]       Total number of disks (children) to be used.
[:<spares>s]         Number of distributed spare devices (default 0).
<vdevs...>           The disks to be used in the dRAID.

Note: [:d][:c][:s] can be used in any order.

This is an example configuration:
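
A minimal sketch that creates the dRAID2:4d:2s:11c layout discussed above; the pool name tank and the device names /dev/sdb to /dev/sdl are assumptions and must be adapted to the system at hand:

# zpool create tank draid2:4d:2s:11c /dev/sd[b-l]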

GUI

A dRAID is a regular zpool and is set up as usual:

Pool Comparison

In the following gallery you can see the difference between the (simple) setup via the terminal and the setup via the GUI:

Disk Failure

This section describes the resilvering and rebalancing of a dRAID in the event of a failure within the array.

Resilver

Thanks to the spares contained in the array, resilvering in a dRAID starts automatically as soon as the ZFS Event Daemon (ZED)[4] notices a failure. The following series of images illustrates the process within the Proxmox environment:
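
The progress of such an automatic resilver can also be followed in the terminal at any time; a sketch, assuming the pool is named tank:

# zpool status tank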

Types of Resilvering

dRAID supports both sequential resilver[5] and traditional healing resilver.

The advantage of a sequential resilver onto distributed hot spares is that the rebuild performance scales with the quotient of the number of disks and the stripe width[6].

By default, sequential resilver is enabled. This can be retrieved with the command[7]

# zpool get feature@device_rebuild

In our case, we get the following output:

# zpool get feature@device_rebuild
NAME   PROPERTY                VALUE                   SOURCE
rpool  feature@device_rebuild  enabled                 local
tank   feature@device_rebuild  enabled                 local

This value must be set when the array is created. A subsequent change is not possible:

# zpool set feature@device_rebuild=disabled tank
cannot set property for 'tank': property 'feature@device_rebuild' can only be set to 'disabled' at creation time

The correct command to create a dRAID without sequential resilver is:

# zpool create -o feature@device_rebuild=disabled <pool> ...

Example:

root@pve73-test:~# zpool create -o feature@device_rebuild=disabled tank draid:2d:2s:5c /dev/sd[d-h]
root@pve73-test:~# zpool get feature@device_rebuild
NAME   PROPERTY                VALUE                   SOURCE
rpool  feature@device_rebuild  enabled                 local
tank   feature@device_rebuild  disabled                local
root@pve73-test:~# 

Rebalance

This section describes the rebalancing of a dRAID array after replacing a failed disk. If a disk with the same SCSI ID is installed in the system, a rebalance takes place automatically. If a disk with a different SCSI ID is to be used, this is not possible via the GUI in the current Proxmox version 7.3-3, but only via the terminal.

The command for adding the replacement drive to the array:

# zpool replace <pool> <device-old> <device-new>
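
A sketch with hypothetical device names: the failed disk sdf in the pool tank is replaced by the newly installed disk sdm (stable /dev/disk/by-id/ paths can be used instead of the sdX names):

# zpool replace tank /dev/sdf /dev/sdm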

The following series of pictures illustrates the process:

Further Information


References


Author: Stefan Bohn

Stefan Bohn has been employed at Thomas-Krenn.AG since 2020. Originally based in PreSales as a consultant for IT solutions, he moved to Product Management in 2022. There he dedicates himself to knowledge transfer and also drives the Thomas-Krenn Wiki.

Related articles

iSCSI Basics