Analyzing NVMe I/O error messages in Linux

From Thomas-Krenn-Wiki

If NVMe I/O errors occur on a Linux system, the kernel logs error information that can be queried with the dmesg command. This article shows how to analyze an NVMe I/O error message.

Example error: I/O Error (sct 0x0 / sc 0x4) DNR

Completion Queue Entry Status Field: Figure 100 of the NVM Express® Base Specification Revision 2.3[1] explains the meaning of the bits Do Not Retry (DNR), More (M), Command Retry Delay (CRD), Status Code Type (SCT), and Status Code (SC).

On a test system with Linux kernel 6.8.0-55-generic #57-Ubuntu, I/O tests with fio abort due to an I/O error. The dmesg output shows the following:

[74942.261934] nvme0c0n1: I/O Cmd(0x1) @ LBA 218630656, 256 blocks, I/O Error (sct 0x0 / sc 0x4) DNR
[74942.262623] I/O error, dev nvme0c0n1, sector 218630656 op 0x1:(WRITE) flags 0x2008800 phys_seg 17 prio class 0

The following parts of the first line are particularly interesting:

  • I/O error
  • sct 0x0
  • sc 0x4
  • DNR
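
These fields can also be pulled out of the log line with a script. The following is a minimal sketch that assumes the line format shown above; parse_nvme_error is a hypothetical helper name and not part of any existing tool.

```shell
#!/bin/bash
# Extract sct, sc and the DNR flag from an NVMe error line as printed by dmesg.
# Assumption: the line contains "(sct 0x.. / sc 0x..)" as in the example above.
# parse_nvme_error is a hypothetical helper name.
parse_nvme_error() {
    local line=$1
    local re='\(sct (0x[0-9a-fA-F]+) / sc (0x[0-9a-fA-F]+)\)(.*)$'
    if [[ $line =~ $re ]]; then
        local sct=${BASH_REMATCH[1]}
        local sc=${BASH_REMATCH[2]}
        local rest=${BASH_REMATCH[3]}
        local dnr=0
        # DNR is printed after the closing parenthesis when the bit is set
        [[ $rest == *DNR* ]] && dnr=1
        echo "sct=$sct sc=$sc dnr=$dnr"
    else
        return 1
    fi
}

parse_nvme_error '[74942.261934] nvme0c0n1: I/O Cmd(0x1) @ LBA 218630656, 256 blocks, I/O Error (sct 0x0 / sc 0x4) DNR'
```

For the example line, the function prints sct=0x0 sc=0x4 dnr=1. Lines that do not match the assumed format make it return a non-zero exit status.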

The meaning of these values is documented in the NVM Express Base Specification[2]. Chapter 4.2.3 (Status Field Definition) of the NVM Express® Base Specification Revision 2.3[1] describes the individual elements of the status field as follows:

  • SCT (3 bits) - Status Code Type - indicates the type of status code returned by the controller (the NVMe SSD).
  • SC (8 bits) - Status Code - (see details below).
  • CRD (2 bits) - Command Retry Delay - If the DNR bit is set to "1", this field is reserved. (See the NVM Express Base Specification for more details.)
  • M (1 bit) - More - If this bit is set to "1", additional status information about this command is available in the Error Information log page, which can be retrieved with the Get Log Page command. If this bit is set to "0", there is no additional status information about this command.
  • DNR (1 bit) - Do Not Retry - If this bit is set to "1", the same command is expected to fail if it is sent again to a controller in the NVM subsystem. If this bit is set to "0", the same command may succeed if it is retried.
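
The bit widths listed above add up to a 15-bit value. The following sketch splits such a value into its fields. It assumes the layout with the phase tag bit already removed (SC in bits 7:0, SCT in bits 10:8, CRD in bits 12:11, M in bit 13, DNR in bit 14); decode_nvme_status is a hypothetical helper name.

```shell
#!/bin/bash
# Split a 15-bit NVMe status value into SC, SCT, CRD, M and DNR.
# Assumption: the phase tag bit has already been shifted out, so SC starts at bit 0.
# decode_nvme_status is a hypothetical helper name.
decode_nvme_status() {
    local status=$(( $1 ))
    local sc=$((   status        & 0xFF ))  # bits 7:0   Status Code
    local sct=$(( (status >> 8)  & 0x7  ))  # bits 10:8  Status Code Type
    local crd=$(( (status >> 11) & 0x3  ))  # bits 12:11 Command Retry Delay
    local m=$((   (status >> 13) & 0x1  ))  # bit 13     More
    local dnr=$(( (status >> 14) & 0x1  ))  # bit 14     Do Not Retry
    printf 'sct 0x%x sc 0x%x crd %d m %d dnr %d\n' "$sct" "$sc" "$crd" "$m" "$dnr"
}

# sct 0x0, sc 0x4 and DNR set, as in the example error, correspond to 0x4004
decode_nvme_status 0x4004
```

For 0x4004 the function prints sct 0x0 sc 0x4 crd 0 m 0 dnr 1, which matches the fields in the kernel message.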

Status code type

Figure 101 explains the four status code types: Generic Command Status, Command Specific Status, Media and Data Integrity Errors, and Path Related Status.

The NVM Express Base Specification defines the following status code types:

Status Code Type (sct)    Definition
0                         Generic Command Status
1                         Command Specific Status
2                         Media and Data Integrity Errors
3                         Path Related Status
4, 5, 6                   Reserved
7                         Vendor Specific

In the example above (I/O error (sct 0x0 / sc 0x4) DNR), the status code type is 0 (sct 0x0).
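
The table above can be expressed as a small lookup function, for example to annotate log lines in a script. This is only a sketch; sct_name is a hypothetical helper name.

```shell
#!/bin/bash
# Map a status code type (sct) value to its name from the table above.
# sct_name is a hypothetical helper name.
sct_name() {
    case $(( $1 )) in
        0) echo "Generic Command Status" ;;
        1) echo "Command Specific Status" ;;
        2) echo "Media and Data Integrity Errors" ;;
        3) echo "Path Related Status" ;;
        4|5|6) echo "Reserved" ;;
        7) echo "Vendor Specific" ;;
        *) echo "invalid (sct is a 3-bit field)"; return 1 ;;
    esac
}

sct_name 0x0
```

The argument may be given in decimal or hexadecimal, because bash arithmetic expansion accepts both.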

Status code

The meaning of the status code depends on the status code type it is reported with.

Status code (Generic Command Status)

Figure 102: Status Code meanings of Generic Command Status information. The status code is sc 0x4 in the example, which means Data Transfer Error.

The meaning of the status code in a generic command status (sct=0) is defined in chapter 4.2.3.1. Here is an excerpt:

Status Code    Definition
0              Successful Completion: The command completed without error.
1              Invalid Command Opcode: A reserved coded value or an unsupported value in the command opcode field.
2              Invalid Field in Command: A reserved coded value or an unsupported value in a defined field (other than the opcode field).
3              Command ID Conflict: The command identifier is already in use. Note: How many commands are searched for a conflict is implementation specific.
4              Data Transfer Error: An error occurred while transferring the data or metadata associated with a command.
...

In the example above (I/O error (sct 0x0 / sc 0x4) DNR), the status code is 4 (sc 0x4), which indicates a data transfer error.
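
For scripted analysis, the excerpt above can be stored in an array indexed by the status code. The sketch below covers only the five codes listed in this article and only applies when sct is 0; generic_sc_name and the array name are hypothetical.

```shell
#!/bin/bash
# Names of the generic command status codes (sct = 0) listed in the excerpt above.
# Only codes 0 to 4 are included; all other codes must be looked up in the specification.
generic_sc=(
    "Successful Completion"
    "Invalid Command Opcode"
    "Invalid Field in Command"
    "Command ID Conflict"
    "Data Transfer Error"
)

# generic_sc_name is a hypothetical helper name.
generic_sc_name() {
    local sc=$(( $1 ))
    if (( sc >= 0 && sc < ${#generic_sc[@]} )); then
        echo "${generic_sc[sc]}"
    else
        echo "not part of this excerpt, see the specification"
    fi
}

generic_sc_name 0x4
```

Called with 0x4, the function prints Data Transfer Error.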

Status code (Command specific status)

The meaning of the status code of a command specific status (sct = 1) is defined in chapter 4.2.3.2. In this example, we will not go into detail about this.

Interpretation of status information

The status information "data transfer error" indicates problems during data transmission. Possible causes include cables, connectors, or backplanes.

The detailed output of lspci shows that the affected NVMe SSD is connected via PCIe 5.0 (32GT/s):

# lspci -s 81:00.0 -vvv
81:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0013 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: KIOXIA Corporation Device 0045
        Physical Slot: 1
        [...]
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                [...]
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 32GT/s, Width x4
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                [...]
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot

In this example, a backplane with an OCuLink socket is used. However, according to its specification, OCuLink supports only PCIe 4.0, not PCIe 5.0.

To limit the connection to PCIe 4.0 speed (16GT/s), you can either use the setpci tool directly with the respective parameters or use the pci_set_speed.sh script by Alex Forencich:[3]

# ./pci_set_speed.sh 0000:81:00.0 4
Link capabilities: 007b7845
Max link speed: 5
Link status: 7045
Current link speed: 5
Configuring 0000:80:01.1...
Original link control 2: 001e0005
Original link target speed: 5
New target link speed: 4
New link control 2: 001e0004
Triggering link retraining...
Original link control: 70450040
New link control: 70450060
Link status: 7044
Current link speed: 4
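
The second argument passed to the script (4) and the "link speed" numbers in its output are link speed codes, not GT/s values. The following sketch maps the codes to the rates that lspci prints. It assumes the usual PCIe encoding for codes 1 to 5 and leaves out anything beyond PCIe 5.0; link_speed_gts is a hypothetical helper name.

```shell
#!/bin/bash
# Map a PCIe link speed code (as used by the script above) to a transfer rate.
# Assumption: usual PCIe encoding, 1 = PCIe 1.0 (2.5GT/s) up to 5 = PCIe 5.0 (32GT/s).
# link_speed_gts is a hypothetical helper name.
link_speed_gts() {
    case $(( $1 )) in
        1) echo "2.5GT/s" ;;
        2) echo "5GT/s" ;;
        3) echo "8GT/s" ;;
        4) echo "16GT/s" ;;
        5) echo "32GT/s" ;;
        *) echo "not covered here"; return 1 ;;
    esac
}

echo "before: $(link_speed_gts 5), after: $(link_speed_gts 4)"
```

With the values from the output above, this prints before: 32GT/s, after: 16GT/s.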

After executing the script, the PCIe link speed is reduced (LnkSta: Speed 16GT/s (downgraded)):

root@xfuel:~# lspci -s 81:00.0 -vvv
81:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0013 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: KIOXIA Corporation Device 0045
        Physical Slot: 1
        [...]
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                [...]
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (downgraded), Width x4
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                [...]
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
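
To check the negotiated speed without reading the full output, the LnkSta line can be filtered. The following sketch reads lspci -vvv output on standard input and assumes the LnkSta format shown above; lnksta_speed is a hypothetical helper name.

```shell
#!/bin/bash
# Print the negotiated link speed from "lspci -vvv" output read on stdin,
# followed by the annotation lspci adds in parentheses (e.g. "downgraded").
# If there is no annotation, "ok" is printed instead.
# lnksta_speed is a hypothetical helper name.
lnksta_speed() {
    local line
    local re='LnkSta:[[:space:]]*Speed ([0-9.]+GT/s)( \(([a-z]+)\))?'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            echo "${BASH_REMATCH[1]} ${BASH_REMATCH[3]:-ok}"
            return 0
        fi
    done
    return 1   # no LnkSta line found
}

# Demonstration with the LnkSta line from the output above
printf '%s\n' '                LnkSta: Speed 16GT/s (downgraded), Width x4' | lnksta_speed
```

On a real system, the function would be used as lspci -s 81:00.0 -vvv | lnksta_speed, which requires sufficient privileges for lspci to read the capability registers.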

Subsequent fio tests showed no further problems. The data transfer error was caused by running the link at PCIe 5.0 speed (32GT/s) over components (backplanes) that are only approved up to PCIe 4.0.

Script for limiting the PCIe transmission rate

Here is the complete code of the bash script by Alex Forencich (license: CC-BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/):[3]

    #!/bin/bash
     
    dev=$1
    speed=$2
     
    if [ -z "$dev" ]; then
        echo "Error: no device specified"
        exit 1
    fi
     
    if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
        dev="0000:$dev"
    fi
     
    if [ ! -e "/sys/bus/pci/devices/$dev" ]; then
        echo "Error: device $dev not found"
        exit 1
    fi
     
    pciec=$(setpci -s $dev CAP_EXP+02.W)
    pt=$((("0x$pciec" & 0xF0) >> 4))
     
    port=$(basename $(dirname $(readlink "/sys/bus/pci/devices/$dev")))
     
    if (($pt == 0)) || (($pt == 1)) || (($pt == 5)); then
        dev=$port
    fi
     
    lc=$(setpci -s $dev CAP_EXP+0c.L)
    ls=$(setpci -s $dev CAP_EXP+12.W)
     
    max_speed=$(("0x$lc" & 0xF))
     
    echo "Link capabilities:" $lc
    echo "Max link speed:" $max_speed
    echo "Link status:" $ls
    echo "Current link speed:" $(("0x$ls" & 0xF))
     
    if [ -z "$speed" ]; then
        speed=$max_speed
    fi
     
    if (($speed > $max_speed)); then
        speed=$max_speed
    fi
     
    echo "Configuring $dev..."
     
    lc2=$(setpci -s $dev CAP_EXP+30.L)
     
    echo "Original link control 2:" $lc2
    echo "Original link target speed:" $(("0x$lc2" & 0xF))
     
    lc2n=$(printf "%08x" $((("0x$lc2" & 0xFFFFFFF0) | $speed)))
     
    echo "New target link speed:" $speed
    echo "New link control 2:" $lc2n
     
    setpci -s $dev CAP_EXP+30.L=$lc2n
     
    echo "Triggering link retraining..."
     
    lc=$(setpci -s $dev CAP_EXP+10.L)
     
    echo "Original link control:" $lc
     
    lcn=$(printf "%08x" $(("0x$lc" | 0x20)))
     
    echo "New link control:" $lcn
     
    setpci -s $dev CAP_EXP+10.L=$lcn
     
    sleep 0.1
     
    ls=$(setpci -s $dev CAP_EXP+12.W)
     
    echo "Link status:" $ls
    echo "Current link speed:" $(("0x$ls" & 0xF))
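
The register arithmetic in the script can be retraced with the values from the sample output further above. The sketch below only repeats the calculations and does not touch any hardware. The Target Link Speed occupies the lowest four bits of Link Control 2, and 0x20 is the Retrain Link bit of Link Control. Because the script reads 32 bits at CAP_EXP+10, the upper half of that value is the Link Status register (7045 in the example).

```shell
#!/bin/bash
# Retrace the calculations of the script with the values from the sample output.
# No setpci calls are made; the register contents are copied from the article.

lc2=001e0005   # "Original link control 2"
speed=4        # requested target speed code (16GT/s)

# Target Link Speed is stored in bits 3:0 of Link Control 2
printf 'old target speed: %d\n' $(( 0x$lc2 & 0xF ))

# Clear bits 3:0 and insert the new speed code
lc2n=$(printf '%08x' $(( (0x$lc2 & 0xFFFFFFF0) | speed )))
echo "new link control 2: $lc2n"

lc=70450040    # "Original link control" (Link Status in the upper 16 bits)

# Set bit 5 (0x20, Retrain Link) to trigger link retraining
lcn=$(printf '%08x' $(( 0x$lc | 0x20 )))
echo "new link control: $lcn"
```

The results 001e0004 and 70450060 match the "New link control 2" and "New link control" lines in the sample output.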


Author: Werner Fischer



Translator: Alina Ranzinger
