Western Digital SN640 firmware updates R1110021 and R1410004

Western Digital has released the firmware updates R1110021 (SE/ISE variants) and R1410004 (TCG variant) for SN640 NVMe SSDs. On the one hand, the updates are a maintenance release with general ongoing improvements. In addition, they contain a bug fix for timeout errors that could occur in rare cases with earlier firmware versions.

Problem description

With earlier firmware versions, SSD timeouts can occur in rare cases, which can lead to an SSD failure. The following log excerpt from a Proxmox system with Ceph shows such a case:

[...]
Mar 23 23:48:43 node03 kernel: nvme nvme2: I/O 512 QID 23 timeout, aborting
Mar 23 23:48:43 node03 kernel: nvme nvme2: I/O 513 QID 23 timeout, aborting
Mar 23 23:48:43 node03 kernel: nvme nvme2: I/O 514 QID 23 timeout, aborting
Mar 23 23:48:43 node03 kernel: nvme nvme2: I/O 515 QID 23 timeout, aborting
Mar 23 23:48:49 node03 ceph-osd[3189]: 2022-03-23T23:48:49.471+0100 7f9d339b1700 -1 osd.11 6687 heartbeat_check: no reply from 192.168.75.92:6808 osd.10 since back 2022-03-23T23:48:23.068194+0100 front 2022-03-23T23:48:23.068173+0100 (oldest deadline 2022-03-23T23:48:48.967935+0100)
Mar 23 23:48:50 node03 pvestatd[3341]: VM 213 qmp command failed - VM 213 qmp command 'query-proxmox-support' failed - got timeout
Mar 23 23:48:51 node03 pvestatd[3341]: status update time (6.489 seconds)
Mar 23 23:48:55 node03 pmxcfs[2768]: [dcdb] notice: data verification successful
Mar 23 23:49:14 node03 kernel: nvme nvme2: I/O 512 QID 23 timeout, reset controller
Mar 23 23:49:45 node03 kernel: nvme nvme2: I/O 0 QID 0 timeout, reset controller
Mar 23 23:50:25 node03 kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Mar 23 23:50:25 node03 kernel: nvme nvme2: Abort status: 0x371
Mar 23 23:50:25 node03 kernel: nvme nvme2: Abort status: 0x371
Mar 23 23:50:25 node03 kernel: nvme nvme2: Abort status: 0x371
Mar 23 23:50:25 node03 kernel: nvme nvme2: Abort status: 0x371
Mar 23 23:50:35 node03 kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Mar 23 23:50:35 node03 kernel: nvme nvme2: Removing after probe failure status: -19
Mar 23 23:50:46 node03 ceph-osd[3176]: 2022-03-23T23:50:46.250+0100 7f6ac4847700 -1 bdev(0x56336572e400 /var/lib/ceph/osd/ceph-10/block) _aio_thread got r=-5 ((5) Input/output error)
Mar 23 23:50:46 node03 ceph-osd[3176]: 2022-03-23T23:50:46.250+0100 7f6abd029700 -1 bdev(0x56336572ec00 /var/lib/ceph/osd/ceph-10/block) _sync_write sync_file_range error: (5) Input/output error
Mar 23 23:50:46 node03 ceph-osd[3176]: 2022-03-23T23:50:46.250+0100 7f6ac3044700 -1 bdev(0x56336572ec00 /var/lib/ceph/osd/ceph-10/block) _aio_thread got r=-5 ((5) Input/output error)
Mar 23 23:50:46 node03 ceph-osd[3176]: 2022-03-23T23:50:46.250+0100 7f6ac684b700 -1 osd.10 6687 get_health_metrics reporting 153 slow ops, oldest is osd_op(client.23432791.0:11226906 4.4c 4:3305dcbc:::rbd_data.fa14dd8c319a13.0000000000003910:head [sparse-read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e6687)
Mar 23 23:50:46 node03 kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1308340248 op 0x1:(WRITE) flags 0x8800 phys_seg 32 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1852155432 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 2743651928 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1029435456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 23 23:50:46 node03 kernel: nvme2n1: detected capacity change from 3750748848 to 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1446553816 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 690010880 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1424451176 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 3461494728 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1424458824 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Mar 23 23:50:46 node03 kernel: blk_update_request: I/O error, dev nvme2n1, sector 1424458984 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Mar 23 23:50:46 node03 kernel: Buffer I/O error on dev dm-1, logical block 180818971, lost async page write
Mar 23 23:50:46 node03 kernel: Buffer I/O error on dev dm-1, logical block 180818972, lost async page write
[...]
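
If a drive shows such timeouts, its SMART and error logs can provide additional context before any recovery attempt. The following nvme-cli calls are only a sketch; the device name /dev/nvme2 is taken from the log above and has to be adapted:

# Read the drive's SMART data (media errors, temperature, etc.)
nvme smart-log /dev/nvme2
# Read the drive's persistent error log entries
nvme error-log /dev/nvme2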

The following errors may occur during a restart attempt:

Example of a corrupt NVMe SSD under Proxmox

Solution

An affected SSD can be recovered by formatting it - see the section Recover SSD by formatting. The data on the SSD is lost in the process and must be restored afterwards.

Afterwards, the firmware should be updated so that the problem does not occur again. The following firmware versions fix the problem; we recommend updating all SN640 SSDs that are running older firmware versions:

Type                          Firmware version   Published   Download
Western Digital SN640 TCG     R1410004           02/2022     here
Western Digital SN640 SE      R1110021           04/2021
Western Digital SN640 ISE     R1110021           04/2021     here
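
Which firmware a drive currently runs can be checked, for example, with nvme id-ctrl. The following is a minimal sketch; the device name is only an example:

# Show model number (mn) and firmware revision (fr) of a single drive
nvme id-ctrl /dev/nvme0 | grep -E "^(mn|fr) "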

Update firmware

Linux

Under Linux, the nvme-cli tool can be used to update the firmware. Under Debian and Ubuntu, the tool can be installed via the package manager; installation instructions for other supported distributions can be found on GitHub. The new firmware can then be installed with the following commands:

apt-get install nvme-cli                     # install the nvme-cli tool (Debian/Ubuntu)
nvme fw-download /dev/nvmeXY --fw=FW.vpkg    # transfer the firmware image to the drive
nvme fw-commit /dev/nvmeXY -a 1              # commit the image; it becomes active after the next reset

The new firmware is only active after a reboot of the server.
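
Optionally, the firmware slots can be inspected with nvme fw-log before and after the reboot. This is only a short sketch; the exact output depends on the drive:

# Show the firmware slot information log (active slot and stored revisions)
nvme fw-log /dev/nvmeXY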

Example under Proxmox

root@pve:~# apt-get install nvme-cli
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  nvme-cli
0 upgraded, 1 newly installed, 0 to remove and 86 not upgraded.
Need to get 0 B/247 kB of archives.
After this operation, 570 kB of additional disk space will be used.
Selecting previously unselected package nvme-cli.
(Reading database ... 46034 files and directories currently installed.)
Preparing to unpack .../nvme-cli_1.7-1_amd64.deb ...
Unpacking nvme-cli (1.7-1) ...
Setting up nvme-cli (1.7-1) ...
Processing triggers for man-db (2.8.5-2) ...

root@pve:~# nvme fw-download /dev/nvme0 --fw=DCSN640_GR_R1410004.vpkg
Firmware download success

root@pve:~# nvme fw-commit /dev/nvme0 -a 1
Success committing firmware action:1 slot:0

Check firmware

After rebooting the server, "nvme list" can be used to check whether the new firmware is active.

root@jrag-pve-node02:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     A06BFXYZ             WUS4BB019D7P3E4                          1           1.92  TB /   1.92  TB    512   B +  0 B   R1410004
/dev/nvme1n1     A06BFXYZ             WUS4BB019D7P3E4                          1           1.92  TB /   1.92  TB    512   B +  0 B   R1410004
/dev/nvme2n1     A06F1XYZ             WUS4BB019D7P3E4                          1           1.92  TB /   1.92  TB    512   B +  0 B   R1410004
/dev/nvme3n1     A06F1XYZ             WUS4BB019D7P3E4                          1           1.92  TB /   1.92  TB    512   B +  0 B   R1410004
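
To quickly compare the firmware revisions of all drives, the output can also be reduced to the device node and the FW Rev column. This sketch assumes the column layout shown above, which may differ between nvme-cli versions:

# Print only the device node and the firmware revision (last column)
nvme list | awk 'NR>2 {print $1, $NF}'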

Windows

This section will follow.

Linux Script

The GitHub repository SN640-FW-Update of Thomas-Krenn.AG provides a bash script that can be used to update the firmware of all SN640 NVMe SSDs in a system. Based on the installed firmware revision, the script detects the SSD variant and whether an older firmware is installed (a simplified sketch of this logic is shown after the example output below).

root@pve:~# ./FW-Update_SN640.sh 
Install needed programs (unzip, nvme-cli).
nvme-cli is already the newest version (1.7-1).
0 upgraded, 0 newly installed, 0 to remove and 86 not upgraded.
Download firmware files
Download completed


########################################
#Start update process - 5 NVMe found#
########################################

------------------------------------------------------------------------------
Firmware or NVMe is not known. Installed FW: 0105
Remaining NVMe: 4

------------------------------------------------------------------------------
Firmware of /dev/nvme3n1 is current: R1410004
Remaining NVMe: 3

------------------------------------------------------------------------------
Firmware needs to be updated for /dev/nvme2n1. Installed FW: R1410002
Success activating firmware action:1 slot:0, but firmware requires conventional reset
Update to /dev/nvme2n1 completed
Remaining NVMe: 2
[...]
####################################################################
#Updates completed. Changes will only be active after reboot#
####################################################################
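
The following is a heavily simplified, hypothetical sketch of such an update loop, not the actual Thomas-Krenn script. The target revision and the firmware file name are placeholders, and the variant detection (SE/ISE/TCG) is omitted:

#!/bin/bash
# Hypothetical sketch: update all NVMe controllers that do not yet run the target firmware.
TARGET_FW="R1410004"                 # placeholder target revision
FW_FILE="DCSN640_GR_R1410004.vpkg"   # placeholder firmware file

for ctrl in /dev/nvme[0-9]; do
    # Read the currently installed firmware revision from the controller
    fw=$(nvme id-ctrl "$ctrl" | awk '/^fr / {print $3}')
    if [ "$fw" = "$TARGET_FW" ]; then
        echo "Firmware of $ctrl is current: $fw"
        continue
    fi
    echo "Updating $ctrl (installed FW: $fw)"
    nvme fw-download "$ctrl" --fw="$FW_FILE"
    nvme fw-commit "$ctrl" -a 1      # new image is activated on the next reset
done
echo "Updates completed. Changes will only be active after a reboot."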

Recover SSD by formatting

If the described "Format Corrupt" state has occurred on an NVMe SSD, the drive can be made ready for use again by formatting the namespace.

Note: Formatting the namespace does not restore any data. The data must be restored afterwards, for example from a backup or by a rebuild.
nvme format /dev/nvmeXY

Example

root@pve:~# nvme format /dev/nvme0
Success formatting namespace:ffffffff

If the formatting is successful, no further I/O errors appear in the dmesg log after the namespaces have been rescanned. The NVMe SSD can then be used again and integrated back into the system.

[...]
[  110.099335] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 10 prio class 0
[  119.176576] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[  119.223403] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 17 prio class 0
[  128.716407] nvme nvme0: rescanning namespaces.
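
If the controller does not rescan the namespaces automatically after formatting, the rescan can also be triggered manually with nvme-cli. This is a short sketch; the device name is only an example:

# Rescan the namespaces of the controller after formatting
nvme ns-rescan /dev/nvme0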

Author: Florian Sebald

Related articles

Consumer versus Enterprise SSDs
Optimize Windows for SSDs
SSD Over-provisioning using hdparm