Hardware error from APEI Generic Hardware Error Source

From Thomas-Krenn-Wiki

When hardware errors occur, operating systems can use the ACPI Platform Error Interfaces (APEI) to record details about them in log files. This article uses a network card error as an example to show how the affected component can be located in Linux, starting from the message "Hardware error from APEI Generic Hardware Error Source".

Fundamentals and terminology

Chapter 18 of the ACPI specification describes error reporting via the ACPI Platform Error Interfaces (APEI) in detail.[1] Operating systems such as Linux, Windows or FreeBSD can use these interfaces to log information about hardware errors.[2]

Terms frequently used in this context are:

  • ACPI: Advanced Configuration and Power Interface
  • APEI: ACPI Platform Error Interfaces
  • OSPM: OS-directed configuration and power management

Example

The following log entry from /var/log/syslog on an Ubuntu 18.04 system shows an error with a network card:

[Do Mär 26 07:38:49 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]: event severity: corrected
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   section_type: PCIe error
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   port_type: 0, PCIe end point
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   version: 0.2
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   command: 0x0406, status: 0x0010
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   device_id: 0000:43:00.0
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   slot: 0
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   secondary_bus: 0x00
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x1563
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   class_code: 020000
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:  Error 1, type: corrected
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   section_type: PCIe error
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   port_type: 0, PCIe end point
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   version: 0.2
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   command: 0x0406, status: 0x0010
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   device_id: 0000:43:00.1
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   slot: 0
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   secondary_bus: 0x00
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x1563
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   class_code: 020000
[Do Mär 26 07:38:49 2020] {2}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.0: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.0: AER:    [12] Timeout              
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.1: AER:    [12] Timeout              
[Do Mär 26 07:38:49 2020] ixgbe 0000:43:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
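The aer_status value in the last lines is a bit mask, and the "[12] Timeout" line names the single bit that is set. The following is a minimal sketch of that decoding, assuming the value is the PCIe AER Correctable Error Status register (the event severity above is "corrected"). The function name decode_aer_status and the table of bit names are illustrative and not part of the kernel output.

```python
from __future__ import annotations

# Bit positions of the PCIe AER Correctable Error Status register.
# Assumption: the log value refers to this register because the
# event severity is "corrected". Other bits are treated as unknown.
AER_CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_aer_status(status: int, mask: int = 0) -> list[tuple[int, str, bool]]:
    """Return (bit, name, masked) for every bit that is set in status."""
    result = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_CORRECTABLE_BITS.get(bit, "unknown/reserved")
            result.append((bit, name, bool(mask & (1 << bit))))
    return result


if __name__ == "__main__":
    # Values taken from the log above: aer_status and aer_mask.
    for bit, name, masked in decode_aer_status(0x00001000, 0x00000000):
        print(f"[{bit}] {name}" + (" (masked)" if masked else ""))
    # prints "[12] Replay Timer Timeout"
```

Under this assumption, the value 0x00001000 has only bit 12 set, which agrees with the "[12] Timeout" line and the Data Link Layer reported by the kernel.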

The following lines of the log are the most relevant for locating the error:

  • Hardware error from APEI Generic Hardware Error Source: 514
  • section_type: PCIe error
  • device_id: 0000:43:00.0
  • vendor_id: 0x8086, device_id: 0x1563
  • ixgbe 0000:43:00.0 [...] aer_layer=Data Link Layer
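When such messages appear repeatedly or for several devices, it can help to extract these fields automatically. The sketch below shows one possible way to do this. It assumes the line layout seen in the example log above; the function name parse_hardware_errors and the dictionary keys are chosen for illustration only, and the layout may differ between kernel versions.

```python
import re

# Shortened copy of the example log (timestamps omitted).
SAMPLE = """\
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{2}[Hardware Error]: event severity: corrected
{2}[Hardware Error]:  Error 0, type: corrected
{2}[Hardware Error]:   section_type: PCIe error
{2}[Hardware Error]:   device_id: 0000:43:00.0
{2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x1563
{2}[Hardware Error]:  Error 1, type: corrected
{2}[Hardware Error]:   section_type: PCIe error
{2}[Hardware Error]:   device_id: 0000:43:00.1
{2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x1563
"""

PREFIX = "[Hardware Error]:"
ERROR_RE = re.compile(r"Error (\d+), type: (\w+)")
SECTION_RE = re.compile(r"section_type: (.+)")
# "device_id" appears twice per record: once as PCI address, once as numeric ID.
ADDRESS_RE = re.compile(
    r"device_id: ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)
IDS_RE = re.compile(r"vendor_id: 0x([0-9a-fA-F]{4}), device_id: 0x([0-9a-fA-F]{4})")


def parse_hardware_errors(text):
    """Collect one dict per 'Error N' record found in the log text."""
    records = []
    current = None
    for line in text.splitlines():
        if PREFIX not in line:
            continue  # e.g. the ixgbe AER lines
        body = line.split(PREFIX, 1)[1].strip()
        match = ERROR_RE.fullmatch(body)
        if match:
            current = {"index": int(match.group(1)), "type": match.group(2)}
            records.append(current)
            continue
        if current is None:
            continue  # header lines before the first record
        if (match := SECTION_RE.fullmatch(body)):
            current["section_type"] = match.group(1)
        elif (match := ADDRESS_RE.fullmatch(body)):
            current["address"] = match.group(1).lower()
        elif (match := IDS_RE.fullmatch(body)):
            current["vendor"] = match.group(1).lower()
            current["device"] = match.group(2).lower()
    return records


if __name__ == "__main__":
    for record in parse_hardware_errors(SAMPLE):
        print(record["address"], f'{record["vendor"]}:{record["device"]}')
```

The vendor:device pair produced this way has the same form as the bracketed ID that lspci -nn prints, so it can be used directly as a search term.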

Searching the output of lspci -nn for the device ID 1563 shows that the affected device in this example is a dual-port X550 network card:

lspci -nn | grep 1563
43:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10G X550T [8086:1563] (rev 01)
43:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller 10G X550T [8086:1563] (rev 01)
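The comparison between the log fields and the lspci -nn output can also be done programmatically. The following sketch parses one lspci -nn line and checks it against the values from the log. It assumes the line format shown above and that a missing PCI domain means domain 0000; the names parse_lspci_nn and matches_log are hypothetical helpers, not part of any existing tool.

```python
import re

# Matches lines such as:
# 43:00.0 Ethernet controller [0200]: Intel ... X550T [8086:1563] (rev 01)
LSPCI_NN_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-f]{4}):)?"
    r"(?P<bus>[0-9a-f]{2}):(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-7]) "
    r"(?P<class_name>.+?) \[(?P<class_code>[0-9a-f]{4})\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_lspci_nn(line: str) -> dict:
    """Split one line of `lspci -nn` output into its fields."""
    match = LSPCI_NN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected lspci -nn line: {line!r}")
    info = match.groupdict()
    # Assumption: lspci omitted the domain, so it is 0000.
    domain = info.pop("domain") or "0000"
    info["address"] = f"{domain}:{info['bus']}:{info['dev']}.{info['fn']}"
    return info


def matches_log(info: dict, address: str, vendor_id: int,
                device_id: int, class_code: str) -> bool:
    """Check whether an lspci entry is the device named in the log."""
    return (
        info["address"] == address.lower()
        and info["vendor"] == f"{vendor_id:04x}"
        and info["device"] == f"{device_id:04x}"
        # The log prints class, subclass and programming interface
        # (020000); lspci -nn prints class and subclass only (0200).
        and class_code.lower().startswith(info["class_code"])
    )


if __name__ == "__main__":
    line = ("43:00.0 Ethernet controller [0200]: Intel Corporation "
            "Ethernet Controller 10G X550T [8086:1563] (rev 01)")
    info = parse_lspci_nn(line)
    print(info["address"], info["name"])
    print(matches_log(info, "0000:43:00.0", 0x8086, 0x1563, "020000"))
```

The full address built here (for example 0000:43:00.0) is also the directory name the kernel uses for the device under /sys/bus/pci/devices/.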

Possible causes

Three causes are possible in this example:

  1. A poor plug connection between the network card and the slot.
  2. A defect in the dual-port X550 network card itself.
  3. A defect in the mainboard.

We recommend the following troubleshooting steps in such cases:

  1. Remove and reinsert the affected expansion card.
  2. Replace the affected expansion card (the network card in this example).
  3. Replace the mainboard.

More information

References

  1. Advanced Configuration and Power Interface (ACPI) Specification Version 6.3 (uefi.org, 01/2019), Chapter 18 ACPI Platform Error Interfaces (APEI) (page 834 ff.)
  2. APEI Generic Hardware Error Source support (github.com/torvalds/linux)


Author: Werner Fischer

Werner Fischer works in the Knowledge Transfer team at Thomas-Krenn. He completed his studies of Computer and Media Security at FH Hagenberg in Austria. He is a regular speaker at conferences such as LinuxTag, OSMC, OSDC and LinuxCon, and an author for various IT magazines. In his spare time he enjoys playing the piano and training for a good result at the annual Linz marathon relay.


Translator: Alina Ranzinger

Alina has been working at Thomas-Krenn.AG since 2024. After training as a multilingual business assistant, she joined Product Management as an assistant, where she is responsible for translating texts and for organising the department.


Related articles

Boot Error Record Table (BERT)
Execution Unit Scheduler Contention Side-Channel-Vulnerabilities on AMD processors
Size specification / form factor of additional cards