VMware ESXi Hardware with Nagios or Icinga Monitoring

From Thomas-Krenn-Wiki
Health Status shown in vSphere Client.
Icinga warning after the failure of an HDD in a server with an X9SCM-F motherboard.

VMware vSphere 5.5, vSphere 5.1 and vSphere 5.0 offer integrated monitoring of the installed server hardware. VMware checks the status of these components through built-in checks (e.g. for IPMI sensors) and through corresponding CIM providers, for example for hardware RAID controllers.

The check_esxi_hardware.py plugin allows for easy monitoring of these hardware components with Nagios or Icinga.

CIM Provider Requirements

The CIM provider has to supply the hardware status information to the ESXi host. In this case, for example, a CIM provider for the LSI RAID controller is used:

Note: The CIM provider is not compatible with the Adaptec RAID controller (see Adaptec RAID Controller in VMware monitoring - Installation CIM Provider and aacraid driver)

Plugin

The plugin is available for download at the following address:

Information on exchange.nagios.org:

Using the Plugin

The plugin was tested with Thomas Krenn X8DT3 servers. ESXi 5.1 was installed on this server with an integrated LSI CIM provider (provided by Thomas Krenn in the Download area). This allows the status of the LSI 9260-4i RAID controller to be monitored.

Python and the pywbem library must be installed in order to use the plugin. On Debian/Ubuntu the library can be installed using

apt-get install python-pywbem

Then the plugin can be tested on the command line.

The most important parameters of the plugin are:

  • -H ... IP Address of the VMware ESXi server
  • -U ... Username or path to the username-password file (file:/path/to/.file)
  • -P ... Password or path to the password file (file:/path/to/.file)
  • -v ... verbose, shows all the sensors that are queried

For testing purposes, we use the root user of the ESXi server. In a production environment, a separate user that only has permission to read the sensors should be created on the vCenter server.

The plugin can be invoked as follows:

python check_esxi_hardware.py -H 10.X.X.X -U root -P password
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)  WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) - \
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
echo $?
1

In this case the plugin returns a warning (exit code 1), because the battery backup unit (BBU) of the RAID controller is not fully charged.
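
The exit code shown by `echo $?` follows the usual Nagios plugin convention, in which the code carries the state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a minimal sketch, assuming the plugin sits in the current directory, a small Python wrapper could build the same command line and translate the exit code. The names `build_command`, `state_name` and `run_check` are illustrative and not part of the plugin.

```python
import subprocess

# Standard Nagios plugin exit codes and their state names.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_command(host, user, password, verbose=False,
                  plugin="check_esxi_hardware.py"):
    """Return the argument list for the plugin call shown above."""
    cmd = ["python", plugin, "-H", host, "-U", user, "-P", password]
    if verbose:
        cmd.append("-v")
    return cmd


def state_name(exit_code):
    """Map an exit code to a state name; unexpected codes count as UNKNOWN."""
    return NAGIOS_STATES.get(exit_code, "UNKNOWN")


def run_check(host, user, password):
    """Run the plugin and return (state name, plugin output)."""
    result = subprocess.run(build_command(host, user, password),
                            capture_output=True, text=True)
    return state_name(result.returncode), result.stdout.strip()
```

Calling `run_check("10.X.X.X", "root", "password")` with real values in place of the placeholders would return the state name together with the plugin's status line.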

The plugin's website recommends specifying the password in a file. That way, the password does not appear in the process list while the check is running. There are two variants:

  • Enter only the password in the file
    • python check_esxi_hardware.py -H 10.X.X.X -U root -P file:/path/to/.file
  • Enter the username and password, separated by spaces, in the file
    • python check_esxi_hardware.py -H 10.X.X.X -U file:/path/to/.file -P file:/path/to/.file
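
Because such a file holds credentials in plain text, it makes sense to restrict its permissions. The following sketch writes either variant with mode 0600. The helper name `write_credentials_file` and the single-space separator are assumptions for illustration; the plugin's documentation is the reference for the exact format it expects.

```python
import os


def write_credentials_file(path, password, username=None):
    """Create a file accessible only by its owner that holds the password
    or, if a username is given, "username password" on one line."""
    content = password if username is None else f"{username} {password}"
    # O_EXCL refuses to overwrite an existing file; 0o600 = owner read/write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        # No trailing newline is written, in case the file is read verbatim.
        handle.write(content)
```

The resulting path can then be passed to the plugin in the file:/path/to/.file form shown in the list above.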

The "-v" option is also useful: it displays all the queried sensors and their status codes.

python check_esxi_hardware.py -H 10.1.102.143 -U tkmon -P relation -v
20130430 09:29:33 Connection to https://10.1.102.143
20130430 09:29:33 Check classe OMC_SMASHFirmwareIdentity
20130430 09:29:33   Element Name = System BIOS
20130430 09:29:33     VersionString = 2.0a
20130430 09:29:33 Check classe CIM_Chassis
20130430 09:29:33   Element Name = Chassis
20130430 09:29:33     Manufacturer = Supermicro
20130430 09:29:33     SerialNumber = 1234567890
20130430 09:29:33     Model = X8DT3
20130430 09:29:33     Element Op Status = 0
20130430 09:29:33 Check classe CIM_Card
20130430 09:29:34   Element Name = Motherboard
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34 Check classe CIM_ComputerSystem
20130430 09:29:34   Element Name = System Board 7:1
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34   Element Name = localhost
20130430 09:29:34   Element Name = Hardware Management Controller (Node 0)
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34   Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:34     Element Op Status = 3
20130430 09:29:34 GLobal exit set to WARNING
20130430 09:29:34 Check classe CIM_NumericSensor
20130430 09:29:35   Element Name = Memory Device 12 P2-DIMM3B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 42.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 11 P2-DIMM3A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 45.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 10 P2-DIMM2B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 9 P2-DIMM2A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 41.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 8 P2-DIMM1B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 7 P2-DIMM1A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 8 Fan8
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1890.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 7 Fan7
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1890.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 5 Fan5
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 945.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 2 Fan2
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1080.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 1 Fan1
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 945.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 VBAT
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.240000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +12V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 12.080000
20130430 09:29:35     Lower Threshold Non Critical = 10.700000
20130430 09:29:35     Upper Threshold Non Critical = 13.250000
20130430 09:29:35     Lower Threshold Critical = 10.650000
20130430 09:29:35     Upper Threshold Critical = 13.300000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +5V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 5.020000
20130430 09:29:35     Lower Threshold Non Critical = 4.480000
20130430 09:29:35     Upper Threshold Non Critical = 5.530000
20130430 09:29:35     Lower Threshold Critical = 4.440000
20130430 09:29:35     Upper Threshold Critical = 5.560000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +3.3VSB
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.240000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +3.3V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.280000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +1.5V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.520000
20130430 09:29:35     Lower Threshold Non Critical = 1.330000
20130430 09:29:35     Upper Threshold Non Critical = 1.650000
20130430 09:29:35     Lower Threshold Critical = 1.320000
20130430 09:29:35     Upper Threshold Critical = 1.660000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 CPU2 DIMM
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.580000
20130430 09:29:35     Lower Threshold Non Critical = 1.190000
20130430 09:29:35     Upper Threshold Non Critical = 1.640000
20130430 09:29:35     Lower Threshold Critical = 1.190000
20130430 09:29:35     Upper Threshold Critical = 1.650000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 CPU2 Vcore
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.040000
20130430 09:29:35     Lower Threshold Non Critical = 0.820000
20130430 09:29:35     Upper Threshold Non Critical = 1.350000
20130430 09:29:35     Lower Threshold Critical = 0.810000
20130430 09:29:35     Upper Threshold Critical = 1.360000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 System Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 36.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 77.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35 Check classe CIM_Memory
20130430 09:29:35   Element Name = CPU 2 Level-1 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = CPU 2 Level-2 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = CPU 2 Level-3 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = Memory
20130430 09:29:35 Check classe CIM_Processor
20130430 09:29:36   Element Name = CPU 2
20130430 09:29:36     Family = 179
20130430 09:29:36     CurrentClockSpeed = 1866MHz
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36 Check classe CIM_RecordLog
20130430 09:29:36 Check classe OMC_DiscreteSensor
20130430 09:29:36   Element Name = Power Supply 1 PS Status: Failure status
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36   Element Name = System Chassis 1 Intrusion: General Chassis intrusion
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36   Element Name = Processor 2 CPU2 Temp
20130430 09:29:36 Check classe OMC_Fan
20130430 09:29:37   Element Name = Fan8
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan7
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan5
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan2
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan1
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37 Check classe OMC_PowerSupply
20130430 09:29:37   Element Name = Power Supply 1
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37 Check classe VMware_StorageExtent
20130430 09:29:38   Element Name = Drive 252_5 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38     Element Op Status = 2
20130430 09:29:38   Element Name = Drive 252_4 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38     Element Op Status = 2
20130430 09:29:38 Check classe VMware_Controller
20130430 09:29:38   Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:38     Element Op Status = 3
20130430 09:29:38 GLobal exit set to WARNING
20130430 09:29:38 Check classe VMware_StorageVolume
20130430 09:29:39   Element Name = RAID 1 StorageVolume Logical Volume 500605B00418BB20_0 on controller 500605B00418BB20, Drives( - OPTIMAL
20130430 09:29:39     Element Op Status = 2
20130430 09:29:39 Check classe VMware_Battery
20130430 09:29:39   Element Name = Battery 934 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 11
20130430 09:29:39 Check classe VMware_SASSATAPort
20130430 09:29:39   Element Name = Port 0 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 2
20130430 09:29:39   Element Name = Port 1 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 2
 WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)  WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) -\
 Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
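
The verbose output is regular enough to post-process. As a rough sketch, the following Python function pairs each "Element Name" line with the "Element Op Status" line that follows it and returns the elements whose status is neither 0 nor 2, the two values reported by the inconspicuous components in the listing above. The function name and this filter are assumptions for illustration, not the plugin's own logic.

```python
import re

NAME_RE = re.compile(r"Element Name = (.+)")
STATUS_RE = re.compile(r"Element Op Status = (\d+)")


def unusual_elements(verbose_output, normal=(0, 2)):
    """Return (element name, op status) pairs with a status outside `normal`."""
    found = []
    current = None
    for line in verbose_output.splitlines():
        name = NAME_RE.search(line)
        if name:
            # Remember the element until its status line (if any) appears.
            current = name.group(1).strip()
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            code = int(status.group(1))
            if code not in normal:
                found.append((current, code))
            current = None
    return found
```

Applied to the listing above, this would report the controller (status 3) once for CIM_ComputerSystem and once for VMware_Controller, plus the battery (status 11).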

Alternate workaround for limited user rights without vCenter

If no vCenter server is present, as in this example, a new user (e.g. "tkmon") with read-only rights is created instead. Because ESXi 5.1 no longer supports local groups, the user must be added to the root group via SSH.

/etc/group

root:x:0:root,tkmon

Subsequently, SSH access can be blocked for this user by setting the user's shell to /sbin/nologin.

/etc/passwd

tkmon:x:1000:1000:ESXi User:/:/sbin/nologin

In our tests, this approach allowed the sensors to be read, while SSH login was not possible and the vSphere Client was available in read-only mode only.

We must, however, emphasize that this approach is not supported by VMware or Thomas Krenn.
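
To double-check the /etc/passwd edit above, the seven colon-separated fields of the entry can be inspected. A small sketch, with a hypothetical helper name, that tests whether an entry uses /sbin/nologin as its shell:

```python
def has_nologin_shell(passwd_line, nologin="/sbin/nologin"):
    """Return True if the last field (the login shell) of a passwd entry
    equals the nologin path."""
    fields = passwd_line.strip().split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields")
    return fields[6] == nologin
```

For the entry shown above, `has_nologin_shell("tkmon:x:1000:1000:ESXi User:/:/sbin/nologin")` evaluates to True.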

Integration into Icinga

A Nagios command (commands.cfg) can be defined in several ways. The simplest form on Debian/Ubuntu is:

# 'check_esxi_hardware' command definition
define command{
command_name check_esxi_hardware
command_line /usr/lib/nagios/plugins/check_esxi_hardware.py -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$
}
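
In this definition, Nagios or Icinga replaces $HOSTADDRESS$ with the host's address, and $ARG1$ and $ARG2$ with the values that follow the command name in a service's check_command, separated by exclamation marks. The Python sketch below imitates that substitution in a simplified way to show the resulting command line. It is not how Nagios itself is implemented, and the function name and example values are hypothetical.

```python
COMMAND_LINE = ("/usr/lib/nagios/plugins/check_esxi_hardware.py "
                "-H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$")


def expand_command(command_line, host_address, check_command):
    """Expand $HOSTADDRESS$ and $ARGn$ for a check_command such as
    "check_esxi_hardware!root!file:/path/to/.file"."""
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded


print(expand_command(COMMAND_LINE, "10.X.X.X",
                     "check_esxi_hardware!root!file:/path/to/.file"))
```

The printed line corresponds to the manual invocation used earlier, with the plugin path from the command definition.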

Other versions can be found here.

Credit

Thanks to Sascha Peters for this valuable tip!



Author: Christoph Mitasch

Christoph Mitasch works in the Web Operations & Knowledge Transfer team at Thomas-Krenn. He is responsible for the maintenance and further development of the webshop infrastructure. After an internship at IBM Linz, he completed his diploma studies in "Computer- and Media-Security" at FH Hagenberg. He lives near Linz and, besides working, is an enthusiastic marathon runner and juggler who holds various world records.

