Monitoring VMware ESXi Hardware with Nagios or Icinga
VMware vSphere 5.5, vSphere 5.1 and vSphere 5.0 offer integrated monitoring of the server hardware. VMware checks the status of these components through built-in checks (e.g. for IPMI sensors) and through corresponding CIM providers, such as those for hardware RAID controllers.
The check_esxi_hardware.py plugin makes it easy to monitor these hardware components with Nagios or Icinga.
CIM Provider Requirements
The CIM provider has to supply the hardware status information to the ESXi host. In this example, a CIM provider for the LSI RAID controller is used.
Note: The CIM provider is not compatible with the Adaptec RAID controller (see Adaptec RAID Controller in VMware monitoring - Installation CIM Provider and aacraid driver)
Plugin
The plugin is available for download at the following address:
Information on exchange.nagios.org:
Using the Plugin
The plugin was tested on a Thomas Krenn X8DT3 server. ESXi 5.1 with an integrated LSI CIM provider (provided by Thomas Krenn in the download area) was installed on this server, which allows the status of the LSI 9260-4i RAID controller to be monitored.
Python and the pywbem library must be installed in order to use the plugin. On Debian/Ubuntu the library can be installed using
apt-get install python-pywbem
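Before running the plugin, it can help to confirm that the library is visible to the Python interpreter that will execute the check. The following is only a minimal sketch: it assumes that the python-pywbem package installs a module named pywbem, and the helper name module_available is made up for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if the named top-level module can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


# "pywbem" is assumed to be the module name provided by python-pywbem.
if module_available("pywbem"):
    print("pywbem found")
else:
    print("pywbem missing, install python-pywbem first")
```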
Then the plugin can be tested on the command line.
The most important parameters of the plugin are:
- -H ... IP Address of the VMware ESXi server
- -U ... Username or path to the username-password file (file:/path/to/.file)
- -P ... Password or path to the password file (file:/path/to/.file)
- -v ... verbose, shows all the sensors that are queried
For testing purposes, we use the root user of the ESXi server. In a production environment, a separate user that only has permission to read the sensors should be created on the vCenter server.
The plugin can be invoked as follows:
python check_esxi_hardware.py -H 10.X.X.X -U root -P password
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) - \
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
echo $?
1
In this case a warning is received, because the RAID controller battery backup unit (BBU) is not fully charged.
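The exit status of 1 shown by echo $? follows the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a rough sketch, a small wrapper could build the call from the options listed above and translate the exit status into a state name. The function names build_command, state_name and run_check are hypothetical, and the plugin is assumed to be in the current directory.

```python
import subprocess
import sys

# Conventional Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_command(host, user, password, verbose=False,
                  plugin="check_esxi_hardware.py"):
    """Assemble the argument list using only the -H, -U, -P and -v options."""
    cmd = [sys.executable, plugin, "-H", host, "-U", user, "-P", password]
    if verbose:
        cmd.append("-v")
    return cmd


def state_name(exit_code):
    """Translate an exit code into a state name; unexpected codes map to UNKNOWN."""
    return NAGIOS_STATES.get(exit_code, "UNKNOWN")


def run_check(host, user, password):
    """Run the plugin once and return (state name, first line of its output)."""
    result = subprocess.run(build_command(host, user, password),
                            capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    return state_name(result.returncode), (lines[0] if lines else "")
```

Calling run_check("10.X.X.X", "root", "password") on the test system described here would be expected to return a state of WARNING together with the controller message.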
The plugin's website recommends specifying the password in a file. That way the password does not appear in the process list while the check is running. There are two variants:
- Enter only the password in the file
- python check_esxi_hardware.py -H 10.X.X.X -U root -P file:/path/to/.file
- Enter the username and password, separated by a space, in the file
- python check_esxi_hardware.py -H 10.X.X.X -U file:/path/to/.file -P file:/path/to/.file
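Keeping the password out of the process list only helps if the file itself is not readable by other users. The sketch below is an assumption about sensible handling, not something taken from the plugin: it creates the file with owner-only permissions and builds the file: argument. The helper names write_secret_file and file_argument are hypothetical, and no trailing newline is written because it is not stated here how the plugin treats one.

```python
import os


def write_secret_file(path, content):
    """Create the file with mode 0600 so that only the owner can read it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # Enforce the mode even if the file already existed with looser bits.
    os.chmod(path, 0o600)


def file_argument(path):
    """Build the file:/path/to/.file form accepted by -U and -P."""
    return "file:" + path
```

For the second variant, the content would be the username and password separated by a space, and the same file_argument value would be passed to both -U and -P.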
The "-v" option is also useful. It displays all the queried sensors and their status codes.
python check_esxi_hardware.py -H 10.1.102.143 -U tkmon -P relation -v
20130430 09:29:33 Connection to https://10.1.102.143
20130430 09:29:33 Check classe OMC_SMASHFirmwareIdentity
20130430 09:29:33 Element Name = System BIOS
20130430 09:29:33 VersionString = 2.0a
20130430 09:29:33 Check classe CIM_Chassis
20130430 09:29:33 Element Name = Chassis
20130430 09:29:33 Manufacturer = Supermicro
20130430 09:29:33 SerialNumber = 1234567890
20130430 09:29:33 Model = X8DT3
20130430 09:29:33 Element Op Status = 0
20130430 09:29:33 Check classe CIM_Card
20130430 09:29:34 Element Name = Motherboard
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Check classe CIM_ComputerSystem
20130430 09:29:34 Element Name = System Board 7:1
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Element Name = localhost
20130430 09:29:34 Element Name = Hardware Management Controller (Node 0)
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:34 Element Op Status = 3
20130430 09:29:34 GLobal exit set to WARNING
20130430 09:29:34 Check classe CIM_NumericSensor
20130430 09:29:35 Element Name = Memory Device 12 P2-DIMM3B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 42.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 11 P2-DIMM3A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 45.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 10 P2-DIMM2B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 9 P2-DIMM2A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 41.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 8 P2-DIMM1B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 7 P2-DIMM1A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 8 Fan8
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1890.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 7 Fan7
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1890.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 5 Fan5
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 945.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 2 Fan2
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1080.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 1 Fan1
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 945.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 VBAT
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.240000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +12V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 12.080000
20130430 09:29:35 Lower Threshold Non Critical = 10.700000
20130430 09:29:35 Upper Threshold Non Critical = 13.250000
20130430 09:29:35 Lower Threshold Critical = 10.650000
20130430 09:29:35 Upper Threshold Critical = 13.300000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +5V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 5.020000
20130430 09:29:35 Lower Threshold Non Critical = 4.480000
20130430 09:29:35 Upper Threshold Non Critical = 5.530000
20130430 09:29:35 Lower Threshold Critical = 4.440000
20130430 09:29:35 Upper Threshold Critical = 5.560000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +3.3VSB
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.240000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +3.3V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.280000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +1.5V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.520000
20130430 09:29:35 Lower Threshold Non Critical = 1.330000
20130430 09:29:35 Upper Threshold Non Critical = 1.650000
20130430 09:29:35 Lower Threshold Critical = 1.320000
20130430 09:29:35 Upper Threshold Critical = 1.660000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 CPU2 DIMM
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.580000
20130430 09:29:35 Lower Threshold Non Critical = 1.190000
20130430 09:29:35 Upper Threshold Non Critical = 1.640000
20130430 09:29:35 Lower Threshold Critical = 1.190000
20130430 09:29:35 Upper Threshold Critical = 1.650000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 CPU2 Vcore
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.040000
20130430 09:29:35 Lower Threshold Non Critical = 0.820000
20130430 09:29:35 Upper Threshold Non Critical = 1.350000
20130430 09:29:35 Lower Threshold Critical = 0.810000
20130430 09:29:35 Upper Threshold Critical = 1.360000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 System Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 36.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 77.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Check classe CIM_Memory
20130430 09:29:35 Element Name = CPU 2 Level-1 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = CPU 2 Level-2 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = CPU 2 Level-3 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = Memory
20130430 09:29:35 Check classe CIM_Processor
20130430 09:29:36 Element Name = CPU 2
20130430 09:29:36 Family = 179
20130430 09:29:36 CurrentClockSpeed = 1866MHz
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Check classe CIM_RecordLog
20130430 09:29:36 Check classe OMC_DiscreteSensor
20130430 09:29:36 Element Name = Power Supply 1 PS Status: Failure status
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Element Name = System Chassis 1 Intrusion: General Chassis intrusion
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Element Name = Processor 2 CPU2 Temp
20130430 09:29:36 Check classe OMC_Fan
20130430 09:29:37 Element Name = Fan8
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan7
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan5
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan2
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan1
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Check classe OMC_PowerSupply
20130430 09:29:37 Element Name = Power Supply 1
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Check classe VMware_StorageExtent
20130430 09:29:38 Element Name = Drive 252_5 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38 Element Op Status = 2
20130430 09:29:38 Element Name = Drive 252_4 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38 Element Op Status = 2
20130430 09:29:38 Check classe VMware_Controller
20130430 09:29:38 Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:38 Element Op Status = 3
20130430 09:29:38 GLobal exit set to WARNING
20130430 09:29:38 Check classe VMware_StorageVolume
20130430 09:29:39 Element Name = RAID 1 StorageVolume Logical Volume 500605B00418BB20_0 on controller 500605B00418BB20, Drives( - OPTIMAL
20130430 09:29:39 Element Op Status = 2
20130430 09:29:39 Check classe VMware_Battery
20130430 09:29:39 Element Name = Battery 934 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 11
20130430 09:29:39 Check classe VMware_SASSATAPort
20130430 09:29:39 Element Name = Port 0 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 2
20130430 09:29:39 Element Name = Port 1 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 2
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) -\
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
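The verbose output is long, so it can be convenient to reduce it to a list of elements and their status codes. The sketch below is not part of the plugin: it assumes the timestamp and field layout shown in the output above, the function name parse_verbose is made up, and the OP_STATUS table is only an excerpt of the DMTF CIM OperationalStatus value map. Which codes the plugin itself treats as warning or critical is not reproduced here.

```python
import re

# Splitting on the "YYYYMMDD HH:MM:SS " prefix works whether the log has one
# entry per line or has been pasted as one long line.
TIMESTAMP = re.compile(r"\b\d{8} \d{2}:\d{2}:\d{2}\s+")
STATUS = re.compile(r"Element Op Status = (\d+)")

# Excerpt of the DMTF CIM OperationalStatus values (not the full list).
OP_STATUS = {0: "Unknown", 2: "OK", 3: "Degraded", 6: "Error", 11: "In Service"}


def parse_verbose(text):
    """Return a list of (class name, element name, op status or None)."""
    elements = []
    current_class = None
    pending = None  # element name still waiting for its status entry

    def flush(status=None):
        nonlocal pending
        if pending is not None:
            elements.append((current_class, pending, status))
            pending = None

    for entry in TIMESTAMP.split(text):
        entry = entry.strip()
        status_match = STATUS.match(entry)
        if entry.startswith("Check classe "):
            flush()  # previous element had no status entry
            current_class = entry[len("Check classe "):]
        elif entry.startswith("Element Name = "):
            flush()
            pending = entry[len("Element Name = "):]
        elif status_match:
            flush(int(status_match.group(1)))
    flush()
    return elements


def with_status(elements, status):
    """Select the names of all elements that reported the given status code."""
    return [name for _cls, name, code in elements if code == status]
```

Applied to the output above, with_status(parse_verbose(output), 3) would list the LSI controller twice, once for each class in which it is reported with status 3.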
Alternate workaround for limited user rights without vCenter
If no vCenter server is present, as in this setup, a new user (e.g. "tkmon") with read-only rights is created. Because ESXi 5.1 no longer supports local groups, the user must be added to the root group via SSH.
/etc/group
root:x:0:root,tkmon
SSH access can then be blocked for this user by setting the user's shell to /sbin/nologin.
/etc/passwd
tkmon:x:1000:1000:ESXi User:/:/sbin/nologin
In our tests with this approach, the sensors could be read, an SSH login was not possible and the vSphere Client was available in read-only mode only.
We must, however, emphasize that this approach cannot be supported by VMware or Thomas Krenn.
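After editing the two files by hand, a quick sanity check of the changed lines can catch typos. The following sketch only parses lines in the standard /etc/group and /etc/passwd formats shown above; the helper names in_group and login_shell are hypothetical.

```python
def in_group(group_line, user):
    """True if user is listed as a member in an /etc/group style line."""
    fields = group_line.strip().split(":")
    # Fourth field holds the comma-separated member list; it may be empty.
    members = fields[3].split(",") if len(fields) > 3 and fields[3] else []
    return user in members


def login_shell(passwd_line):
    """Return the last field (the login shell) of an /etc/passwd style line."""
    return passwd_line.strip().split(":")[-1]


group_entry = "root:x:0:root,tkmon"
passwd_entry = "tkmon:x:1000:1000:ESXi User:/:/sbin/nologin"
print(in_group(group_entry, "tkmon"), login_shell(passwd_entry))
```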
Integration into Icinga
A Nagios command (commands.cfg) can be defined in several ways. The simplest form on Debian/Ubuntu is:
# 'check_esxi_hardware' command definition
define command{
        command_name    check_esxi_hardware
        command_line    /usr/lib/nagios/plugins/check_esxi_hardware.py -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$
}
Other variants can be found here.
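To see what the command definition above turns into at run time, it helps to recall that Nagios and Icinga pass arguments in a service's check_command separated by "!" and substitute them for $ARG1$, $ARG2$ and so on. The sketch below imitates that substitution for simple cases only; the function name expand_command and the example address are hypothetical, and real macro processing handles more than is shown here.

```python
def expand_command(command_line, host_address, check_command):
    """Substitute $HOSTADDRESS$ and $ARGn$ in a command_line template."""
    # "name!arg1!arg2" is the form used in a service's check_command.
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace("$ARG%d$" % index, value)
    return name, expanded


template = ("/usr/lib/nagios/plugins/check_esxi_hardware.py "
            "-H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$")
print(expand_command(template, "10.0.0.5",
                     "check_esxi_hardware!root!file:/path/to/.file"))
```

Passing the file: form as the second argument keeps the password out of the service definition as well as out of the process list.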
Credit
Thanks to Sascha Peters for this valuable tip!
Author: Christoph Mitasch Christoph Mitasch works in the Web Operations & Knowledge Transfer team at Thomas-Krenn. He is responsible for the maintenance and further development of the webshop infrastructure. After an internship at IBM Linz, he completed his diploma studies in "Computer- and Media-Security" at FH Hagenberg. He lives near Linz and, besides his work, is an enthusiastic marathon runner and juggler who holds various world records.