Monitoring VMware ESXi hardware with Nagios or Icinga
VMware vSphere 5 provides integrated monitoring of the server hardware through the appropriate CIM providers. The check_esxi_hardware.py plugin makes it easy to monitor these hardware components with Nagios or Icinga.
CIM provider requirements
The CIM provider must pass the hardware status information on to ESXi. This is the case, for example, with the CIM provider for LSI RAID controllers:
Note: The CIM provider for Adaptec RAID controllers is not suitable for this purpose (see Adaptec RAID Controller in VMware überwachen - Installation CIM Provider und aacraid Treiber).
Plugin
The plugin is available for download from the following website:
Information on exchange.nagios.org:
Using the plugin
The plugin was tested on a Thomas Krenn server with an X8DT3 mainboard. ESXi 5.1 with an integrated LSI CIM provider was installed on this server (Thomas Krenn provides this image in its download area). As a result, the status of the LSI 9260-4i RAID controller can be monitored as well.
To use the plugin, Python and the pywbem library must be installed. On Debian/Ubuntu, the library can be installed with:
apt-get install python-pywbem
The plugin can then be tested on the command line.
The plugin supports the following parameters, among others:
- -H ... IP address of the VMware ESXi server
- -U ... username, or path to the username/password file (file:/path/to/.file)
- -P ... password, or path to the password file (file:/path/to/.file)
- -v ... verbose, shows all sensors that are queried
For testing, we use the root user of the ESXi server. In a production environment, a dedicated user that is only permitted to read the sensors should be created on the vCenter Server.
The plugin can be called as follows:
python check_esxi_hardware.py -H 10.X.X.X -U root -P password
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) - \
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
echo $?
1
In this case a warning is returned because the RAID controller battery (BBU) is not yet fully charged.
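The exit code printed by echo $? follows the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The following is a minimal Python sketch of how a wrapper script might run the plugin and translate that code. The function names status_name and run_check are hypothetical, and the sketch assumes that check_esxi_hardware.py lies in the current directory and that a python interpreter is on the PATH.

```python
import subprocess

# Conventional Nagios/Icinga plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def status_name(exit_code):
    """Translate a plugin exit code into its Nagios state name."""
    # Anything outside 0..3 is treated as UNKNOWN in this sketch.
    return NAGIOS_STATES.get(exit_code, "UNKNOWN")


def run_check(host, user, password, plugin="check_esxi_hardware.py"):
    """Run the plugin once and return (state name, plugin output).

    The plugin path is an assumption; adjust it to the local installation.
    """
    result = subprocess.run(
        ["python", plugin, "-H", host, "-U", user, "-P", password],
        capture_output=True,
        text=True,
    )
    return status_name(result.returncode), result.stdout.strip()
```

For the example above, status_name(1) yields "WARNING". A call such as run_check("10.X.X.X", "root", "file:/path/to/.file") would pass the same options as the command line shown earlier.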
The plugin's website recommends supplying the password in a file. That way the password does not appear in the process list while the check is running. There are two variants:
- Put only the password in the file
  - python check_esxi_hardware.py -H 10.X.X.X -U root -P file:/path/to/.file
- Put the username and password in the file, separated by a space
  - python check_esxi_hardware.py -H 10.X.X.X -U file:/path/to/.file -P file:/path/to/.file
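A credentials file only helps if other local users cannot read it. The following Python sketch shows one way to create such a file with owner-only permissions on a POSIX system. The function name write_credentials_file is hypothetical, and the file contents follow the two variants described above (password only, or username and password separated by a space); how the plugin treats any further whitespace is not covered here.

```python
import os


def write_credentials_file(path, password, username=None):
    """Write 'password' or 'username password' to path, readable by the owner only."""
    content = password if username is None else f"{username} {password}"
    # O_EXCL refuses to overwrite an existing file;
    # mode 0o600 grants read/write access to the owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    return path
```

The resulting path can then be passed to the plugin as file:/path/to/.file, as in the calls above.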
The "-v" option is also useful. It displays all queried sensors together with their status codes.
python check_esxi_hardware.py -H 10.1.102.143 -U tkmon -P relation -v
20130430 09:29:33 Connection to https://10.1.102.143
20130430 09:29:33 Check classe OMC_SMASHFirmwareIdentity
20130430 09:29:33 Element Name = System BIOS
20130430 09:29:33 VersionString = 2.0a
20130430 09:29:33 Check classe CIM_Chassis
20130430 09:29:33 Element Name = Chassis
20130430 09:29:33 Manufacturer = Supermicro
20130430 09:29:33 SerialNumber = 1234567890
20130430 09:29:33 Model = X8DT3
20130430 09:29:33 Element Op Status = 0
20130430 09:29:33 Check classe CIM_Card
20130430 09:29:34 Element Name = Motherboard
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Check classe CIM_ComputerSystem
20130430 09:29:34 Element Name = System Board 7:1
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Element Name = localhost
20130430 09:29:34 Element Name = Hardware Management Controller (Node 0)
20130430 09:29:34 Element Op Status = 0
20130430 09:29:34 Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:34 Element Op Status = 3
20130430 09:29:34 GLobal exit set to WARNING
20130430 09:29:34 Check classe CIM_NumericSensor
20130430 09:29:35 Element Name = Memory Device 12 P2-DIMM3B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 42.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 11 P2-DIMM3A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 45.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 10 P2-DIMM2B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 9 P2-DIMM2A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 41.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 8 P2-DIMM1B Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Memory Device 7 P2-DIMM1A Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 39.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 80.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 8 Fan8
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1890.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 7 Fan7
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1890.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 5 Fan5
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 945.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 2 Fan2
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1080.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = Fan Device 1 Fan1
20130430 09:29:35 sensorType = 5 - Tachometer
20130430 09:29:35 BaseUnits = 19
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 945.000000
20130430 09:29:35 Lower Threshold Non Critical = 675.000000
20130430 09:29:35 Upper Threshold Non Critical = 34155.000000
20130430 09:29:35 Lower Threshold Critical = 540.000000
20130430 09:29:35 Upper Threshold Critical = 34290.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 VBAT
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.240000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +12V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 12.080000
20130430 09:29:35 Lower Threshold Non Critical = 10.700000
20130430 09:29:35 Upper Threshold Non Critical = 13.250000
20130430 09:29:35 Lower Threshold Critical = 10.650000
20130430 09:29:35 Upper Threshold Critical = 13.300000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +5V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 5.020000
20130430 09:29:35 Lower Threshold Non Critical = 4.480000
20130430 09:29:35 Upper Threshold Non Critical = 5.530000
20130430 09:29:35 Lower Threshold Critical = 4.440000
20130430 09:29:35 Upper Threshold Critical = 5.560000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +3.3VSB
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.240000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +3.3V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 3.280000
20130430 09:29:35 Lower Threshold Non Critical = 2.920000
20130430 09:29:35 Upper Threshold Non Critical = 3.640000
20130430 09:29:35 Lower Threshold Critical = 2.900000
20130430 09:29:35 Upper Threshold Critical = 3.670000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 +1.5V
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.520000
20130430 09:29:35 Lower Threshold Non Critical = 1.330000
20130430 09:29:35 Upper Threshold Non Critical = 1.650000
20130430 09:29:35 Lower Threshold Critical = 1.320000
20130430 09:29:35 Upper Threshold Critical = 1.660000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 CPU2 DIMM
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.580000
20130430 09:29:35 Lower Threshold Non Critical = 1.190000
20130430 09:29:35 Upper Threshold Non Critical = 1.640000
20130430 09:29:35 Lower Threshold Critical = 1.190000
20130430 09:29:35 Upper Threshold Critical = 1.650000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 CPU2 Vcore
20130430 09:29:35 sensorType = 3 - Voltage
20130430 09:29:35 BaseUnits = 5
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 1.040000
20130430 09:29:35 Lower Threshold Non Critical = 0.820000
20130430 09:29:35 Upper Threshold Non Critical = 1.350000
20130430 09:29:35 Lower Threshold Critical = 0.810000
20130430 09:29:35 Upper Threshold Critical = 1.360000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Element Name = System Board 1 System Temp
20130430 09:29:35 sensorType = 2 - Temperature
20130430 09:29:35 BaseUnits = 2
20130430 09:29:35 Scaled by = 0.010000
20130430 09:29:35 Current Reading = 36.000000
20130430 09:29:35 Lower Threshold Non Critical = -5.000000
20130430 09:29:35 Upper Threshold Non Critical = 75.000000
20130430 09:29:35 Lower Threshold Critical = -7.000000
20130430 09:29:35 Upper Threshold Critical = 77.000000
20130430 09:29:35 Element Op Status = 2
20130430 09:29:35 Check classe CIM_Memory
20130430 09:29:35 Element Name = CPU 2 Level-1 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = CPU 2 Level-2 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = CPU 2 Level-3 Cache
20130430 09:29:35 Element Op Status = 0
20130430 09:29:35 Element Name = Memory
20130430 09:29:35 Check classe CIM_Processor
20130430 09:29:36 Element Name = CPU 2
20130430 09:29:36 Family = 179
20130430 09:29:36 CurrentClockSpeed = 1866MHz
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Check classe CIM_RecordLog
20130430 09:29:36 Check classe OMC_DiscreteSensor
20130430 09:29:36 Element Name = Power Supply 1 PS Status: Failure status
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Element Name = System Chassis 1 Intrusion: General Chassis intrusion
20130430 09:29:36 Element Op Status = 2
20130430 09:29:36 Element Name = Processor 2 CPU2 Temp
20130430 09:29:36 Check classe OMC_Fan
20130430 09:29:37 Element Name = Fan8
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan7
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan5
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan2
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Element Name = Fan1
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Check classe OMC_PowerSupply
20130430 09:29:37 Element Name = Power Supply 1
20130430 09:29:37 Element Op Status = 2
20130430 09:29:37 Check classe VMware_StorageExtent
20130430 09:29:38 Element Name = Drive 252_5 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38 Element Op Status = 2
20130430 09:29:38 Element Name = Drive 252_4 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38 Element Op Status = 2
20130430 09:29:38 Check classe VMware_Controller
20130430 09:29:38 Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:38 Element Op Status = 3
20130430 09:29:38 GLobal exit set to WARNING
20130430 09:29:38 Check classe VMware_StorageVolume
20130430 09:29:39 Element Name = RAID 1 StorageVolume Logical Volume 500605B00418BB20_0 on controller 500605B00418BB20, Drives( - OPTIMAL
20130430 09:29:39 Element Op Status = 2
20130430 09:29:39 Check classe VMware_Battery
20130430 09:29:39 Element Name = Battery 934 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 11
20130430 09:29:39 Check classe VMware_SASSATAPort
20130430 09:29:39 Element Name = Port 0 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 2
20130430 09:29:39 Element Name = Port 1 on Controller 500605B00418BB20
20130430 09:29:39 Element Op Status = 2
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) -\
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
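With this much output it can help to filter for the elements whose status stands out. The following Python sketch pairs each "Element Name" line of the verbose output with the "Element Op Status" line that follows it and reports every element whose status is neither 0 nor 2. In the CIM OperationalStatus enumeration, 0 means Unknown, 2 means OK and 3 means Degraded. Treating 0 and 2 as unremarkable is an assumption of this sketch, not necessarily the plugin's own logic, and the function name noteworthy_elements is hypothetical.

```python
import re

# Matches lines such as "20130430 09:29:34 Element Op Status = 3".
LINE = re.compile(
    r"^\d{8} \d{2}:\d{2}:\d{2}\s+(?P<key>[^=]+?)\s*=\s*(?P<value>.*)$"
)

# Status values this sketch treats as unremarkable (assumption).
QUIET_STATUSES = {0, 2}


def noteworthy_elements(verbose_output):
    """Return (element name, op status) pairs whose status is not 0 or 2."""
    findings = []
    current_name = None
    for raw in verbose_output.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            # Lines without "key = value", e.g. "Check classe ...", are skipped.
            continue
        key = match.group("key")
        value = match.group("value").strip()
        if key == "Element Name":
            current_name = value
        elif key == "Element Op Status" and current_name is not None:
            status = int(value)
            if status not in QUIET_STATUSES:
                findings.append((current_name, status))
            current_name = None
    return findings
```

Applied to the output above, the sketch would report the LSI controller with status 3 (once per class in which it appears) and the battery with status 11.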
Workaround for restricted user permissions without vCenter
If no vCenter Server is available, a new user (e.g. "tkmon") with read-only permissions can be created as described here. Because ESXi 5.1 no longer supports local groups, the user must be added to the root group via SSH.
/etc/group
root:x:0:root,tkmon
SSH access for this user can then be blocked by setting /sbin/nologin as the login shell.
/etc/passwd
tkmon:x:1000:1000:ESXi User:/:/sbin/nologin
In our tests, this procedure resulted in the sensors being readable, SSH login being impossible, and the vSphere Client allowing read-only access only.
We would like to point out explicitly, however, that this procedure cannot be supported by either VMware or Thomas Krenn.
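The two file entries shown above can be checked mechanically. The following Python sketch uses the standard colon-separated layouts of /etc/group (name, password, GID, member list) and /etc/passwd (name, password, UID, GID, comment, home directory, shell). The function names are hypothetical, and the sketch only inspects lines passed in as strings rather than touching an ESXi host.

```python
def has_root_group_membership(group_line, user):
    """Check whether user is listed as a member in an /etc/group line for root."""
    name, _password, _gid, members = group_line.strip().split(":", 3)
    return name == "root" and user in members.split(",")


def login_shell(passwd_line):
    """Return the login shell, the seventh field of an /etc/passwd line."""
    fields = passwd_line.strip().split(":")
    return fields[6]
```

For the entries above, has_root_group_membership("root:x:0:root,tkmon", "tkmon") is true and login_shell("tkmon:x:1000:1000:ESXi User:/:/sbin/nologin") returns "/sbin/nologin".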
Integration into Icinga
There are several ways to define an Icinga command (commands.cfg). The simplest form under Debian/Ubuntu is:
# 'check_esxi_hardware' command definition
define command{
        command_name    check_esxi_hardware
        command_line    /usr/lib/nagios/plugins/check_esxi_hardware.py -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$
}
Further variants can be found here.
Credit
Many thanks to Sascha Peters for this valuable tip!
Author: Christoph Mitasch
