Monitoring VMware ESXi Hardware with Nagios or Icinga

From Thomas Krenn Wiki
VMware vSphere 5 provides integrated monitoring of the installed server hardware through the corresponding CIM providers. The check_esxi_hardware.py plugin makes it easy to monitor these hardware components with Nagios or Icinga.

CIM Provider Requirements

The CIM provider must pass the hardware status information on to ESXi. This is the case, for example, with the CIM provider for LSI RAID controllers.

Note: The CIM provider for Adaptec RAID controllers is not suitable for this purpose (see "Adaptec RAID Controller in VMware überwachen - Installation CIM Provider und aacraid Treiber").

Plugin

The plugin is available for download from the following website:

Information on exchange.nagios.org:

Using the Plugin

The plugin was tested on a Thomas Krenn server with an X8DT3 mainboard. ESXi 5.1 with an integrated LSI CIM provider was installed on this server (Thomas Krenn provides this image in its download area). This makes it possible to monitor the status of the LSI 9260-4i RAID controller as well.

The plugin requires Python and the pywbem library. On Debian/Ubuntu, the library can be installed with:

apt-get install python-pywbem

The plugin can then be tested on the command line.

The plugin supports the following parameters, among others:

  • -H ... IP address of the VMware ESXi server
  • -U ... username, or path to a file containing username and password (file:/path/to/.file)
  • -P ... password, or path to the password file (file:/path/to/.file)
  • -v ... verbose; shows all sensors that are queried

For testing, we use the root user of the ESXi server. In a production environment, a dedicated user that is only permitted to read the sensors should be created on the vCenter server.

The plugin can be called as follows:

python check_esxi_hardware.py -H 10.X.X.X -U root -P password
WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)  WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) - \
Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
echo $?
1

In this case the plugin returns a warning because the RAID controller battery (BBU) is not yet fully charged.
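The value printed by `echo $?` is the plugin's exit status, which follows the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a minimal sketch, a wrapper script could run the check and translate that status into a name. The helper names `status_name` and `run_check` and the default plugin path are assumptions made for illustration; they are not part of the plugin.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def status_name(exit_code):
    """Map a plugin exit code to its Nagios state name."""
    return NAGIOS_STATES.get(exit_code, "UNKNOWN")


def run_check(host, user, password, plugin="check_esxi_hardware.py"):
    """Run the plugin (path is an assumption) and return (state, output)."""
    result = subprocess.run(
        [sys.executable, plugin, "-H", host, "-U", user, "-P", password],
        capture_output=True,
        text=True,
    )
    return status_name(result.returncode), result.stdout.strip()


# The exit status 1 from the example call maps to WARNING.
print(status_name(1))  # prints "WARNING"
```

Any exit code outside 0 to 3 is treated as UNKNOWN here, which is a choice of this sketch.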

The plugin's website recommends supplying the password in a file. That way the password does not appear in the process list while the check is running. There are two variants:

  • store only the password in the file
    • python check_esxi_hardware.py -H 10.X.X.X -U root -P file:/path/to/.file
  • store username and password in the file, separated by a space
    • python check_esxi_hardware.py -H 10.X.X.X -U file:/path/to/.file -P file:/path/to/.file
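Since the file holds credentials, it makes sense to restrict its permissions to the owning user. The following is a minimal sketch of creating the second variant (username and password separated by a space); the function names are hypothetical, and writing the file without a trailing newline is a cautious assumption rather than a documented requirement of the plugin.

```python
import os
import stat


def write_credentials_file(path, username, password):
    """Write "username password" to path, accessible by the owner only."""
    # O_EXCL refuses to overwrite an existing file; 0o600 = owner read/write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        # No trailing newline, in case the reader does not strip it.
        handle.write(f"{username} {password}")


def is_owner_only(path):
    """Return True if neither group nor others have any permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

The resulting path is then passed to the plugin as `file:/path/to/.file` for both -U and -P, as shown in the list above.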

The "-v" option is also useful. It shows all queried sensors along with their status codes.

python check_esxi_hardware.py -H 10.1.102.143 -U tkmon -P relation -v
20130430 09:29:33 Connection to https://10.1.102.143
20130430 09:29:33 Check classe OMC_SMASHFirmwareIdentity
20130430 09:29:33   Element Name = System BIOS
20130430 09:29:33     VersionString = 2.0a
20130430 09:29:33 Check classe CIM_Chassis
20130430 09:29:33   Element Name = Chassis
20130430 09:29:33     Manufacturer = Supermicro
20130430 09:29:33     SerialNumber = 1234567890
20130430 09:29:33     Model = X8DT3
20130430 09:29:33     Element Op Status = 0
20130430 09:29:33 Check classe CIM_Card
20130430 09:29:34   Element Name = Motherboard
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34 Check classe CIM_ComputerSystem
20130430 09:29:34   Element Name = System Board 7:1
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34   Element Name = localhost
20130430 09:29:34   Element Name = Hardware Management Controller (Node 0)
20130430 09:29:34     Element Op Status = 0
20130430 09:29:34   Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:34     Element Op Status = 3
20130430 09:29:34 GLobal exit set to WARNING
20130430 09:29:34 Check classe CIM_NumericSensor
20130430 09:29:35   Element Name = Memory Device 12 P2-DIMM3B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 42.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 11 P2-DIMM3A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 45.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 10 P2-DIMM2B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 9 P2-DIMM2A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 41.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 8 P2-DIMM1B Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Memory Device 7 P2-DIMM1A Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 39.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 80.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 8 Fan8
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1890.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 7 Fan7
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1890.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 5 Fan5
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 945.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 2 Fan2
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1080.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = Fan Device 1 Fan1
20130430 09:29:35     sensorType = 5 - Tachometer
20130430 09:29:35     BaseUnits = 19
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 945.000000
20130430 09:29:35     Lower Threshold Non Critical = 675.000000
20130430 09:29:35     Upper Threshold Non Critical = 34155.000000
20130430 09:29:35     Lower Threshold Critical = 540.000000
20130430 09:29:35     Upper Threshold Critical = 34290.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 VBAT
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.240000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +12V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 12.080000
20130430 09:29:35     Lower Threshold Non Critical = 10.700000
20130430 09:29:35     Upper Threshold Non Critical = 13.250000
20130430 09:29:35     Lower Threshold Critical = 10.650000
20130430 09:29:35     Upper Threshold Critical = 13.300000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +5V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 5.020000
20130430 09:29:35     Lower Threshold Non Critical = 4.480000
20130430 09:29:35     Upper Threshold Non Critical = 5.530000
20130430 09:29:35     Lower Threshold Critical = 4.440000
20130430 09:29:35     Upper Threshold Critical = 5.560000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +3.3VSB
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.240000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +3.3V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 3.280000
20130430 09:29:35     Lower Threshold Non Critical = 2.920000
20130430 09:29:35     Upper Threshold Non Critical = 3.640000
20130430 09:29:35     Lower Threshold Critical = 2.900000
20130430 09:29:35     Upper Threshold Critical = 3.670000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 +1.5V
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.520000
20130430 09:29:35     Lower Threshold Non Critical = 1.330000
20130430 09:29:35     Upper Threshold Non Critical = 1.650000
20130430 09:29:35     Lower Threshold Critical = 1.320000
20130430 09:29:35     Upper Threshold Critical = 1.660000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 CPU2 DIMM
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.580000
20130430 09:29:35     Lower Threshold Non Critical = 1.190000
20130430 09:29:35     Upper Threshold Non Critical = 1.640000
20130430 09:29:35     Lower Threshold Critical = 1.190000
20130430 09:29:35     Upper Threshold Critical = 1.650000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 CPU2 Vcore
20130430 09:29:35     sensorType = 3 - Voltage
20130430 09:29:35     BaseUnits = 5
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 1.040000
20130430 09:29:35     Lower Threshold Non Critical = 0.820000
20130430 09:29:35     Upper Threshold Non Critical = 1.350000
20130430 09:29:35     Lower Threshold Critical = 0.810000
20130430 09:29:35     Upper Threshold Critical = 1.360000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35   Element Name = System Board 1 System Temp
20130430 09:29:35     sensorType = 2 - Temperature
20130430 09:29:35     BaseUnits = 2
20130430 09:29:35     Scaled by = 0.010000 
20130430 09:29:35     Current Reading = 36.000000
20130430 09:29:35     Lower Threshold Non Critical = -5.000000
20130430 09:29:35     Upper Threshold Non Critical = 75.000000
20130430 09:29:35     Lower Threshold Critical = -7.000000
20130430 09:29:35     Upper Threshold Critical = 77.000000
20130430 09:29:35     Element Op Status = 2
20130430 09:29:35 Check classe CIM_Memory
20130430 09:29:35   Element Name = CPU 2 Level-1 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = CPU 2 Level-2 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = CPU 2 Level-3 Cache
20130430 09:29:35     Element Op Status = 0
20130430 09:29:35   Element Name = Memory
20130430 09:29:35 Check classe CIM_Processor
20130430 09:29:36   Element Name = CPU 2
20130430 09:29:36     Family = 179
20130430 09:29:36     CurrentClockSpeed = 1866MHz
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36 Check classe CIM_RecordLog
20130430 09:29:36 Check classe OMC_DiscreteSensor
20130430 09:29:36   Element Name = Power Supply 1 PS Status: Failure status
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36   Element Name = System Chassis 1 Intrusion: General Chassis intrusion
20130430 09:29:36     Element Op Status = 2
20130430 09:29:36   Element Name = Processor 2 CPU2 Temp
20130430 09:29:36 Check classe OMC_Fan
20130430 09:29:37   Element Name = Fan8
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan7
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan5
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan2
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37   Element Name = Fan1
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37 Check classe OMC_PowerSupply
20130430 09:29:37   Element Name = Power Supply 1
20130430 09:29:37     Element Op Status = 2
20130430 09:29:37 Check classe VMware_StorageExtent
20130430 09:29:38   Element Name = Drive 252_5 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38     Element Op Status = 2
20130430 09:29:38   Element Name = Drive 252_4 on controller 500605B00418BB20 Fw: n/a - UNCONFIGURED GOOD
20130430 09:29:38     Element Op Status = 2
20130430 09:29:38 Check classe VMware_Controller
20130430 09:29:38   Element Name = Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)
20130430 09:29:38     Element Op Status = 3
20130430 09:29:38 GLobal exit set to WARNING
20130430 09:29:38 Check classe VMware_StorageVolume
20130430 09:29:39   Element Name = RAID 1 StorageVolume Logical Volume 500605B00418BB20_0 on controller 500605B00418BB20, Drives( - OPTIMAL
20130430 09:29:39     Element Op Status = 2
20130430 09:29:39 Check classe VMware_Battery
20130430 09:29:39   Element Name = Battery 934 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 11
20130430 09:29:39 Check classe VMware_SASSATAPort
20130430 09:29:39   Element Name = Port 0 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 2
20130430 09:29:39   Element Name = Port 1 on Controller 500605B00418BB20
20130430 09:29:39     Element Op Status = 2
 WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i)  WARNING : Controller 500605B00418BB20 (LSI MegaRAID SAS 9260-4i) -\
 Server: Supermicro X8DT3 s/n: 1234567890 System BIOS: 2.0a 2010-09-14
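The verbose output is long, so it can help to reduce it to the elements and their status codes. The sketch below parses the "Element Name" and "Element Op Status" lines shown above; the function names are hypothetical. It assumes the codes follow the CIM OperationalStatus value map, in which 2 means OK and 3 means Degraded, which matches the controller that triggers the warning in this output.

```python
import re

NAME_RE = re.compile(r"Element Name = (.*)$")
STATUS_RE = re.compile(r"Element Op Status = (\d+)\s*$")


def parse_op_status(verbose_output):
    """Return (element name, op status) pairs from the plugin's -v output."""
    pairs = []
    current = None
    for line in verbose_output.splitlines():
        if "Check classe" in line:
            current = None  # a new CIM class starts, forget the last element
            continue
        name = NAME_RE.search(line)
        if name:
            current = name.group(1).strip()
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            pairs.append((current, int(status.group(1))))
            current = None
    return pairs


def with_status(pairs, wanted):
    """Return the names of all elements whose op status equals wanted."""
    return [name for name, status in pairs if status == wanted]
```

Elements that report no status line, such as "localhost" in the output above, are simply skipped. Calling `with_status(parse_op_status(output), 3)` would list the degraded elements.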

Workaround for Restricted User Permissions without vCenter

If no vCenter server is available, a new user (e.g. "tkmon") with read-only permissions can be created as described here. Because ESXi 5.1 no longer supports local groups, the user must be added to the root group via SSH.

/etc/group

root:x:0:root,tkmon

Afterwards, SSH access can be disabled for this user by setting /sbin/nologin as the login shell.

/etc/passwd

tkmon:x:1000:1000:ESXi User:/:/sbin/nologin

In our tests, this procedure had the result that the sensors can be read, SSH login is not possible, and the vSphere Client allows read-only access only.

However, we would like to point out explicitly that this procedure cannot be supported by either VMware or Thomas Krenn.

Integration into Icinga

There are several ways to define an Icinga command (commands.cfg). The simplest form on Debian/Ubuntu is:

# 'check_esxi_hardware' command definition
define command{
command_name check_esxi_hardware
command_line /usr/lib/nagios/plugins/check_esxi_hardware.py -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$
}
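In a service definition, this command is referenced through a check_command of the form `check_esxi_hardware!<user>!<password>`, where the values after each "!" become $ARG1$ and $ARG2$. The following sketch only illustrates that substitution; it is not how Icinga itself is implemented, and the function name and the address 10.0.0.5 are hypothetical.

```python
def expand_command_line(command_line, host_address, check_command):
    """Substitute $HOSTADDRESS$ and $ARGn$ in a command_line template."""
    # "name!arg1!arg2" -> command name followed by its arguments.
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded


template = (
    "/usr/lib/nagios/plugins/check_esxi_hardware.py "
    "-H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$"
)
print(expand_command_line(
    template, "10.0.0.5", "check_esxi_hardware!tkmon!file:/path/to/.file"
))
```

Passing a `file:` path as the second argument keeps the password out of the Icinga configuration and out of the process list.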

Further variants can be found here.

Credit

Many thanks to Sascha Peters for this valuable tip!


Author: Christoph Mitasch

