Monitoring Proxmox VE Ceph Hosts with checkmk

From Thomas-Krenn-Wiki

checkmk is a tool for monitoring server systems. It can also be used to monitor Proxmox VE Ceph hosts. This article describes how to install and set up the checkmk agent plugin for exactly this purpose.

Requirements

To monitor a Proxmox VE Ceph cluster with checkmk, the following components are required:

  • a Linux server installation (Debian preferred)
  • a complete Docker installation[1]
  • a ready-to-use checkmk RAW container[2][3]
  • a Proxmox Ceph HCI cluster
  • a Proxmox VE host that has been created and registered in checkmk

Installation and launch of checkmk

Once these requirements are met, the checkmk agent plugin can be installed on the Ceph hosts.

Agent mk_ceph

The first step is to obtain the mk_ceph agent plugin. This requires a few steps in the checkmk user interface:

  1. In the Setup menu, open Agents and select Linux.
  2. Right-click mk_ceph and choose "Copy link" to copy the plugin path (in this example http://10.2.1.179:8006/cmk/check_mk/agents/plugins/mk_ceph).


Agent installation on Ceph hosts

To install the agent plugin, run the following commands on each Ceph host:

cd /usr/lib/check_mk_agent/plugins
wget http://10.2.1.179:8006/cmk/check_mk/agents/plugins/mk_ceph

The file must be executable, otherwise the checks will not work. Use the following command to set this:

chmod +x mk_ceph
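
To confirm the result of the download and the chmod step on each host, a small check along the following lines may help. This is only a sketch: the function name check_plugin is a hypothetical helper chosen for illustration and is not part of checkmk.

```shell
# Sketch: report whether an agent plugin file exists and is executable.
# check_plugin is a hypothetical helper name, not part of checkmk.
check_plugin() {
    plugin="$1"
    if [ ! -f "$plugin" ]; then
        echo "missing: $plugin"
        return 1
    fi
    if [ ! -x "$plugin" ]; then
        echo "not executable: $plugin"
        return 2
    fi
    echo "ok: $plugin"
}
```

On a Ceph host it could be called as check_plugin /usr/lib/check_mk_agent/plugins/mk_ceph.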

Adjustment of configuration file on Ceph hosts

The following values have to be set in the file /usr/lib/check_mk_agent/plugins/mk_ceph:

USER=client.admin
KEYRING=/etc/pve/priv/ceph.client.admin.keyring

The rest of the file may remain unchanged.
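
If the change has to be made on several hosts, it could be scripted instead of edited by hand. The following is a minimal sketch under the assumption that the file already contains lines beginning with USER= and KEYRING=; this may differ between plugin versions. The function name set_ceph_credentials is hypothetical.

```shell
# Sketch: set USER and KEYRING in the plugin file.
# Assumes the file contains lines starting with "USER=" and "KEYRING=";
# set_ceph_credentials is a hypothetical helper name.
set_ceph_credentials() {
    file="$1"
    user="$2"
    keyring="$3"
    tmp=$(mktemp) || return 1
    # "|" is used as the sed delimiter because the keyring path contains "/".
    sed -e "s|^USER=.*|USER=$user|" \
        -e "s|^KEYRING=.*|KEYRING=$keyring|" \
        "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back with cat so the file keeps its mode (executable bit).
    cat "$tmp" > "$file"
    rm -f "$tmp"
    # Print the resulting settings; a non-zero exit status here means
    # that no matching lines were found and the assumption did not hold.
    grep -E '^(USER|KEYRING)=' "$file"
}
```

A call with the values from this article would be set_ceph_credentials /usr/lib/check_mk_agent/plugins/mk_ceph client.admin /etc/pve/priv/ceph.client.admin.keyring. The temporary file is created outside the plugins directory on purpose, so that no additional file ends up next to the plugin.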

Activation of Ceph checks

To activate the checks, run a service discovery for the host again in the checkmk user interface. After this, the checks are functional.
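
If the discovery does not find any Ceph services, it can help to check on the Ceph host whether the agent output contains Ceph sections at all. The following sketch assumes that the section headers written by the plugin begin with <<<ceph; the exact section names depend on the plugin version. The function name list_ceph_sections is hypothetical.

```shell
# Sketch: list the Ceph section headers found in agent output on stdin.
# Assumes the plugin's section headers start with "<<<ceph";
# list_ceph_sections is a hypothetical helper name.
list_ceph_sections() {
    grep -o '^<<<ceph[^>]*>>>' | sort -u
}
```

On a Ceph host it could be used as check_mk_agent | list_ceph_sections. An empty result suggests that the plugin is not producing output, for example because it is not executable or the keyring path is wrong.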

MON OSD Ratio check

We recommend monitoring mon_osd_full_ratio and setting a suitable nearfull ratio.

Digression - mon_osd_full_ratio

The Ceph configuration parameter mon_osd_full_ratio defines a threshold, as a percentage, for the used share of an OSD drive's capacity. Once this threshold is reached, no new data is written to that OSD. This prevents the drive from filling up completely.

In Proxmox VE, mon_osd_full_ratio is set to 95 percent by default.

If an OSD in a Ceph pool reaches this threshold, Ceph switches the full OSDs and the associated pool to read-only mode. This prevents data loss. It is important to monitor the pool and to react early so that this situation does not occur.

We therefore recommend expanding the pool with additional drives once utilization reaches 60%.

The data is then redistributed, which lowers the utilization of the individual OSDs. This keeps the pool available and avoids downtime.

The capacity requirement and the expected OSD utilization should be considered when planning the hosts. Failure scenarios should be taken into account as well. We are pleased to help you with that. Setting a nearfull ratio gives you a warning at a fixed utilization percentage. This lets you act in good time, before the system loses functionality.
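
To illustrate why failure scenarios matter for the utilization, the following rough estimate may be useful. It is only a sketch under simplifying assumptions: all nodes are equally sized, data is evenly distributed, and the data of a failed node is recovered onto the remaining nodes. The function name util_after_failure is hypothetical.

```shell
# Sketch: estimate OSD utilization after one node fails and its data is
# recovered onto the remaining nodes. Assumes equally sized nodes and evenly
# distributed data; util_after_failure is a hypothetical helper name.
util_after_failure() {
    nodes="$1"   # number of nodes before the failure
    util="$2"    # current average utilization in percent
    awk -v n="$nodes" -v u="$util" 'BEGIN {
        if (n < 2) { print "need at least 2 nodes"; exit 1 }
        printf "%.1f\n", u * n / (n - 1)
    }'
}
```

For example, util_after_failure 4 60 prints 80.0: in a four-node cluster at 60 percent utilization, the three remaining nodes would end up at roughly 80 percent under these assumptions.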

The current OSD ratios can be displayed as follows:

root@PMX4:~# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.75
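
Since the rest of this article talks about percentages, the ratios can be converted for easier reading. The following is a small sketch; the function name ratios_to_percent is hypothetical, and it only relies on the line format shown in the output above.

```shell
# Sketch: convert the ratio lines from "ceph osd dump" into percentages.
# Reads lines such as "full_ratio 0.95" on stdin;
# ratios_to_percent is a hypothetical helper name.
ratios_to_percent() {
    awk '$1 ~ /_ratio$/ { printf "%s %.0f%%\n", $1, $2 * 100 }'
}
```

It could be used as ceph osd dump | ratios_to_percent. For the values shown above, this prints full_ratio 95%, backfillfull_ratio 90% and nearfull_ratio 75%, one per line.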

After this, a suitable value has to be set:

root@PMX4:~# ceph osd set-nearfull-ratio 0.6
osd set-nearfull-ratio 0.6

As soon as an OSD reaches a utilization of 60 percent, a Ceph warning is triggered. This warning also shows up in the Ceph health check in checkmk and warns of a potentially imminent outage in good time.
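
How the three thresholds relate to each other can be sketched as a simple classification. This is a simplified illustration and not how Ceph evaluates the ratios internally; the function name classify_osd is hypothetical, and all values are passed as fractions.

```shell
# Sketch: classify an OSD's utilization against the three thresholds.
# All values are fractions (0.62 = 62 percent); classify_osd is a
# hypothetical helper name, and the comparison is a simplification.
classify_osd() {
    awk -v u="$1" -v near="$2" -v back="$3" -v full="$4" 'BEGIN {
        if (u >= full)      print "full"
        else if (u >= back) print "backfillfull"
        else if (u >= near) print "nearfull"
        else                print "ok"
    }'
}
```

With the nearfull ratio set to 0.6 as above, classify_osd 0.62 0.6 0.9 0.95 prints nearfull, while classify_osd 0.55 0.6 0.9 0.95 prints ok.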

References

Author: Jonas Sterr

Jonas Sterr has been working for Thomas-Krenn for several years. Originally employed as a trainee in technical support and then in hosting (formerly Filoo), Mr. Sterr now mainly deals with the topics of storage (SDS / Huawei / NetApp), virtualization (VMware, Proxmox, Hyper-V) and network (switches, firewalls) in product management at Thomas-Krenn.AG in Freyung.


Translator: Alina Ranzinger

Alina has been working at Thomas-Krenn.AG since 2024. After her training as a multilingual business assistant, she took up her position as assistant in Product Management, where she is responsible for translating texts and for the organisation of the department.

