Monitoring a Proxmox VE Ceph host with checkmk

From Thomas-Krenn-Wiki

Prerequisites

To monitor a Proxmox VE Ceph cluster using checkmk, the following components are required:

  • an installed Linux server (preferably Debian 12)
  • a complete installation of Docker
  • a fully functional Checkmk Raw Edition container (the free edition)
  • a Proxmox Ceph HCI cluster with Ceph installed
  • a Proxmox VE host already created and registered in checkmk

Agent mk_ceph

In the checkmk UI, proceed as follows:

Agent Installation (Ceph Host)

On each Ceph host, proceed as follows:

cd /usr/lib/check_mk_agent/plugins
wget http://10.2.1.179:8006/cmk/check_mk/agents/plugins/mk_ceph
chmod +x mk_ceph

Config File (Ceph Host)

The configuration file under `/usr/lib/check_mk_agent/plugins/mk_ceph` must be adjusted as follows:

USER=client.admin
KEYRING=/etc/pve/priv/ceph.client.admin.keyring

The rest of the file must remain unchanged. The file must also be made executable with `chmod +x mk_ceph` (see above), as otherwise the checks will not function.
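The two values can also be set non-interactively, for example with `sed`. A minimal sketch: the path `/tmp/mk_ceph_demo` and its initial contents are stand-ins for illustration only; on a real host you would point the commands at the actual plugin file instead.

```shell
# Demo copy standing in for /usr/lib/check_mk_agent/plugins/mk_ceph
PLUGIN=/tmp/mk_ceph_demo
printf 'USER=client.someuser\nKEYRING=/some/other/keyring\n' > "$PLUGIN"

# Set the two variables as described in the article
sed -i 's|^USER=.*|USER=client.admin|' "$PLUGIN"
sed -i 's|^KEYRING=.*|KEYRING=/etc/pve/priv/ceph.client.admin.keyring|' "$PLUGIN"

# The plugin must be executable, otherwise the checks will not run
chmod +x "$PLUGIN"

grep -E '^(USER|KEYRING)=' "$PLUGIN"
```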

Activate Ceph Checks

To activate the checks, perform a **Service Discovery** in the checkmk interface again. After that, the checks are functional.

MON OSD Ratio Check

We recommend monitoring the `mon_osd_full_ratio` and setting a suitable `nearfull_ratio`. Here is some background information:

The `mon_osd_full_ratio` is a Ceph configuration parameter that determines at what percentage of its capacity an OSD is marked as "full". Once this threshold is reached, no new data is written to the OSD, preventing the disk from filling up completely. By default, the ratio is set to 95% (this is also the Proxmox VE default). A full OSD affects the availability of its pool: to prevent data loss, Ceph switches both the full OSD and the affected pool to read-only mode. It is therefore critical to monitor the pool and all associated OSDs to ensure they do not fill up.
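The threshold behaviour can be illustrated with a small sketch. The OSD names and utilization values below are made up for illustration; on a real cluster the figures would come from `ceph osd df`:

```shell
# Classify sample OSD utilizations against the default Ceph ratios
FULL_RATIO=0.95
NEARFULL_RATIO=0.75

printf 'osd.0 0.52\nosd.1 0.78\nosd.2 0.96\n' | \
  awk -v full="$FULL_RATIO" -v near="$NEARFULL_RATIO" '
    $2+0 >= full+0 { print $1, "FULL (read-only)"; next }
    $2+0 >= near+0 { print $1, "nearfull warning"; next }
                   { print $1, "ok" }'
# osd.0 ok
# osd.1 nearfull warning
# osd.2 FULL (read-only)
```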

It is also important to act quickly when an OSD approaches the defined `mon_osd_full_ratio`, to ensure pool availability and avoid downtime. The best way to resolve this is to add new OSDs: the data is then redistributed, which reduces the percentage utilization of the existing OSD(s). We recommend ordering new drives as soon as utilization reaches 60%. Also consider how many drive failures per host you can tolerate, and calculate the maximum percentage utilization your OSDs should have. We are happy to assist you with this. By setting a `nearfull_ratio`, you receive a warning at a specific percentage, allowing you to take action before system functionality is compromised.
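The sizing consideration above can be sketched as back-of-the-envelope arithmetic. Assuming equal-sized OSDs and that data from failed OSDs is redistributed evenly across the remaining OSDs of the host, the maximum safe utilization is roughly `nearfull_ratio * (N - K) / N`. The values for N and K below are example figures, not a recommendation:

```shell
# How full may each OSD be so that losing K OSDs still stays below nearfull?
N=4            # OSDs per host (example value)
K=1            # tolerated OSD failures (example value)
NEARFULL=0.75  # nearfull_ratio

awk -v n="$N" -v k="$K" -v near="$NEARFULL" \
  'BEGIN { printf "max safe utilization: %.0f%%\n", near * (n - k) / n * 100 }'
# max safe utilization: 56%
```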

The current OSD ratios can be found as follows:

root@PMX4:~# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.75

Then set an appropriate value:

root@PMX4:~# ceph osd set-nearfull-ratio 0.6
osd set-nearfull-ratio 0.6

When an OSD reaches 60% usage, Ceph raises a health warning, which also appears in checkmk under the Ceph health check and correctly warns of potentially imminent downtime.

Ceph Checks Proxmox VE

Author: Jonas Sterr

Jonas Sterr has been working at Thomas-Krenn for several years. Originally employed as a trainee in technical support and later in hosting (formerly Filoo), Mr. Sterr now mainly focuses on storage (SDS / Huawei / NetApp), virtualization (VMware, Proxmox, Hyper-V) and networking (switches, firewalls) in product management at Thomas-Krenn.AG in Freyung.


Related articles

Change hostname in a productive Proxmox Ceph HCI cluster