Ceph Performance Guide - Sizing & Testing

From Thomas-Krenn-Wiki

When setting up a new Proxmox VE Ceph cluster, many factors are relevant. Proper hardware sizing, the configuration of Ceph, as well as thorough testing of drives, the network, and the Ceph pool have a significant impact on the system's achievable performance. In this article, you will learn how to plan a Proxmox Ceph cluster. The article also assists with troubleshooting in case of Ceph performance issues.

Ceph Nodes, Ceph OSDs, Ceph Pool

The following terms are used in this article:

  • Nodes: the minimum number of nodes required for using Ceph is 3.
  • Drives: each of these nodes requires at least 4 storage drives (OSDs).
  • OSD: an OSD (Object Storage Daemon) is a process responsible for storing data on a drive assigned to the OSD.
  • Ceph Cluster: a cluster therefore consists of at least 3 nodes, each with 4 drives (OSDs), which together form the pool (= 12 disks).
  • Pool: the default replication factor for Proxmox VE Ceph pools is 3 (triple replication, i.e. one copy per host).
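The minimal layout above translates directly into usable capacity. A quick sketch, assuming hypothetical 4 TB drives (the drive size is an example, not a recommendation):

```shell
# Usable capacity of a replicated Ceph pool (sketch; drive size is an example).
OSDS=12              # 3 nodes x 4 drives
DRIVE_TB=4           # hypothetical 4 TB per drive
SIZE=3               # replication factor (Proxmox VE default)
RAW=$((OSDS * DRIVE_TB))
USABLE=$((RAW / SIZE))
echo "raw: ${RAW} TB, usable: ${USABLE} TB"   # raw: 48 TB, usable: 16 TB
```

Note that you should additionally reserve headroom (nearfull ratio), so the cluster can re-replicate the data of a failed drive or node.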

Performance Factors

OSD & Pool Configuration

The number of OSDs in a cluster impacts the overall performance of the cluster. More OSDs allow for higher parallel processing and load distribution. However, a higher number of OSDs also increases the requirements for network bandwidth and the cluster's resources (CPU/RAM).

Equally important is the number of nodes in the cluster. More nodes provide additional resources and space for distributing OSDs. This allows read and write operations on the storage to spread across multiple nodes. Naturally, this also affects the network, as the required aggregate bandwidth grows with the number of nodes. Sufficient switch ports and appropriate switching capacity should be in place.

Another critical point is the pool setup. It affects the availability and data security of the data in the cluster. A pool consists of a group of OSDs (defined via Ceph device classes) that are responsible for data storage. The replication factor determines how often an object is replicated across multiple OSDs (on different hosts or even physical locations) using CRUSH. A higher replication factor improves data security but also requires more storage space.

Additionally, it increases latency for write operations, as replicas typically need to be written at least twice (for a replication factor of SIZE=3) before an acknowledgment (ACK) is sent to the storage client.

With too few OSDs, the load is concentrated on a small number of drives, which can lead to bottlenecks and poor performance. An excessive number of OSDs, however, increases the demands on network bandwidth and cluster resources. It is essential to find the right balance and plan sufficient resources for the deployment.

Conclusion: It is important to carefully plan and monitor the number of OSDs, nodes, and the network for Ceph to maximize performance and availability of the Ceph cluster while avoiding pitfalls.

CPU & RAM

Depending on the number of Ceph drives, appropriate resources for CPU and RAM should be planned.[1][2][3] The more OSDs are configured in the system, the more CPU cores and RAM are required.
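As a rough starting point, the RAM demand can be estimated from the OSD count. A sketch, assuming the BlueStore default osd_memory_target of 4 GiB per OSD and a hypothetical 16 GiB base reservation for the OS and Ceph services:

```shell
# Rough per-node RAM estimate (rule of thumb, not a vendor guarantee).
OSDS_PER_NODE=4
OSD_RAM_GIB=4        # osd_memory_target default (BlueStore)
BASE_GIB=16          # assumed headroom for OS, MON/MGR and page cache
TOTAL=$((OSDS_PER_NODE * OSD_RAM_GIB + BASE_GIB))
echo "plan at least ${TOTAL} GiB RAM per node"   # 32 GiB in this example
```

During recovery and backfill, OSDs can temporarily use more memory than their target, so erring on the generous side is advisable.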

Since sizing also depends on the selection and number of Ceph services (Ceph MON, Ceph MGR, Ceph MDS), we are happy to assist you with sizing your next Ceph or Proxmox Ceph HCI cluster.

Network

The most significant factor for Ceph performance is the network setup used for Ceph. Why is that? Many of the services relevant to Ceph are entirely network-based. The individual servers (nodes) communicate purely over the network and continuously exchange data and information with each other. They also write data replicas over the network to the respective drives (OSDs) of other nodes. This is why the network is such a decisive factor for a high-performance Ceph cluster.

The following applies:

  • The higher the Gbit/s, the better, and
  • The lower the latency in the network, the better.
  • At least 10Gbit/s when using HDDs
  • For 10Gbit/s, fiber optics are naturally better than copper. For new installations, however, we recommend at least 25Gbit/s. Fiber optic latency can also be optimized by choosing the appropriate cable length for the intended use. At Thomas-Krenn, we recommend starting with a 3-node Ceph cluster using direct cabling. This saves on switch costs (since it’s directly cabled) and allows for maximum performance (4x25 Gbit/s) in Ceph.
  • At least 25Gbit/s when using NVMe SSDs (we currently test with network cards up to 100Gbit/s).

Additionally, the use of Jumbo Frames can improve Ceph performance.
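Jumbo frames only help if they are configured end to end (all nodes and switches on the Ceph network). A quick sanity check, sketched here with an assumed MTU of 9000; the interface name and peer IP must be adapted to your setup:

```shell
# Verify jumbo frames end to end (MTU value is the common 9000-byte setting).
MTU=9000
# Largest ICMP payload that still fits in one frame: MTU minus the
# 20-byte IP header and the 8-byte ICMP header.
PAYLOAD=$((MTU - 28))
echo "test payload: ${PAYLOAD} bytes"   # 8972 bytes for MTU 9000
# On a Ceph node (run manually, adapt interface and peer IP):
#   ip link set bond1 mtu 9000
#   ping -M do -s ${PAYLOAD} <peer-ip>  # -M do forbids fragmentation
```

If the ping fails with "message too long", some hop on the path is still running with a smaller MTU.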

Drives

The performance of a Ceph system is also significantly influenced by the performance of the drives used. When using SSDs, the following factors should be considered:

  • DWPD (Drive Writes Per Day): this value indicates how many times the drive's full capacity can be written per day over the warranty period. A higher DWPD value means the drive is more durable.
  • TBW (Total Bytes Written): this value represents the total amount of data that can be written to a drive over its lifetime. A higher TBW value means the drive is more durable.
  • IOPS (Input/Output Operations Per Second): this value specifies the number of input and output operations per second that a drive can perform. Typically, block operations of 4 kB in size are considered. A higher IOPS value means that more operations can be performed per second. The queue depth, which describes how many I/O operations are performed in parallel by the test system, is also relevant.
  • Throughput: this value indicates the speed at which data can be written to or read from a drive. A higher throughput value means the drive is faster.
  • Latency: latency refers to the time a drive needs to fully respond to a request, both when writing and reading data.
  • PLP (Power Loss Protection): capacitors ensure that the drive's cache is fully written to persistent flash in the event of a power loss on the server, preventing data loss. This has the general advantage that the write ACK (acknowledgment) can be sent as soon as data reaches the cache, which can result in increased performance.

Performance Measurement

Drive Performance

To test the performance of an individual drive, the tool fio can be used. Fio allows different write and read operations to be performed on the drive and outputs the results in terms of IOPS and throughput. The following command can be used to test the drive /dev/nvme0n1 for IOPS using a block size of 4K:

fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
Parameter Explanation
ioengine=libaio sets the I/O engine for fio
filename=/dev/nvme0n1 specifies the device to be tested
direct=1 enables direct I/O without cache (no page cache, no RAM)
sync=1 enables synchronous I/O mode (synchronous writes)
rw=write sets the test to write mode
bs=4K sets the block size of the test
numjobs=1 sets the number of parallel jobs
iodepth=1 sets the I/O depth per job
runtime=60 sets the duration of the test in seconds
time_based enables time-based mode for the test
name=fio sets the name of the test

This command runs a write test with a block size of 4K on the drive /dev/nvme0n1. The test runs for a duration of 60 seconds and outputs the results in terms of IOPS. Important: The following tests were conducted analogously to the Proxmox Ceph Performance Paper by Proxmox Server Solutions GmbH. These tests are not optimized and represent the absolute minimum performance of a single drive, since neither parallelization (numjobs=1) nor a higher I/O depth (iodepth=1) was configured. We do this so you can compare your results to Proxmox's values.

The actual performance within the VM depends on many other factors and should not be compared to the values here, as there are numerous optimizations and improvements available.

Write

root@PMX1:~# fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=445MiB/s][w=114k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2028745: Wed Feb  1 10:37:38 2023
  write: IOPS=114k, BW=444MiB/s (465MB/s)(26.0GiB/60001msec); 0 zone resets
    slat (nsec): min=1092, max=125185, avg=1238.68, stdev=769.33
    clat (nsec): min=251, max=859795, avg=7242.66, stdev=1484.45
     lat (usec): min=7, max=861, avg= 8.51, stdev= 1.70
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[    8], 10.00th=[    8], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
     | 70.00th=[    8], 80.00th=[    8], 90.00th=[    8], 95.00th=[    8],
     | 99.00th=[    9], 99.50th=[    9], 99.90th=[   13], 99.95th=[   17],
     | 99.99th=[  118]
   bw (  KiB/s): min=433496, max=460096, per=100.00%, avg=454720.47, stdev=4372.66, samples=119
   iops        : min=108374, max=115024, avg=113680.13, stdev=1093.17, samples=119
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.78%, 20=0.18%, 50=0.02%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=11.52%, sys=23.95%, ctx=6816318, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,6816793,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=444MiB/s (465MB/s), 444MiB/s-444MiB/s (465MB/s-465MB/s), io=26.0GiB (27.9GB), run=60001-60001msec

Disk stats (read/write):
  nvme0n1: ios=92/6805234, merge=0/0, ticks=2/45719, in_queue=45722, util=99.91%

Network Test with iperf3

Another critical factor that can influence Ceph performance is the network connection. To test the performance of the Ceph network, the tool iperf3 can be used. With iperf3, the achievable bandwidth of the network can be measured. The best approach is to set up two iperf3 servers and connect to them in parallel from a client to fully utilize the maximum network bandwidth. Important: If you do not achieve the desired bandwidth, increase the number of iperf3 servers and clients.

# iperf3 server on Server 1 and Server 3
root@PMX1:~# iperf3 -s
root@PMX3:~# iperf3 -s

# Two parallel iperf3 clients on Server 2
root@PMX2:~# iperf3 -c 192.168.99.34 -t 3600
root@PMX2:~# iperf3 -c 192.168.99.36 -t 3600

# Check the network load of the Ceph bond: bond1
root@PMX2:~# nload bond1

Ceph OSD Performance

Within Ceph, performance tests can be conducted at various levels. These tests are primarily useful during initial setup to verify correct configurations, but they can also be helpful for debugging performance issues.

ceph tell osd.x bench - Throughput

To test the performance of a single OSD drive, the built-in benchmark ceph tell osd.X bench can be used. The command returns the performance data of the OSD drive with ID X.

root@PMX4:~#  ceph tell osd.1 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.572842187,
    "bytes_per_sec": 1874411222.4402215,
    "iops": 446.89446030622042
}

The output includes the following information:

Parameter Explanation
bytes_written The number of bytes written during the test
blocksize The block size used during the test
elapsed_sec The duration of the test in seconds
bytes_per_sec The average write rate in bytes per second
iops The average number of write operations per second (IOPS)

The output shows that during the test, 1 GB (1,073,741,824 bytes) was written with a block size of 4 MB (4,194,304 bytes) in 0.57 seconds. This corresponds to an average write rate of 1,874,411,222 bytes (= 1.87 GB) per second and an average IOPS value of 446 (considering a block size of 4 MB).
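The two reported values are consistent with each other: the IOPS figure is simply the throughput divided by the block size, which can be verified directly from the output above:

```shell
# iops = bytes_per_sec / blocksize (values taken from the bench output above)
BYTES_PER_SEC=1874411222
BLOCKSIZE=4194304          # 4 MiB
IOPS=$((BYTES_PER_SEC / BLOCKSIZE))
echo "${IOPS} IOPS"        # ~446, matching the reported iops value
```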

ceph tell osd.x bench - IOPS

The command can, of course, also be run for IOPS with a block size of 4K:

root@PMX4:~# ceph tell osd.1 bench 12288000 4096
{
    "bytes_written": 12288000,
    "blocksize": 4096,
    "elapsed_sec": 0.043518787000000003,
    "bytes_per_sec": 282360811.20551449,
    "iops": 68935.744923221311
}

Here, due to the block size of 4K, a higher IOPS value is achieved: 68935 IOPS.

Ceph Pool Performance

Performance tests at the Ceph Pool level require multiple executions and parallelization to obtain a realistic total performance value. Specifically: a single performance test never reflects the overall performance of the system. For example, on a 3-node Ceph cluster, we run the RADOS-Bench Write test 6 to 8 times in parallel to fully utilize the Ceph network. Please keep this in mind when executing these commands.

A pool is a group of OSDs that collectively store and manage data. For an optimal pool, it is essential to select the correct number of PGs (Placement Groups). PGs are small groups within the OSDs responsible for data distribution and redundancy. Too few PGs can lead to poor data distribution, while too many PGs can place unnecessary strain on resources.

  • Important: To accurately measure performance in a Ceph cluster, you should execute the following commands multiple times and in parallel. For this, using tmux is ideal. Running the command individually does not reflect the total system performance, as no parallelized accesses to Ceph occur.
  • Excursus: PG Autoscaler: Using the PG Autoscaler, Ceph can automatically calculate the optimal number of PGs per pool, if desired. However, this is only recommended if you do not expect atypical data growth or data reduction in the system. This means you typically do not generate large amounts of data suddenly (hundreds of GB) or delete significant amounts of data.
  • Why? Because with constant data growth or reduction, the PG Autoscaler would frequently adjust the optimal number of PGs, increasing or decreasing them. This is problematic because increasing or reducing PGs always triggers data redistribution within Ceph. Constant data rebalancing would degrade cluster performance and strain the drives (impacting TBW).

For typical medium-sized enterprises with normal data growth, we recommend using the PG Autoscaler for simplicity.
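If you size PGs manually instead of using the autoscaler, a widely used rule of thumb (a starting point, not a guarantee) is (number of OSDs × 100) / replication factor, rounded up to the next power of two. For the minimal 12-OSD cluster from above:

```shell
# PG rule of thumb: (OSDs * 100) / size, rounded up to a power of two.
OSDS=12
SIZE=3
TARGET=$((OSDS * 100 / SIZE))    # 400
PGS=1
while [ "$PGS" -lt "$TARGET" ]; do PGS=$((PGS * 2)); done
echo "suggested pg_num: ${PGS}"  # 512
```

Remember that this budget applies across all pools sharing the same OSDs, so divide it accordingly if you run several pools.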

rados bench (write) - Throughput

To test the performance of a Ceph pool, the tool rados bench can be used. With rados bench, write operations can be executed on the pool, and the results are output in terms of IOPS and throughput.

rados bench -p vm_nvme 600 write -t 16 --object_size=4MB --no-cleanup
Parameter Explanation
-p vm_nvme specifies the name of the pool where the test will run
600 sets the duration of the test in seconds (600 seconds equals 10 minutes)
write specifies that the test performs write operations
-t 16 sets the number of parallel threads to be used
--object_size=4MB sets the block size
--no-cleanup ensures that the test data created is not deleted after the test (this is needed for read operations)

The results are output in bytes per second, and the test is performed with 16 parallel threads.

Total time run:         600.017
Total writes made:      383501
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2556.6
Stddev Bandwidth:       1098.42
Max bandwidth (MB/sec): 4620
Min bandwidth (MB/sec): 1540
Average IOPS:           639
Stddev IOPS:            274.605
Max IOPS:               1155
Min IOPS:               385
Average Latency(s):     0.0250308
Stddev Latency(s):      0.0178465
Max latency(s):         0.15803
Min latency(s):         0.00460356

rados bench (read) - Throughput

Read operations can also be performed on the pool to test the read performance. Note that rados bench has no "read" mode; the read modes are seq (sequential) and rand (random). The following command tests the sequential read throughput of a Ceph pool by reading back the 4MB objects left behind by the previous write test (--no-cleanup):

rados bench -p vm_nvme 60 seq -t 16
Parameter Explanation
-p vm_nvme specifies the name of the pool where the test is executed
60 sets the duration of the test in seconds (60 seconds equals 1 minute)
seq specifies that the test performs sequential read operations on the previously written objects
-t 16 sets the number of parallel threads to be used

This command runs a read throughput test on the pool vm_nvme, where sequential read operations are performed. The test runs for 60 seconds and outputs the results in bytes per second, using 16 parallel threads.

Total time run:       60.0276
Total reads made:     53525
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   3566.69
Average IOPS:         891
Stddev IOPS:          20.0448
Max IOPS:             935
Min IOPS:             842
Average Latency(s):   0.0176248
Max latency(s):       0.0879089
Min latency(s):       0.00350174

rados bench (write) - IOPS

tbd

rados bench (read) - IOPS

tbd

ceph osd latency

To measure the latency of individual OSD drives under Ceph, the command ceph osd perf can be used. This command provides per-OSD performance data for the cluster: it displays the OSD ID along with its commit and apply latency.

  • Commit Latency: The time an OSD needs to write the journal to the drive.
  • Apply Latency: The time an OSD needs to transfer a write operation to its local disk. This represents the time it takes for a write operation to be fully stored on the OSD.

High commit or apply latency can indicate that the OSD is overloaded and cannot write fast enough, impacting the performance of the entire Ceph cluster. Low commit and apply latency, on the other hand, indicate that the OSD is working correctly and the underlying drive is performing well.
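To spot outliers quickly, the output of ceph osd perf can be filtered, for example with awk. The sample values below are illustrative only, not taken from a real cluster:

```shell
# Flag OSDs whose commit latency exceeds a threshold (in ms).
THRESHOLD=10
SLOW=$(printf '%s\n' \
  "osd  commit_latency(ms)  apply_latency(ms)" \
  "  0                   2                  2" \
  "  1                  25                 25" \
  "  2                   3                  3" |
  awk -v t="$THRESHOLD" 'NR > 1 && $2 + 0 > t { print "osd." $1 " commit latency " $2 " ms" }')
echo "$SLOW"                      # osd.1 commit latency 25 ms
# In a live cluster, pipe the real output instead:
#   ceph osd perf | awk 'NR > 1 && $2 > 10 { print }'
```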

Conclusion - Overview of Relevant Factors

In summary, the following factors influence Ceph performance:

  • Network Connection for the Ceph network (low latency, high bandwidth)
  • Use of appropriate Drives (flash, low latency, NVMe)
  • Resource requirements: RAM and CPU
  • Pool Configuration (number of PGs, replication factor)

Careful monitoring and measurement of these factors can help to optimize the performance of the Ceph system. For example, we recommend using Checkmk.

References

Author: Jonas Sterr

Jonas Sterr has been working for Thomas-Krenn for several years. Originally employed as a trainee in technical support and then in hosting (formerly Filoo), Mr. Sterr now mainly deals with the topics of storage (SDS / Huawei / Netapp), virtualization (VMware, Proxmox, HyperV) and network (switches, firewalls) in product management at Thomas-Krenn.AG in Freyung.


Translator: Alina Ranzinger

Alina has been working at Thomas-Krenn.AG since 2024. After her training as a multilingual business assistant, she joined Product Management as an assistant and is responsible for translating texts and organizing the department.


Related articles

Ceph - increase maximum recovery & backfilling speed
Determine the optimal nearfull ratio in the Proxmox Ceph 3-node cluster
Proxmox GUI remove old Ceph Health Warnings