Ceph Performance Guide - Sizing & Testing
When setting up a new Proxmox VE Ceph cluster, many factors are relevant. Proper hardware sizing, the configuration of Ceph, as well as thorough testing of drives, the network, and the Ceph pool have a significant impact on the system's achievable performance. In this article, you will learn how to plan a Proxmox Ceph cluster. The article also assists with troubleshooting in case of Ceph performance issues.
Ceph Nodes, Ceph OSDs, Ceph Pool
The following terms are used in this article:
- Nodes: the minimum number of nodes required for using Ceph is 3.
- Drives: each of these nodes requires at least 4 storage drives (OSDs).
- OSD: an OSD (Object Storage Daemon) is a process responsible for storing data on a drive assigned to the OSD.
- Ceph Cluster: a cluster therefore consists of at least 3 nodes, each with 4 drives (OSDs), which together form the pool (= 12 disks).
- Pool: the default replication factor for Proxmox VE Ceph pools is 3 (triple replication, 1 copy per host).
Performance Factors
OSD & Pool Configuration
The number of OSDs in a cluster impacts the overall performance of the cluster. More OSDs allow for higher parallel processing and load distribution. However, a higher number of OSDs also increases the requirements for network bandwidth and the cluster's resources (CPU/RAM).
Equally important is the number of nodes in the cluster. More nodes provide additional resources and space for distributing OSDs. This allows read and write operations on the storage to spread across multiple nodes. Naturally, this also affects the network bandwidth, as it increases with more nodes. Sufficient switch ports and appropriate switching capacity should be in place.
Another critical point is the pool setup. It affects the availability and data security of the data in the cluster. A pool consists of a group of OSDs (defined via Ceph device classes) that are responsible for data storage. The replication factor determines how often an object is replicated across multiple OSDs (on different hosts or even physical locations) using CRUSH. A higher replication factor improves data security but also requires more storage space.
Additionally, it increases latency for write operations, as replicas typically need to be written at least twice (for a replication factor of SIZE=3) before an acknowledgment (ACK) is sent to the storage client.
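The storage cost of replication can be estimated directly: the usable capacity of a pool is roughly the raw capacity divided by the replication factor. A minimal sketch, assuming a hypothetical cluster of 12 OSDs with 4 TB each and a replication factor of 3:

```shell
# All values are example assumptions; adjust to your hardware.
OSDS=12        # total number of OSDs in the pool
DRIVE_TB=4     # capacity per drive in TB
SIZE=3         # replication factor
RAW_TB=$(( OSDS * DRIVE_TB ))
USABLE_TB=$(( RAW_TB / SIZE ))
echo "raw: ${RAW_TB} TB, usable at size=${SIZE}: ${USABLE_TB} TB"
```

In practice, plan for even less net capacity, since Ceph warns at the nearfull threshold (85% by default) and you need headroom for recovery after an OSD failure.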
With too few OSDs, the load is concentrated on a small number of drives, which can lead to bottlenecks and poor performance. An excessive number of OSDs, however, increases the demands on network bandwidth and cluster resources. It is essential to find the right balance and plan sufficient resources for the deployment.
Conclusion: carefully plan and monitor the number of OSDs, the number of nodes, and the network in order to maximize the performance and availability of the Ceph cluster while avoiding pitfalls.
CPU & RAM
Depending on the number of Ceph drives, appropriate resources for CPU and RAM should be planned.[1][2][3] The more OSDs are configured in the system, the more CPU cores and RAM are required.
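As a rough sketch of such a sizing calculation (the values below are rule-of-thumb assumptions, e.g. about 4-5 GB of RAM per OSD based on the BlueStore osd_memory_target plus overhead, and a base reserve for the OS and other Ceph services):

```shell
# All values are rule-of-thumb assumptions, not fixed requirements.
OSDS_PER_NODE=4      # OSDs per node (minimum sizing from above)
RAM_PER_OSD_GB=5     # ~4-5 GB per OSD (osd_memory_target plus overhead)
BASE_GB=16           # OS, Ceph MON/MGR and other services
RAM_GB=$(( OSDS_PER_NODE * RAM_PER_OSD_GB + BASE_GB ))
echo "plan at least ${RAM_GB} GB RAM per node"
```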
Since sizing also depends on the selection and number of Ceph services (Ceph MON, Ceph MGR, Ceph MDS), we are happy to assist you with sizing your next Ceph or Proxmox Ceph HCI cluster.
Network
The most significant factor for Ceph performance is the network setup used for Ceph. Why is that? Many of the services relevant to Ceph are entirely network-based. The individual servers (nodes) communicate purely over the network and continuously exchange data and information with each other. They also write data replicas over the network to the respective drives (OSDs) of other nodes. Naturally, this has a major impact on the performance required for a high-performance Ceph network.
The following applies:
- The higher the Gbit/s, the better, and
- The lower the latency in the network, the better.
- At least 10Gbit/s when using HDDs
- For 10Gbit/s, fiber optics are naturally better than copper. For new installations, however, we recommend at least 25Gbit/s. Fiber optic latency can also be optimized by choosing the appropriate cable length for the intended use. At Thomas-Krenn, we recommend starting with a 3-node Ceph cluster using direct cabling. This saves on switch costs (since it’s directly cabled) and allows for maximum performance (4x25 Gbit/s) in Ceph.
- At least 25Gbit/s when using NVMe SSDs (we currently test with network cards up to 100Gbit/s).
Additionally, the use of Jumbo Frames can improve Ceph performance.
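A minimal sketch of enabling and verifying jumbo frames, assuming the Ceph network runs over a bond named bond1 (the interface name and peer IP are placeholders). The ICMP payload for the verification ping is the MTU minus 28 bytes of IPv4 and ICMP headers:

```shell
# Derive the ping payload size for an MTU of 9000:
# 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes overhead
MTU=9000
PAYLOAD=$(( MTU - 28 ))
echo "verify with: ping -M do -s ${PAYLOAD} <peer-ip>"

# Temporarily (assumed bond name bond1):
#   ip link set bond1 mtu 9000
# Persistently: add "mtu 9000" to the bond1 stanza in
# /etc/network/interfaces. All nodes and switches on the
# Ceph network must be configured for the same MTU.
```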
Drives
The performance of a Ceph system is also significantly influenced by the performance of the drives used. When using SSDs, the following factors should be considered:
- DWPD (Drive Writes Per Day): this value indicates how often the drive's full capacity can be written per day before it could potentially fail. A higher DWPD value means the drive is more durable.
- TBW (Total Bytes Written): this value represents the total amount of data that can be written to a drive before it could potentially fail. A higher TBW value means the drive is more durable.
- IOPS (Input/Output Operations Per Second): this value specifies the number of input and output operations per second that a drive can perform. Typically, block operations of 4 kB in size are considered. A higher IOPS value means that more operations can be performed per second. The queue depth, which describes how many I/O operations are performed in parallel by the test system, is also relevant.
- Throughput: this value indicates the speed at which data can be written to or read from a drive. A higher throughput value means the drive is faster.
- Latency: latency refers to the time a drive needs to fully respond to a request, both when writing and reading data.
- PLP (Power Loss Protection): the drive's cache is fully written to the drive in the event of a power loss on the server, preventing data loss. This has the general advantage that the write ACK (acknowledgment) can be sent after writing to the cache, which can result in increased performance.
Performance Measurement
Drive Performance
To test the performance of an individual drive, the tool fio can be used. Fio allows different write and read operations to be performed on the drive and outputs the results in terms of IOPS and throughput. The following command can be used to test the drive /dev/nvme0n1 for IOPS using a block size of 4K:
fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
| Parameter | Explanation |
|---|---|
| ioengine=libaio | sets the I/O engine for fio |
| filename=/dev/nvme0n1 | specifies the device to be tested |
| direct=1 | enables direct I/O without cache (no page cache, no RAM) |
| sync=1 | enables synchronous I/O mode (synchronous writes) |
| rw=write | sets the test to write mode |
| bs=4K | sets the block size of the test |
| numjobs=1 | sets the number of parallel jobs |
| iodepth=1 | sets the I/O depth per job |
| runtime=60 | sets the duration of the test in seconds |
| time_based | enables time-based mode for the test |
| name=fio | sets the name of the test |
This command runs a write test with a block size of 4K on the drive /dev/nvme0n1. The test runs for 60 seconds and outputs the results in terms of IOPS. Important: The following tests were conducted analogously to the Proxmox Ceph Performance Paper by Proxmox Server Solutions GmbH. They are not optimized and represent the absolute minimum performance of a single drive, since neither parallelization (numjobs=1) nor a deeper I/O queue (iodepth=1) was configured. We do this so you can compare your results with Proxmox's values.
The actual performance within the VM depends on many other factors and should not be compared to the values here, as there are numerous optimizations and improvements available.
Write
root@PMX1:~# fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=445MiB/s][w=114k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2028745: Wed Feb 1 10:37:38 2023
write: IOPS=114k, BW=444MiB/s (465MB/s)(26.0GiB/60001msec); 0 zone resets
slat (nsec): min=1092, max=125185, avg=1238.68, stdev=769.33
clat (nsec): min=251, max=859795, avg=7242.66, stdev=1484.45
lat (usec): min=7, max=861, avg= 8.51, stdev= 1.70
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 8], 50.00th=[ 8], 60.00th=[ 8],
| 70.00th=[ 8], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 8],
| 99.00th=[ 9], 99.50th=[ 9], 99.90th=[ 13], 99.95th=[ 17],
| 99.99th=[ 118]
bw ( KiB/s): min=433496, max=460096, per=100.00%, avg=454720.47, stdev=4372.66, samples=119
iops : min=108374, max=115024, avg=113680.13, stdev=1093.17, samples=119
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.78%, 20=0.18%, 50=0.02%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=11.52%, sys=23.95%, ctx=6816318, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,6816793,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=444MiB/s (465MB/s), 444MiB/s-444MiB/s (465MB/s-465MB/s), io=26.0GiB (27.9GB), run=60001-60001msec
Disk stats (read/write):
nvme0n1: ios=92/6805234, merge=0/0, ticks=2/45719, in_queue=45722, util=99.91%
Network Test with iperf3
Another critical factor that can influence Ceph performance is the network connection. To test the performance of the Ceph network, the tool iperf3 can be used. With iperf3, the available bandwidth of the network can be measured. The best approach is to set up two iperf3 servers and connect to them in parallel from a client in order to fully utilize the maximum network bandwidth. Important: If you do not achieve the desired bandwidth, increase the number of iperf3 servers and clients.
# iperf3 server on Server 1 and Server 3
root@PMX1:~# iperf3 -s
root@PMX3:~# iperf3 -s
# Two parallel iperf3 clients on Server 2
root@PMX2:~# iperf3 -c 192.168.99.34 -t 3600
root@PMX2:~# iperf3 -c 192.168.99.36 -t 3600
# Check the network load of the Ceph bond bond1
root@PMX2:~# nload bond1
Ceph OSD Performance
Within Ceph, performance tests can be conducted at various levels. These tests are primarily useful during initial setup to verify correct configurations, but they can also be helpful for debugging performance issues.
ceph tell osd.x bench - Throughput
To test the performance of a single OSD drive, the tool tell osd bench can be used. The command ceph tell osd.X bench returns the performance data of the OSD drive with ID X.
root@PMX4:~# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.572842187,
"bytes_per_sec": 1874411222.4402215,
"iops": 446.89446030622042
}
The output includes the following information:
| Parameter | Explanation |
|---|---|
| bytes_written | The number of bytes written during the test |
| blocksize | The block size used during the test |
| elapsed_sec | The duration of the test in seconds |
| bytes_per_sec | The average write rate in bytes per second |
| iops | The average number of write operations per second (IOPS) |
The output shows that during the test, 1 GB (1,073,741,824 bytes) was written with a block size of 4 MB (4,194,304 bytes) in 0.57 seconds. This corresponds to an average write rate of 1,874,411,222 bytes (= 1.87 GB) per second and an average IOPS value of 446 (considering a block size of 4 MB).
ceph tell osd.x bench - IOPS
The command can, of course, also be run for IOPS with a block size of 4K:
root@PMX4:~# ceph tell osd.1 bench 12288000 4096
{
"bytes_written": 12288000,
"blocksize": 4096,
"elapsed_sec": 0.043518787000000003,
"bytes_per_sec": 282360811.20551449,
"iops": 68935.744923221311
}
Here, due to the block size of 4K, a higher IOPS value is achieved: 68935 IOPS.
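The reported values are internally consistent: the IOPS figure is simply the write rate divided by the block size. A quick sanity check using the numbers from the output above:

```shell
# Values copied from the ceph tell osd.1 bench output above
BYTES_PER_SEC=282360811
BLOCKSIZE=4096
IOPS=$(( BYTES_PER_SEC / BLOCKSIZE ))
echo "derived IOPS: ${IOPS}"
```

The same relationship holds for the 4 MB throughput test above (1,874,411,222 / 4,194,304 ≈ 446 IOPS).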
Ceph Pool Performance
Important: Performance tests at the Ceph pool level require multiple executions and parallelization to obtain a realistic total performance value. A single performance test never reflects the overall performance of the system. For example, on a 3-node Ceph cluster, we run the RADOS bench write test 6 to 8 times in parallel to fully utilize the Ceph network. Please keep this in mind when executing these commands.
A pool is a group of OSDs that collectively store and manage data. For an optimal pool, it is essential to select the correct number of PGs (Placement Groups). PGs are small groups within the OSDs responsible for data distribution and redundancy. Too few PGs can lead to poor data distribution, while too many PGs can place unnecessary strain on resources.
- Important: To accurately measure performance in a Ceph cluster, you should execute the following commands multiple times and in parallel. Using tmux is ideal for this. Running a command individually does not reflect the total system performance, as no parallelized accesses to Ceph occur.
- Excursus: PG Autoscaler: Using the PG Autoscaler, Ceph can automatically calculate the optimal number of PGs per pool, if desired. However, this is only recommended if you do not expect atypical data growth or data reduction in the system. This means you typically do not generate large amounts of data suddenly (hundreds of GB) or delete significant amounts of data.
- Why? Because with constant data growth or reduction, the PG Autoscaler would frequently adjust the optimal number of PGs, increasing or decreasing them. This is problematic because increasing or reducing PGs always triggers data redistribution within Ceph. Constant data rebalancing would degrade cluster performance and strain the drives (impacting TBW).
For typical medium-sized enterprises with normal data growth, we recommend using the PG Autoscaler for simplicity.
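If you size pg_num manually instead of using the autoscaler, a common rule of thumb is: target PGs ≈ (number of OSDs × 100) / replication factor, rounded up to the next power of two. A small sketch with assumed example values (12 OSDs, replication factor 3):

```shell
# Example values are assumptions; adjust to your cluster.
OSDS=12
SIZE=3
TARGET=$(( OSDS * 100 / SIZE ))   # 400 for this example
# Round up to the next power of two
PG=1
while [ "$PG" -lt "$TARGET" ]; do
    PG=$(( PG * 2 ))
done
echo "suggested pg_num: ${PG}"
```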
rados bench (write) - Throughput
To test the performance of a Ceph pool, the tool rados bench can be used. With rados bench, write operations can be executed on the pool, and the results are output in terms of IOPS and throughput.
rados bench -p vm_nvme 600 write -t 16 --object_size=4MB --no-cleanup
| Parameter | Explanation |
|---|---|
| -p vm_nvme | specifies the name of the pool where the test will run |
| 600 | sets the duration of the test in seconds (600 seconds equals 10 minutes) |
| write | specifies that the test performs write operations |
| -t 16 | sets the number of parallel threads to be used |
| --object_size=4MB | sets the object size used for the test |
| --no-cleanup | ensures that the test data created is not deleted after the test (this is needed for read operations) |
The results are output in MB per second, and the test is performed with 16 parallel threads.
Total time run:         600.017
Total writes made:      383501
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2556.6
Stddev Bandwidth:       1098.42
Max bandwidth (MB/sec): 4620
Min bandwidth (MB/sec): 1540
Average IOPS:           639
Stddev IOPS:            274.605
Max IOPS:               1155
Min IOPS:               385
Average Latency(s):     0.0250308
Stddev Latency(s):      0.0178465
Max latency(s):         0.15803
Min latency(s):         0.00460356
rados bench (read) - Throughput
Read operations can also be performed on the pool to test the read performance. The benchmark objects written with --no-cleanup serve as the data source, so the object size matches the previous write test (4 MB). The following command tests the sequential read throughput of a Ceph pool:
rados bench -p vm_nvme 60 seq -t 16
| Parameter | Function |
|---|---|
| -p vm_nvme | specifies the name of the pool where the test is executed. |
| 60 | sets the duration of the test in seconds (60 seconds equals 1 minute). |
| seq | specifies that the test performs sequential read operations. |
| -t 16 | sets the number of parallel threads to be used. |
This command runs a read throughput test on the pool vm_nvme, where sequential read operations are performed. The test runs for 60 seconds, uses 16 parallel threads, and outputs the results in MB per second.
Total time run:       60.0276
Total reads made:     53525
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   3566.69
Average IOPS:         891
Stddev IOPS:          20.0448
Max IOPS:             935
Min IOPS:             842
Average Latency(s):   0.0176248
Max latency(s):       0.0879089
Min latency(s):       0.00350174
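To compare several benchmark runs, it helps to extract the key figures from saved rados bench output. A small hypothetical helper sketch; here it is fed with the values from the output above instead of a live run (normally RESULT would come from something like rados bench ... | tee bench.log):

```shell
# Parse bandwidth and IOPS out of captured rados bench output.
# The sample text below is taken from the read test above.
RESULT="Bandwidth (MB/sec): 3566.69
Average IOPS: 891"
BW=$(printf '%s\n' "$RESULT" | awk -F': *' '/^Bandwidth/ {print $2}')
IOPS=$(printf '%s\n' "$RESULT" | awk -F': *' '/^Average IOPS/ {print $2}')
echo "bandwidth=${BW} MB/s, iops=${IOPS}"
```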
rados bench (write) - IOPS
tbd
rados bench (read) - IOPS
tbd
ceph osd latency
To measure individual OSD drives under Ceph, the command ceph osd perf can be used. This command provides detailed performance data for each OSD in the cluster. It displays the OSD ID and both commit and apply latency.
- Commit Latency: The time an OSD needs to write the journal to the drive.
- Apply Latency: The time an OSD needs to transfer a write operation to its local disk. This represents the time it takes for a write operation to be fully stored on the OSD.
High commit or apply latency can indicate that the OSD is overloaded and cannot write fast enough, impacting the performance of the entire Ceph cluster. Low commit and apply latency, on the other hand, indicate that the OSD is working correctly and the underlying drive is performing well.
Conclusion - Overview of Relevant Factors
In summary, the following factors influence Ceph performance:
- Network connection for the Ceph network (low latency, high bandwidth)
- Use of appropriate drives (flash, low latency, NVMe)
- Resource requirements: RAM and CPU
- Pool configuration (number of PGs, replication factor)
Careful monitoring and measurement of these factors can help to optimize the performance of the Ceph system. For example, we recommend using checkMK.
References
- [1] Intro to Ceph » hardware recommendations (docs.ceph.com/en/quincy)
- [2] Production Setup (croit.io)
- [3] Deploy Hyper-Converged Ceph Cluster (pve.proxmox.com/wiki)
Author: Jonas Sterr. Jonas Sterr has been working at Thomas-Krenn for several years. Originally employed as a trainee in technical support and later in hosting (formerly Filoo), Mr. Sterr now mainly deals with storage (SDS / Huawei / NetApp), virtualization (VMware, Proxmox, Hyper-V), and networking (switches, firewalls) in product management at Thomas-Krenn.AG in Freyung.
Translator: Alina Ranzinger. Alina has been working at Thomas-Krenn.AG since 2024. After her training as a multilingual business assistant, she joined Product Management as an assistant and is responsible for translating texts and organizing the department.