As announced in the previous post, I rebooted the blog in order to write about our experiences with OpenStack and Ceph. For the first topic I picked Ceph cache tiering, as we have recently struggled quite a bit to keep it up and running ;)
We have a lot of short-lived VMs which are used for automated tests. Those tests are quite IO heavy, so we decided to put an NVMe flash based cache tier in front of our regular Ceph pool used for VMs. I’ll post a more detailed overview of our Ceph and OpenStack deployment (and how we arrived there) in future posts, so for now we’ll dive directly into the oddities of Ceph caches.
The cache parameters
According to the Ceph documentation the main cache configuration parameters are
target_max_bytes, target_max_objects, cache_target_dirty_ratio, cache_target_full_ratio, cache_min_flush_age, cache_min_evict_age
and with newer versions (we’re currently on the Hammer release) there is one more: cache_target_dirty_high_ratio
Let’s go over them and explain what they do (Disclaimer: I’m not a Ceph developer nor have I read the source code, so take the statements with a grain of salt. In case you spot an error let me know and I’ll update the post ASAP). In general the idea of a cache is that objects which are read are promoted from the backing pool to the cache pool (in our case these are the objects the rados block device layer uses to build disk images for VMs). When an object is written or newly created, it is first created in the cache pool. It will later be written to the backing pool, hopefully asynchronously and without impacting client IO. This brings us to the relevant terminology.
Flushing vs. Evicting
This was not totally clear to me when reading the documentation: flushing is the process of writing a dirty object (an object which was modified in the cache) down to the backing pool. Flushing does not evict the object from the cache layer. It remains there as a clean object.
Removing an object from the cache pool is called evicting. Obviously an object needs to be clean (aka flushed) before it can be evicted. In case a dirty object is subject to eviction it will be flushed first.
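To make the distinction concrete, here is a toy model of the two operations. It is only a sketch of the terminology described above, not of how Ceph implements its tiering agent; the class and method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    name: str
    dirty: bool = True  # a freshly written object starts out dirty


class ToyCacheTier:
    """Toy model of flush vs. evict; NOT Ceph's actual implementation."""

    def __init__(self):
        self.cache = {}       # objects currently held in the cache pool
        self.backing = set()  # names of objects present in the backing pool

    def write(self, name):
        # Writes land in the cache pool first and mark the object dirty.
        self.cache[name] = CachedObject(name, dirty=True)

    def flush(self, name):
        # Flushing writes the object down; it stays cached, now clean.
        self.backing.add(name)
        self.cache[name].dirty = False

    def evict(self, name):
        # A dirty object has to be flushed before it can be removed.
        if self.cache[name].dirty:
            self.flush(name)
        del self.cache[name]
```

After flush() the object is in both pools; only evict() frees space in the cache pool.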
Now that we have the terminology we can go over the parameters to tune our cache the way we want.
First things first. Probably the most important parameter to set is target_max_bytes. To quote the latest documentation:
Ceph is not able to determine the size of a cache pool automatically, so the configuration on the absolute size is required here, otherwise the flush/evict will not work.
This has been added to the documentation recently and I cannot overstate how important it is. Without setting a maximum, neither flushing nor evicting will ever happen. If your cache pool has less disk space available than your backing pool (which should be the normal case, otherwise why have a cache pool at all?), your cache pool will eventually run out of disk space. Running full is not something Ceph handles particularly well.
target_max_bytes is the number of bytes for CLIENT objects. So if you have some replication on the cache pool, make sure to adjust accordingly (e.g. with size = 3, divide the available disk space by 3).
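The arithmetic can be sketched as follows. The 20% headroom default and the pool name vm-cache are assumptions for illustration only; the helper names are made up. ceph osd pool set is the regular command for setting pool parameters.

```python
def target_max_bytes(raw_capacity_bytes, replica_size, headroom_percent=20):
    """Client-visible byte limit for a replicated cache pool.

    raw_capacity_bytes: total disk space of the cache OSDs
    replica_size:       the pool's replication factor (its "size")
    headroom_percent:   share of raw space kept free (assumed default)
    """
    usable = raw_capacity_bytes * (100 - headroom_percent)
    return usable // (100 * replica_size)


def pool_set_command(pool, key, value):
    # Builds the command line only; it does not run anything.
    return f"ceph osd pool set {pool} {key} {value}"


# Example: 3 TiB of raw flash, size = 3, keep 20% free.
limit = target_max_bytes(3 * 2**40, replica_size=3)
print(pool_set_command("vm-cache", "target_max_bytes", limit))
```

Integer arithmetic is used on purpose so the result is an exact byte count.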
Once you have established an upper limit, the next relevant parameters are cache_target_dirty_ratio and cache_target_full_ratio. Those parameters determine (as a fraction of the configured maximum) when the cache tiering agent will start to flush (dirty_ratio) and when to evict (full_ratio) objects from the cache. If you have set max_bytes to a limit that still leaves some headroom for Ceph to operate, you probably want to set full_ratio to 0.9 or even 1. Evicting happens in a least recently used (LRU) fashion.
The dirty_ratio parameter is trickier and depends on your workload. If you have (like we do) a lot of very short-lived VMs, you can set it to a reasonably high number as well (e.g. 0.7, which, if you’ve set full_ratio to 0.9, leaves 20% for clean objects in the pool).
cache_target_dirty_ratio is an absolute value (relative to target_max_bytes) and not relative to the full_ratio.
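A small worked example of how the two ratios translate into byte thresholds, assuming (as described above) that both are taken relative to target_max_bytes. The function name and the returned keys are made up for illustration.

```python
def cache_thresholds(target_max_bytes, dirty_ratio, full_ratio):
    """Byte marks at which flushing and evicting are expected to start."""
    flush_at = round(target_max_bytes * dirty_ratio)
    evict_at = round(target_max_bytes * full_ratio)
    return {
        "flush_at": flush_at,               # dirty bytes that trigger flushing
        "evict_at": evict_at,               # total bytes that trigger evicting
        "clean_room": evict_at - flush_at,  # space left for clean objects
    }


# With a 1 TiB limit, dirty_ratio 0.7 and full_ratio 0.9:
marks = cache_thresholds(2**40, 0.7, 0.9)
print(marks)
```

The clean_room value corresponds to the 20% mentioned above: it is the gap between the two absolute marks, not a fraction of full_ratio.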
As flushing has a severe impact on client IO, newer Ceph versions (since Infernalis) have the cache_target_dirty_high_ratio parameter: reaching the dirty_ratio triggers flushing at a reduced rate, and only once the high_ratio is reached does flushing run at full speed.
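The two-stage behaviour can be modelled roughly like this. It is a sketch of the description above only: the mode names are made up, and whether Ceph compares with "greater than" or "greater or equal" at the boundaries is an assumption.

```python
def flush_mode(dirty_bytes, target_max_bytes, dirty_ratio, dirty_high_ratio):
    """Classify the expected flushing behaviour for a given dirty amount."""
    dirty_fraction = dirty_bytes / target_max_bytes
    if dirty_fraction >= dirty_high_ratio:
        return "high"  # flush at full speed
    if dirty_fraction >= dirty_ratio:
        return "low"   # flush at a reduced rate
    return "idle"      # below dirty_ratio, nothing to do


for dirty in (100, 500, 700):
    print(dirty, flush_mode(dirty, 1000, dirty_ratio=0.4, dirty_high_ratio=0.6))
```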
When to flush / evict
Last but not least we have cache_min_flush_age and cache_min_evict_age, both set in seconds. Those are the minimum times an object needs to be in the cache before it is considered for flushing or evicting. As with the dirty_ratio, these parameters heavily depend on your workload. Make sure that when your cluster reaches its dirty_ratio or full_ratio you have enough objects which are “old” enough to be flushed or evicted.
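To get a feeling for the last point, the following sketch checks how many objects would be old enough at a given moment. The timestamps stand for "time the object entered or was last modified in the cache", which is an assumption about what Ceph measures; names and values are illustrative.

```python
def old_enough(object_times, now, min_age_seconds):
    """Return the names of objects whose age is at least min_age_seconds."""
    return [
        name
        for name, timestamp in object_times.items()
        if now - timestamp >= min_age_seconds
    ]


# Hypothetical objects with the second at which they entered the cache.
objects = {"obj-a": 0, "obj-b": 500, "obj-c": 590}
print(old_enough(objects, now=600, min_age_seconds=60))
```

If this list is empty at the moment the dirty_ratio or full_ratio is hit, the agent has nothing it is allowed to flush or evict, which is exactly the situation to avoid.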
Operating a cache and some internals
Usually when operating a Ceph cluster we try to stay away from ever letting it become full. This is different on a cache pool so some special care regarding monitoring has to be taken.
Keep an eye on the actual (OSD) disk space being used. Make sure to stay away from the magical 80% mark Ceph recommends. Especially when changing the cache_target_full_ratio setting, make sure you don’t run out of space.
Another helpful metric is the number of objects and dirty objects in your cache pool. ceph df detail will give you a basic overview of those values.
Note: On regular pools all objects are counted as “dirty” so no worries there ;)
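For monitoring, the same numbers can be read from the JSON output (ceph df detail --format json). The sketch below assumes the output has a top-level "pools" list whose entries carry "name" and a "stats" dict with "objects" and "dirty" fields; please verify the field names against your Ceph version. The sample document is made up and only mimics that assumed shape.

```python
import json


def dirty_summary(df_detail_json, pool_name):
    """Return (objects, dirty, dirty_fraction) for one pool."""
    report = json.loads(df_detail_json)
    for pool in report["pools"]:
        if pool["name"] == pool_name:
            stats = pool["stats"]
            objects = stats["objects"]
            dirty = stats["dirty"]
            fraction = dirty / objects if objects else 0.0
            return objects, dirty, fraction
    raise KeyError(pool_name)


# Made-up sample in the assumed format, not real cluster output.
sample = json.dumps(
    {"pools": [{"name": "vm-cache", "stats": {"objects": 1000, "dirty": 250}}]}
)
print(dirty_summary(sample, "vm-cache"))
```

Note that this fraction is counted in objects, while cache_target_dirty_ratio is relative to target_max_bytes, so the two are related but not identical.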
Internally Ceph uses 0-byte files on both cache and backing pools to indicate special cases. A 0-byte file in the cache indicates that this object has been deleted and in case of a flush the object will be deleted in the backing pool. A 0-byte file in the backing pool indicates that this is an object created in the cache pool but has not yet been flushed to the backing one. With this in mind it also makes sense to track the number of 0-byte files in the monitoring (see next chapter).
At least with Hammer (0.94.x, the version we’re currently using) it looks like there is an issue when target_max_objects is not set. Objects (aka files on disk) are not removed when the object is deleted and no longer referenced. Hence we ended up with a lot of 0-byte files (around 110 million of them) taking up nearly one terabyte of disk space (yes, even 0-byte files need inode entries in the file system, and 100 million of them need a lot of inodes ;) ). Setting target_max_objects seemed to trigger evicting of those files, but we will now keep a very close eye on the number of 0-byte files.
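Counting the 0-byte files on an OSD can be done with a simple directory walk. This is a minimal sketch; the example path assumes a FileStore OSD with the default data directory layout, so adjust it to your setup. With tens of millions of files the walk itself is slow and causes IO, so it is better run rarely and off-peak.

```python
import os


def count_zero_byte_files(root):
    """Return (zero_byte_files, total_files) below the directory root."""
    empty = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished while we were scanning
            total += 1
            if size == 0:
                empty += 1
    return empty, total


if __name__ == "__main__":
    # Assumed default FileStore location of OSD 0; change as needed.
    print(count_zero_byte_files("/var/lib/ceph/osd/ceph-0/current"))
```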
A special thank you goes to Christian Balzer and Burkhard Linke on the ceph-users mailing list, who were both tremendously helpful. Christian in particular has posted a similar writeup on the mailing list, see Cache tier operation clarifications and the follow-up messages.
If you have any additions or more important corrections to the above, please let me know through eMail or Twitter.