tyx's corner


Ceph BlueFS spillover and block.db resizing

I recently stumbled upon the following warning on my production Ceph cluster for an OSD on one of our machines:

root@ceph-8:~# ceph health detail
HEALTH_WARN 1 OSD(s) experiencing BlueFS spillover; 1 stray daemon(s) not managed by cephadm
[WRN] BLUEFS_SPILLOVER: 1 OSD(s) experiencing BlueFS spillover
     osd.211 spilled over 282 MiB metadata from 'db' device (7.0 GiB used of 8.7 GiB) to slow device

I didn’t find much information about it; most of what I found involved people playing with limits around the block.db size, but I follow the sizing recommendations and provision it at a bit more than 4% of the data device. That should be more than enough since we only use RBD for now, so I was a bit surprised to get this warning.
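For reference, the target is simply about 4% of the OSD's data device. Here is a rough sketch of how one could compute it for a single OSD; the path follows the standard BlueStore layout and the OSD ID is just an example:

# Hedged sketch: ~4% of the data device size as a block.db target.
block_bytes=$(blockdev --getsize64 /var/lib/ceph/osd/ceph-211/block)
echo "block.db target: $((block_bytes * 4 / 100 / 1024 / 1024 / 1024)) GiB"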

Upon closer inspection, I found that the block.db volumes were severely undersized (0.4% instead of 4%), thanks to a typo in a variable used by the Ansible playbook that created the logical volumes.
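Listing the db logical volumes makes the undersizing obvious; this is just how I would check it, with the VG name taken from my setup:

# List the db LVs and their sizes (VG name is from my setup, adjust to yours).
lvs --units g -o lv_name,lv_size,vg_name vg-ceph-db-0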

Since it was just a typo and the SSD drives were large enough to hold the 4%, I could simply resize the block.db logical volume. I wasn’t sure I could do that without reprovisioning the host entirely, but I tried it anyway, since the alternative was recreating the host from scratch.

So I shut down the OSD, used ceph-volume lvm list to find the corresponding block.db logical volume, then extended it:

lvextend -L +75528M vg-ceph-db-0/ceph-db-2
systemctl restart ceph-${cluster_id}@osd.211.service

The restart command reported a failure, but the daemon still booted normally. To make the warning go away, I just needed to “compact” the OSD: ceph tell osd.211 compact.
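To confirm the spillover was really gone (and that BlueFS had picked up the bigger db device), I could re-check the health status and look at the OSD's bluefs perf counters (db_total_bytes, db_used_bytes, slow_used_bytes). This is only a sketch: the admin socket lives inside the daemon's container under cephadm, so I go through the same cephadm shell pattern as later, and jq on the host is assumed:

ceph health detail
cephadm shell --fsid $cluster_id --name osd.211 -- \
        ceph daemon osd.211 perf dump | jq .bluefs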

As I got more OSDs to fix, I ran into many cases where compacting the OSD wasn’t enough to fix the spillover. I finally found this message on the ceph-users mailing list, where the solution is explained with ceph-bluestore-tool. I had tried it before, but it wouldn’t work outside the cephadm shell, presumably because of a version mismatch. Since the lvextend had already been done for every OSD, I wrote the following script to fix every OSD still spilling:

#!/usr/bin/env bash

# Cluster FSID, taken from the environment, and the list of OSDs still spilling.
cid="$cluster_id"
osds="209 212 214 216 218 219 221 222 223 224 226 227"

for osd in $osds
do
        # Stop the OSD so ceph-bluestore-tool can open its store exclusively.
        systemctl stop "ceph-${cid}@osd.${osd}.service"
        sleep 2
        # Migrate the BlueFS data that spilled onto the slow (block) device
        # back to the now-larger block.db device.
        cephadm shell --fsid "$cid" --name "osd.${osd}" -- ceph-bluestore-tool \
                bluefs-bdev-migrate \
                --path "/var/lib/ceph/osd/ceph-${osd}" \
                --devs-source "/var/lib/ceph/osd/ceph-${osd}/block" \
                --dev-target "/var/lib/ceph/osd/ceph-${osd}/block.db"
        sleep 2
        systemctl start "ceph-${cid}@osd.${osd}.service"
done
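I invoked it roughly like this on each host, passing the cluster FSID through the environment (ceph fsid prints it); the script name is just whatever I saved it as:

# Hypothetical invocation: the FSID ends up in $cluster_id inside the script.
cluster_id=$(ceph fsid) ./fix-spillover.sh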

After running that on the two impacted hosts (with the correct OSD IDs for each), all the warnings were gone!

Tags:
ceph lvm linux