February 20, 2020

Storage Isn't Easy

Recently I posted about my plans for a mostly new build and how I planned to tackle storage. I've been reading a lot about filesystems, and as far as I'm concerned it really comes down to two modern contenders: BTRFS and ZFS.

The two of them have been hotly debated over the past decade, with vim vs emacs style camps emerging. The nano camp exists, but they've only really got a lean-to and half a stick to rub together for warmth; they can play with the EXT4 diehards.

BTRFS and ZFS have features like volume management, snapshots, copy-on-write, native compression and a lot more. Some of these things come at a cost but, as far as I can see, it's usually worth the price.

Both systems are comparable performance-wise so we won't delve into that too much. ZFS is supposedly a little faster but it's not a huge concern.


As mentioned, I did delve into the back of my shelves to check out what I had kicking around. Currently there are 6 SATA ports available, one of which is eSATA, giving me a total of 5 usable and one weird.

Gone are the two swap drives, and welcome in two 500GB 3.5" disks, one 320GB 2.5" drive, a 512GB 2.5" SSD and the 2TB 2.5" disk currently being used.


Better (Butter) FS is as hot, young and popular an underdog as you get when it comes to comparing filesystems. It's GPL compliant, universally supported (on Linux at least) and does everything listed above.

There's a single btrfs command with subcommands hanging off it, and zsh autocompletes them, so that's just nice. In comparison to its primary competitor it's way more flexible: we can give it various sized drives, swap up and down raid levels on a live FS, and add and remove filesystems as we go. It's been in the mainline kernel since 2.6.29.
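To put a rough number on what "various sized drives" means in practice, here's a quick sketch of the usual rule of thumb for a two-copy raid1 pool. The function name is made up, the choice of the four spinners as the pool is my assumption, and it ignores metadata overhead, so treat the result as an estimate rather than what `btrfs filesystem usage` will report.

```python
def raid1_usable_gb(disk_sizes_gb):
    """Estimate usable space of a two-copy (raid1) pool built from mixed-size disks.

    Hypothetical helper: a rule-of-thumb estimate, not something btrfs provides.
    """
    if len(disk_sizes_gb) < 2:
        raise ValueError("raid1 needs at least two devices")
    total = sum(disk_sizes_gb)
    largest = max(disk_sizes_gb)
    # Every block is stored on two different devices, so we can never use more
    # than half the raw space, and the largest disk can only be filled as far
    # as the remaining disks are able to mirror it.
    return min(total // 2, total - largest)


# Assumption: the pool is the four spinners, with the SSD kept out as a cache.
spinners_gb = [500, 500, 320, 2000]
print(raid1_usable_gb(spinners_gb))  # 1320
```

The 2TB disk is the limiting factor here: the other three disks only add up to 1320GB, so that's all it can mirror.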

However, we have an SSD and btrfs has no native SSD cache, so we'll need bcache. Setting up bcache is fairly easy but there's one downside: you need a clean FS. I can only guess as to why, but that's the way it is. If we want to use it, we'll need to do the data shuffle. This means setting up bcache, making a new filesystem on top of it and then rolling everything across. Then we can get some of the benefits of an SSD and have the size of spinners. Seems like a great plan. The alternative is just to roll the SSD into our pool and have it store data like everything else. With a sizable M.2 device coming in the next hardware upgrade this may make sense, but I've got a bit of a plan for that.
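For my own reference, the shuffle roughly boils down to two commands. The sketch below only builds them as lists and prints them; it deliberately doesn't run anything. The device paths are placeholders, the helper name is made up, and I'm assuming the new devices show up as /dev/bcache0, /dev/bcache1 and so on, which should be checked before formatting anything.

```python
def bcache_plan(backing_devs, cache_dev):
    """Build (but do not run) the commands for a fresh bcache + btrfs raid1 setup.

    Hypothetical helper; all device paths passed in are placeholders.
    """
    if len(backing_devs) < 2:
        raise ValueError("btrfs raid1 needs at least two backing devices")

    # Formatting the backing devices and the cache device in one make-bcache
    # call attaches them to each other, so no manual attach step is needed.
    make_bcache = ["make-bcache", "-B", *backing_devs, "-C", cache_dev]

    # Assumption: the kernel exposes the cached devices as /dev/bcacheN,
    # numbered in the same order as the backing devices.
    cached = [f"/dev/bcache{i}" for i in range(len(backing_devs))]

    # Mirror both data and metadata across the cached devices.
    mkfs = ["mkfs.btrfs", "-d", "raid1", "-m", "raid1", *cached]

    return [make_bcache, mkfs]


if __name__ == "__main__":
    for cmd in bcache_plan(["/dev/sdb", "/dev/sdc"], "/dev/sdd"):
        print(" ".join(cmd))
```

Both commands wipe whatever is on the devices, which is exactly why the data has to be moved somewhere else first.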

Finally, this isn't remarkably bad, but Docker uses btrfs subvolumes for its storage, and that clutters any output when listing subvolumes.
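The clutter is easy enough to filter out. Here's a small sketch that takes the text of `btrfs subvolume list` and drops anything living under Docker's btrfs directory. The sample listing is made up for illustration, and the path check assumes Docker is using its default data directory.

```python
# Assumption: Docker's btrfs driver keeps its subvolumes under this directory
# (the default data root); adjust if Docker is configured differently.
DOCKER_SUBVOL_DIR = "var/lib/docker/btrfs/subvolumes/"


def hide_docker_subvolumes(listing):
    """Return the lines of a `btrfs subvolume list` output minus Docker's subvolumes."""
    kept = []
    for line in listing.strip().splitlines():
        # Each line ends with "path <subvolume path>".
        _, found, path = line.partition(" path ")
        if found and DOCKER_SUBVOL_DIR in path:
            continue  # one of Docker's layer or container subvolumes
        kept.append(line)
    return kept


# Made-up sample output, just to show the shape of the data.
SAMPLE = """\
ID 256 gen 120 top level 5 path home
ID 257 gen 121 top level 5 path var/lib/docker
ID 300 gen 119 top level 257 path var/lib/docker/btrfs/subvolumes/abc123
"""

for line in hide_docker_subvolumes(SAMPLE):
    print(line)
```

The var/lib/docker subvolume itself stays in the list; only the ones Docker creates underneath it are hidden.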

The red line

The write hole: raid5 and raid6 setups can become corrupt in power failure events. My building has infrequent (once or twice a year) blackouts, which could make this risky. However, supposedly if we raid1 the metadata we can still raid5 the data.
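Splitting the profiles like that is done with a balance, which also covers the "swap raid levels on a live FS" trick. As a sketch, this builds the command without running it; the helper name and the mount point are made up, and whether this split is actually safe enough is still an open question for me.

```python
# Profile names btrfs accepts for conversion (a subset is enough for this sketch).
KNOWN_PROFILES = {"single", "dup", "raid0", "raid1", "raid10", "raid5", "raid6"}


def balance_convert_cmd(mount_point, data_profile, metadata_profile):
    """Build (but do not run) a balance that converts data and metadata profiles.

    Hypothetical helper; the mount point is whatever the pool is mounted at.
    """
    for profile in (data_profile, metadata_profile):
        if profile not in KNOWN_PROFILES:
            raise ValueError(f"unknown profile: {profile}")
    return [
        "btrfs", "balance", "start",
        f"-dconvert={data_profile}",      # how file data is spread across disks
        f"-mconvert={metadata_profile}",  # how filesystem metadata is stored
        mount_point,
    ]


# raid5 for the data, raid1 for the metadata, on a placeholder mount point.
print(" ".join(balance_convert_cmd("/mnt/pool", "raid5", "raid1")))
```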


ZFS is really going to be the opposite of everything listed above, as otherwise the feature set is pretty comparable. ZFS is old and stable, it has a lot of support, and for established enterprise systems it's probably the way to go.

It's reportedly faster than btrfs and has native support for SSD caching using L2ARC.

However, using ZFS on Linux means waiting for the ZFS team to push out a release that supports each new kernel. Additionally, there are some issues related to licensing.

The red line

Lack of flexibility is a big killer for me. I know I'm going to be swapping out hard drives of different sizes, so from what I've read ZFS might not be the way to go.



I built a ceph cluster a while back and I love the idea of cross-network filesystems that are device independent. The idea of having a single shared storage across all devices, which utilises their hard drive space and can be infinitely expanded, all sounds great, but computationally this is a little expensive. I want to use my desktop for some heavy work and fear that ceph could just get in the way. Additionally, my devices power up and down fairly frequently, which could mean a lot of network utilisation as ceph automatically rebalances. I also use mobile networks frequently, and having an OSD on a device outside the internal network would eat mobile data. Plus everything would need internal static IP addresses. I could use nebula or something of that ilk, but realistically this setup would be better implemented when I get a rack and a few servers; without these it's a mess.

I do have an externally accessible location that I typically mount to /filestore using sshfs. It works fairly well, and means that for ceph the costs outweigh the benefits.


Generally more performant than btrfs, it lacks some of the capabilities I intend to use, snapshots and compression specifically. These features are planned but not currently present.

What I'm actually doing

Simply, everything is going to be btrfs. / is going onto the M.2, and I'll make two subvolumes for home and var/lib/docker, which will be fronted by the SSD, which will be bcachified. I haven't quite figured out how I'm going to shuffle all my data around to get it into the newly bcached filesystem. I want to raid1 the backing fs, and if we're very lucky this should result in a filesystem I can really get bedded into. I've had a habit of scrapping them every few years, and over the past few days I've been doing a lot of digital archaeology. I'm getting quite attached to my arch install and I'm hoping I can build something stable which I can support in the long term.

Time will tell.