Why not ZFS

Posted on
linux zfs sysadmin

ZFS is a combined filesystem and volume manager that has become quite popular recently, but it has some important and unexpected problems.

It has many good features, which are probably why it is used: snapshots (with send/receive support), checksumming, RAID of some kind (with scrubbing support), deduplication, compression, and encryption.

But ZFS also has a lot of downsides. It is not the only way to achieve those features on Linux, and there are better alternatives.

Terminology

In this post I will refer to the ZFS on Linux project as ZoL. It has since been renamed OpenZFS, because ZoL gained FreeBSD support and FreeBSD’s own in-tree ZFS driver was deprecated in favor of periodically syncing the out-of-tree ZoL code.

What is “scrubbing”? If a disk hits an unrecoverable read error (URE) when reading a sector, the sector can be repaired by rewriting its contents: the physical disk detects the rewrite over an unreadable sector and remaps it in firmware. The RAID layer can do this automatically by relying on its redundant copy. Scrubbing is the process of periodically, preemptively reading every sector to check for UREs and repair them early.
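For example, with Linux’s md RAID layer a scrub can be kicked off and monitored through sysfs; a minimal sketch, assuming an array at /dev/md0:

    # start a scrub: read every sector and repair any URE from the redundant copy
    echo check > /sys/block/md0/md/sync_action

    # watch progress, and see how many inconsistencies were found
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt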

Bad things about ZFS

Out-of-tree and will never be mainlined

Linux drivers are best maintained when they’re in the Linux kernel git repository along with all the other filesystem drivers. This is not possible because ZFS is under the CDDL license and Oracle are unlikely to relicense it, if they are even legally able to.

Therefore, just like how all proprietary software eventually finds a GPL implementation and then a BSD/MIT one, ZFS will eventually be superseded by a mainline solution, so don’t get too used to it.

As an out-of-tree, GPL-incompatible module, it is regularly broken by upstream Linux changes, such as when ZoL was discovered to be using GPL-only symbols, causing long periods of unavailability until a workaround could be found.

When compiled together, loaded, and running, the resulting kernel is a combined work of both GPL and CDDL code. It’s all open source, but your right to redistribute the work to others requires compliance with both the CDDL and the GPL, which can’t be satisfied simultaneously.

It’s still easy to install on Debian. They ship only ZoL’s source code plus a nice script that compiles everything on your own machine (zfs-dkms), so the combined work is technically never redistributed, which satisfies both licenses.
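On Debian this looks roughly like the following, assuming the contrib component is enabled in your APT sources:

    # DKMS builds the CDDL module locally; no combined work is redistributed
    apt install linux-headers-amd64 zfs-dkms zfsutils-linux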

Ubuntu ships ZFS as part of the kernel, not even as a separate loadable module. This redistribution of a combined CDDL/GPLv2 work is probably illegal.

Red Hat will not touch this with a bargepole.

You could at least consider trying the FUSE ZFS instead of the in-kernel one; as a userspace program it is definitely not a combined work.

Slow performance of encryption

ZoL worked around the Linux symbol issue above by disabling all use of SIMD for encryption, reducing its performance compared to an in-tree filesystem.

Rigid

To first clarify ZFS’s custom terminology (not counted as a really bad point, since even LVM2 is guilty of custom terminology here); a short example follows the list:

  • A “dataset” is a filesystem you can mount. It might be the main filesystem or perhaps a snapshot.
  • A “pool” is the top-level block device. It is a union span (call it RAID-0, stripe, or JBOD if you like) of all the vdevs in the pool.
  • A “vdev” is the 2nd-level block device. It can be a passthrough of a single real block device, or a RAID of multiple underlying block devices. RAID happens at the vdev layer.
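A minimal sketch of how these terms fit together (the pool and dataset names and disk paths are illustrative):

    # one pool named "tank" containing a single RAID-Z2 vdev of four disks
    zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

    # a dataset (mountable filesystem) inside the pool, and a snapshot of it
    zfs create tank/data
    zfs snapshot tank/data@monday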

This RAID-X0 structure (a stripe of mirrors or RAID-Z vdevs) is rigid; you can’t do 0X (a mirror of stripes) instead at all, and you can’t stack vdevs in any other configuration.

For argument’s sake, let’s assume most small installations would have a pool with only a single RAID-Z2 vdev.

Can’t add/remove disks to a RAID

You can’t shrink a RAIDZ vdev by removing disks, and you can’t grow a RAIDZ vdev by adding disks.

All you can do in ZFS is expand your pool by creating a whole second RAIDZ vdev and striping the pool across it, creating a RAID60; you can’t just have one big RAID6. This can badly hurt your storage efficiency.

(Just for comparison, mdadm lets you grow a RAID volume by adding disks since 2006 and shrink by removing disks since 2009.)
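A sketch of the mdadm equivalent, assuming a 4-disk RAID6 at /dev/md0 and a new disk /dev/sde (the reshape takes a long time, and you still have to grow the filesystem afterwards):

    # add a fifth disk and reshape the existing RAID6 onto it
    mdadm --add /dev/md0 /dev/sde
    mdadm --grow /dev/md0 --raid-devices=5

    # then enlarge the filesystem to use the new space, e.g. for ext4:
    resize2fs /dev/md0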

Growing a RAIDZ vdev by adding disks is at least coming soon. It is still a WIP as of August 2021 despite a breathless Ars Technica article about it in June.

There are several Ars Technica links in this blog post. I like Ars a lot and appreciate the Linux coverage, but as an influential outlet, why are they so bullish about ZFS? It turns out all their ZFS articles are written by the same author, who is also a moderator of /r/zfs and hangs out there a lot. At least he is highly informed on the topic.

RAIDZ is slow

For some reason ZFS’s file-level RAIDZ IOPS only scale per vdev, not per underlying device. A 10-disk RAIDZ2 has IOPS similar to a single disk.

(Just for comparison, mdadm’s block-level RAID6 will deliver more IOPS.)
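If you want to measure this yourself, a small random-read fio run against a file on the array is one way to compare IOPS between setups (the path, size, and queue depth are illustrative):

    # 4K random reads at queue depth 32; compare the reported IOPS figure
    fio --name=randread --directory=/mnt/array --rw=randread --bs=4k \
        --size=2G --ioengine=libaio --iodepth=32 --runtime=60 --time_based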

File-based RAID is slow

For operations such as resilvering, rebuilding, or scrubbing, a block-based RAID can work sequentially, whereas a file-based RAID has to perform a lot of random seeks. Sequential read/write is a far more performant workload for both HDDs and SSDs.

File-based RAID offers the promise of doing less work by not RAIDing the empty space, but in practice that benefit is significantly outweighed by this difference.

It’s especially bad if you have a lot of small files.

It is even worse on SMR drives, and Ars Technica blame the drives when they probably should have blamed ZFS’s RAID implementation.

(Just for comparison, mdadm works perfectly fine with these drives.)

Real-world performance is slow

Phoronix benchmarks of ext4 vs zfs in 2019 show that ZFS does win some synthetic benchmarks but badly loses all real-world tests to ext4.

Performance degrades faster with low free space

It’s recommended to keep a ZFS pool below 80-85% usage, even on SSDs. This means you have to buy bigger drives to get the same usable size compared to other filesystems.

At high utilization, most filesystems just take a little longer to find free blocks and new writes become somewhat more fragmented, but ZFS’s problem is on an entirely different level because it does not have a free-blocks bitmap at all.

(Just for comparison, ext4 and even NTFS perform well up to extremely high utilization.)
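You can check how close a pool is to that threshold with zpool list, which also reports ZFS’s own free-space fragmentation estimate (pool name illustrative):

    # CAP is the percentage used, FRAG is the free-space fragmentation metric
    zpool list -o name,size,capacity,fragmentation tank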

Layering violation of volume management

ZFS is both a filesystem and a volume manager.

However, its volume management is exclusive to its filesystem, and its filesystem is exclusive to its volume management.

If you use ZFS’s volume management, you can’t have it manage your other drives that use ext4, XFS, UFS, or NTFS filesystems. And likewise you can’t use ZFS’s filesystem with any other volume manager.

Therefore you’ll need to know both the standard mount/umount/fstab commands as well as an entirely separate set of zfs/zpool/zdb commands and never the two shall meet.
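For example, the same everyday task looks completely different on each side (the device, pool, and mount point names are illustrative):

    # traditional stack: one command and one fstab line cover any filesystem
    mount /dev/vg0/data /srv/data

    # ZFS: mounting is a property of the dataset, managed by its own tools
    zfs set mountpoint=/srv/data tank/data
    zfs mount tank/data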

On Linux you definitely can’t escape the traditional tools (e.g. your fstab is still used for swap), and likewise on FreeBSD (where UFS, managed with plain mount, is still the standard filesystem), even though ZFS is supposedly “better integrated” there.

This is understandable given that ZFS is trying to do file-level RAID, but as explained above, that performs badly and was probably a bad idea.

Despite being a copy-on-write (CoW) filesystem, it doesn’t support cp --reflink. Btrfs does. Even XFS does, despite being a traditional non-CoW filesystem.
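On btrfs or XFS a reflink copy is instant and shares the underlying blocks until they are modified; on ZFS, cp --reflink=always simply fails (and --reflink=auto falls back to a full copy). The file names below are illustrative:

    # instant CoW clone on btrfs/XFS; only changed blocks consume new space
    cp --reflink=always vm-disk.img vm-disk-clone.img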

High memory requirements for dedupe

For all intents and purposes, this online deduplication feature may as well not exist.

The RAM requirements are eye-wateringly high (e.g. 1GB per 1TB pool size may not be sufficient), because the deduplication table (DDT) is kept in memory and without this much RAM the performance degrades significantly.

Deduplication generally offers negligible savings unless perhaps you’re storing a lot of VM disk images of the same OS. There is a nice tool to estimate the dedupe savings (zdb -S), but the general recommendation is not to bother with ZFS dedupe unless you can save 16x storage (!!!), owing to the extreme performance impact of the feature.
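The estimator mentioned above simulates the DDT without actually turning deduplication on (pool name illustrative):

    # simulate dedup on an existing pool and print the expected dedup ratio
    zdb -S tank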

(By comparison lvmvdo has similarly bad performance but at least uses significantly less RAM.)

Dedupe is synchronous instead of asynchronous

This means that if deduplication is enabled, every single write operation has to undergo read/write IOPS amplification.

(By comparison, btrfs’s deduplication and Windows Server deduplication run as a background process, to reclaim space at off-peak times.)

High memory requirements for ARC

Linux has a unified caching system for file operations, block IO (bio) operations, and swap, called the page cache. ZFS is not allowed to use the Linux page cache at all, because such a deep part of Linux’s design can only be accessed via GPL symbols and the CDDL source code can’t rely on them.

Therefore ZoL implements its own cache, the Adaptive Replacement Cache (ARC) that constantly fights with the Linux page cache for memory.

The infighting problem used to be really bad, and although the heuristics have since been improved, the ARC alone could still end up using more than 17GB of RAM.

If you do literally anything else on the PC other than be a ZFS host (e.g. use a web browser, browse an SMB share…) then you are subject to this infighting.
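The usual mitigation is to hard-cap the ARC yourself via a module parameter; the 4GiB value below is just an example:

    # persistent cap, e.g. in /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=4294967296

    # or change it at runtime
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max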

If a program uses mmap, files are double-buffered in both the ARC and the Linux page cache.

Even on FreeBSD, where ZFS is supposedly better integrated, ZoL still pretends that every OS is Solaris via the Solaris Porting Layer (SPL) and doesn’t use that page cache either. This design decision makes it a bad citizen on every OS.

Even the FUSE ZFS has better page cache behaviour here (although it has lower performance in general).

Buggy

At the time of writing there are 387 open issues with the "Type: Defect" label on the ZoL GitHub, and the bulk of them seem to be genuinely important problems, such as logic bugs, panics, assertions, hangs, system crashes, kernel null pointer dereferences, and xfstests failures.

One good thing to say about the ZoL project is that the triage and categorization of these bugs is well organized.

No disk checking tool (fsck)

Yikes.

In ZFS you can use zpool clear -F (or zpool import -F) to discard the last few transactions and roll the pool back to its last good state, which is better than nothing.

One common excuse for the missing tool is that the CoW data structures are always consistent on disk; but the same is true of ext4 with its journalling, and ext4 still ships e2fsck. ZFS can still end up corrupt on disk for various reasons:

  • merely rolling back the last few transactions as above does not verify the deduplication table (DDT), and a corrupt DDT can leave every snapshot unmountable
  • coupled with the point above (“Buggy”): if ZFS writes bad data or bad metaslabs to the disk, that is a showstopper

and so it should have an fsck.zfs tool that does more repair steps than just exit 0.

Past complaints that are now fixed

  • no TRIM support (added in ZoL 0.8.0 in mid 2019)

Things to use instead

The baseline comparison should just be ext4. Maybe on mdadm.

Then if you want the other features, you can easily get them from either the block layer (if I convinced you file-level RAID was a bad idea), or from a filesystem (if you were not convinced), or because it’s Linux you can mix-and-match features from each as you like.

RAID is best done at the block layer with mdadm. LVM2 has its own wrapper for this (lvmraid), which is more nicely integrated, but using mdadm directly is more debuggable. Btrfs has a file-level RAID feature that is okay for RAID 0/1 but not for 5/6; better to stick with mdadm.
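A minimal mdadm sketch (device names illustrative):

    # four disks in a RAID6, with a plain ext4 filesystem on top
    mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0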

Encryption is best done at the block layer with LUKS (cryptsetup), placed under whatever filesystem you choose (btrfs has no native encryption of its own yet). It performs significantly better than ZFS owing to the aforementioned SIMD symbol workaround for Linux 5.0.
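A minimal LUKS sketch on top of the array created above (names illustrative):

    # encrypt the array, open it, and build the filesystem on the mapping
    cryptsetup luksFormat /dev/md0
    cryptsetup open /dev/md0 securedata
    mkfs.ext4 /dev/mapper/securedata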

Snapshots can be done with LVM2 thin pools, or by swapping ext4 for btrfs or (wildcard suggestion) NILFS2. LVM2 is the more performant approach. Support for send/receive is built into btrfs, and is easily available for LVM2 with a utility like lvmsync, lvm-thin-sendrcv, or thin-send-recv.
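A rough LVM2 thin-pool sketch (volume group name and sizes illustrative):

    # a thin pool, a thin volume inside it, and a snapshot of that volume
    lvcreate --type thin-pool -L 500G -n pool0 vg0
    lvcreate --thin -V 400G -n data vg0/pool0
    mkfs.ext4 /dev/vg0/data
    lvcreate -s -n data-snap-monday vg0/data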

Scrubbing simply needs to read every sector of the array so the RAID layer notices and repairs any URE. You can simply put cat /dev/array > /dev/null on cron once a month, which is enough for mdadm to notice and repair UREs on the blocks it reads (mdadm's built-in check action goes further and verifies the parity/mirror copies too).
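As a cron entry that could look like the following (array name and schedule illustrative):

    # /etc/cron.d/scrub-md0 - 3am on the 1st of every month
    0 3 1 * *  root  cat /dev/md0 > /dev/null
    # alternative: md's own check, which also reads the redundant copies
    # 0 3 1 * *  root  echo check > /sys/block/md0/md/sync_action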

Deduplication is usually not worthwhile - for VM disk images, it’s better to use differencing disks in your hypervisor, and for storing backups, it’s better to use a real deduplicating backup store like borg, restic, or kopia. But you can easily get this if you want: LVM2 has the new lvmvdo / kvdo target, and on btrfs you can run an off-peak daemon such as dduper or bees.
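As a sketch of the off-peak approach, a batch dedupe pass with duperemove (a btrfs dedupe tool of the same flavour as dduper/bees, named here only as an example) could be scheduled overnight:

    # hash files recursively and submit duplicate extents for deduplication
    duperemove -rdh /mnt/data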

Compression is usually not worthwhile - most files (e.g. Microsoft Office documents, which are already zipped XML, JPEGs, compiled binaries, …) do not benefit from compression, and the files that do (e.g. SQLite databases) are usually sparse for performance reasons. But you can easily get this if you want with lvmvdo, or as an option when you create or mount a btrfs filesystem.
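On btrfs that is just a mount option (device and mount point illustrative):

    # transparent zstd compression for everything written while mounted this way
    mount -o compress=zstd /dev/sdb1 /mnt/data
    # or persistently via /etc/fstab:
    # /dev/sdb1  /mnt/data  btrfs  compress=zstd  0  0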

Checksumming is usually not worthwhile - the physical disk already has CRC checksums at the SATA level, and if you are paranoid you should also have ECC RAM to prevent integrity issues in memory (this applies to ZFS too), and that should be enough. But you can easily get this if you want, either at the block layer with dm-integrity (integritysetup) below your filesystem, or btrfs does it automatically.
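A minimal dm-integrity sketch (device name illustrative); it sits between the disk and whatever you put on top:

    # add a per-sector checksum layer, then build the filesystem on the mapping
    integritysetup format /dev/sdb1
    integritysetup open /dev/sdb1 sdb1-int
    mkfs.ext4 /dev/mapper/sdb1-int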

Checksumming can also be done offline at the file level by running hashdeep / cksfv on cron to create a *.sfv file of all your file hashes. This would also be a replacement for the scrubbing process.
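With hashdeep that could look like the following (paths illustrative):

    # record hashes of everything under /srv/data ...
    hashdeep -r /srv/data > /var/lib/hashes/data.hashdeep
    # ... and later audit the tree against that list, reporting any mismatches
    hashdeep -r -a -k /var/lib/hashes/data.hashdeep /srv/data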

Summary

If you use upstream Linux features such as mdadm, LVM2, and/or btrfs instead of ZFS, you can achieve all the same nice advanced features, with the side benefit that it won’t break with any upstream kernel update; it’s legal; it’s faster; it works on SMR drives; it uses less RAM; it has a real repair tool; and it works better with other standard Linux features.

It might seem like there are more parts to set up, but all of these features need to be configured and enabled on ZFS too; it’s not really any simpler. ZFS also has a lot of tuning parameters to set.

In the future we’re waiting to see what stratis and bcachefs offer. For extremely large installations you should also consider doing the erasure-coding in userspace with Ceph or OpenStack Swift.

There are probably some situations where ZFS still makes sense and it’s interesting to learn about all solutions in this space. But in general I couldn’t recommend using it.