On BTRFS

There’s been a lot of discussion on this proposed Fedora change for Workstation to use BTRFS.

First off, some background: I reprovision my workstation about every 2-3 months to avoid it becoming too much of a "pet". I took the opportunity for this reprovision to try out BTRFS again (it’d been years).

Executive summary

BTRFS should be an option, even an emphasized one. It probably shouldn’t be the default for Workstation, and shouldn’t be a default beyond that for server use cases (e.g. Fedora CoreOS).

Why are there multiple Linux filesystems?

There are multiple filesystems in the Linux kernel for good reasons. It’s basically impossible to optimize for all use cases at once, and there are fundamental tradeoffs to make. BTRFS in particular has a lot of features…and those features have costs. Not every use case needs those features, and the costs can be close to prohibitive for things like databases.

BTRFS is good for "pet" systems

There is this terminology in the industry of pets vs cattle – I once saw a talk that proposed "elephants vs ants" instead which is more appealing. Lately I tend to use "disposable" or "reprovisionable" for the second term.

I mentioned above I reprovision my workstation periodically, but it’s still somewhat of a "pet". I don’t have everything in config management yet (and probably never will); I change things often enough that it’s hard to commit to 100% discipline to record every change in git instead of just running a CLI or writing a file. But I have all the important stuff. (And I take backups of data separately of course.)

For people who don’t have much in configuration management – the server or desktop system that has years of individually built up changes (whether from people doing things manually over ssh or interactively via a GUI like Cockpit) – being able to take a filesystem snapshot of things is an extremely compelling feature.

Another great BTRFS-style use case is storing data like your photos on local drives instead of uploading them to the cloud, etc.

The BTRFS cost

Those features, though, come at a cost. And this goes back to the "pets" vs "disposable" systems and where the "source of truth" is. For users managing disposable systems, the source of truth isn’t the Unix filesystem – it’s most likely a form of GitOps. Or take the case of Kubernetes – it’s a cluster with the primary source being etcd.

And of course people are using storage systems like PostgreSQL or Ceph for data, or an object storage system.

The important thing to see here is that in these cases, the "source of truth" isn’t a single computer (a single Unix filesystem) – it’s a distributed cluster.

For all these databases, performance is absolutely critical. They don’t need the underlying filesystem to do much other than pass through writes to disk, because they are already managing things like duplication/checksumming/consistency at a higher level.

As most BTRFS users know (or have discovered the hard way) you really need to use nodatacow for these – effectively "turning off" a lot of BTRFS features.
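A minimal sketch of the per-directory variant, assuming a BTRFS filesystem and the standard `chattr` tool. The helper name `make_nocow_dir` and the example path are hypothetical. The `C` attribute only takes effect for new or empty files, so it is set on an empty directory that new files then inherit from; existing database files would have to be copied in afterwards.

```shell
# make_nocow_dir DIR: create DIR and try to mark it No_COW (nodatacow).
# New files created inside DIR inherit the attribute; files that already
# contain data are not converted.
make_nocow_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    if chattr +C "$dir" 2>/dev/null; then
        echo "nodatacow set on $dir"
    else
        # Expected on filesystems other than BTRFS, or without chattr.
        echo "could not set nodatacow on $dir (not BTRFS?)" >&2
    fi
}

# Usage (not run here; the path is only an illustration):
#   make_nocow_dir /var/lib/mydb
#   lsattr -d /var/lib/mydb    # a 'C' in the flags column confirms it
```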

Another example is virtual machine images, which is an interesting one because the "pet" vs "disposable" discussion here becomes recursive – is the VM a pet or disposable, etc.

Not worth paying for reprovisionable systems

For people who manage "reprovisionable" systems, there’s usually not much value using BTRFS for things like operating system data or /etc (they can just blow it away and reprovision), and a clear cost where they need to either use nodatacow on the things that do matter (losing a lot of the BTRFS features for that data), or explicitly use e.g. xfs/ext4 for them, going back into a world of managing "mixed" storage.

In particular, I would strongly argue against defaulting to BTRFS for Fedora CoreOS because we are explicitly pushing people away from creating these types of "pet" systems.

To say this another way, I’ve seen some Internet discussion about this read the proposed change as applying beyond Fedora Workstation, and that’s wrong.

But if you e.g. want to use BTRFS anyways for Fedora CoreOS (perhaps using a separate subvolume for /var, where persistent container data is stored, that would be mounted with nodatacow for things like etcd), that could make sense! We are quite close to finishing root filesystem reprovisioning in Ignition.

But a great option if you know you want/need it!

As I mentioned above, my workstation (FWIW a customized Silverblue-style system) seems like a nearly ideal use case for BTRFS. I’m not alone in that! I’m likely going to roll with it for a few months until the next reprovisioning time unless I hit some stumbling blocks.

However, I am already noticing the Firefox UI periodically lock up for seconds at a time, which wasn’t happening before. Since I happen to know Firefox uses SQLite (which like the other databases mentioned above, conflicts with btrfs), I tried this and yep:

walters@toolbox> find ~/.mozilla/ -type f -exec filefrag {} \; | grep -Ee '[0-9][0-9][0-9]+ extents found'
firefox/xxxx.default-release/storage/.../xxxx.sqlite: 1825 extents found
...

And that’s only a few days old! (I didn’t definitively tie the UI lockups to that, but I wouldn’t be surprised. I’d also hope Firefox isn’t writing to the database on the main thread, but I’m sure it’s hard for the UI to avoid blocking on some queries).
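The one-liner above only filters on extent counts with three or more digits. A small sketch that ranks files by extent count instead; `top_fragmented` is a hypothetical helper name, and the parsing assumes filefrag's usual `path: N extents found` summary line.

```shell
# top_fragmented MIN: read filefrag summary lines on stdin and print
# "count<TAB>path" for files with at least MIN extents, worst first.
# Assumes lines of the form "path: N extents found" (or "1 extent found").
top_fragmented() {
    awk -v min="${1:-100}" '
        match($0, /: [0-9]+ extents? found$/) {
            # Adding 0 converts the leading digits to a number.
            count = substr($0, RSTART + 2) + 0
            if (count >= min)
                printf "%d\t%s\n", count, substr($0, 1, RSTART - 1)
        }
    ' | sort -rn
}

# Usage (not run here):
#   find ~/.mozilla/ -type f -exec filefrag {} + | top_fragmented 100
```

Running the same pipeline on an ext4 or XFS home directory gives a baseline to compare against.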

I just found this Stack Overflow post with some useful tips around manually or automatically defragmenting but…it’s really difficult to say that all Fedora/Firefox users should need to discover this and make the difficult choice of whether they want BTRFS features or performance for individual files after the fact. Firefox upstream probably can’t unilaterally set the nodatacow option on their databases because some users might reasonably want consistent snapshots for their home directory. A lot of others though might use a separate backup system (or Firefox Sync) and much prefer performance, because they can just restore their browser state like bookmarks/history from backup if need be.
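For the manual route, a hedged sketch: `btrfs filesystem defragment` is the real subcommand, but the helper name `defrag_sqlite` and the choice to target only `*.sqlite` files are assumptions made for illustration. Defragmenting can also unshare extents that snapshots or reflink copies were sharing, so it may increase disk usage.

```shell
# defrag_sqlite DIR: defragment every *.sqlite file below DIR.
# Assumes DIR is on BTRFS and the application owning the databases
# (e.g. Firefox) is not running.
defrag_sqlite() {
    find "$1" -type f -name '*.sqlite' -print |
    while IFS= read -r f; do
        btrfs filesystem defragment "$f"
    done
}

# Usage (not run here):
#   defrag_sqlite ~/.mozilla
```

The automatic alternative is the `autodefrag` mount option, which applies to the whole filesystem rather than to selected files.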

Random other aside: sqlite performance and f2fs

In a tangentially related "Linux filesystems are optimized for different things" thread, the f2fs filesystem mostly used by Android (AFAIK) has special APIs designed specifically for SQLite, because SQLite is so important to Android.

Conclusion

All Fedora variants are generic to a degree; I don’t think there will ever be just one Linux filesystem that’s the only sane choice. It makes total sense to have BTRFS as a prominent option for people creating desktops (and laptops and to a lesser degree servers).

The default however is an extremely consequential decision. It implies many years of dealing with the choice in later bug reports, etc. It really requires a true commitment to that choice for the long term.

I’m not sure it makes sense to push even Linux workstation users towards a system that’s more "pet" oriented by default. How people create disposable systems (particularly for workstations) is a complex topic with a lot of tradeoffs; I’d love for the Fedora community to have more blog entries about this in the Magazine. One of those solutions might be using a BTRFS root with send/receive to a USB drive for backups, for example!
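A rough sketch of what such a send/receive backup could look like, assuming the USB drive is itself formatted with BTRFS (which `btrfs receive` requires). The helper name `backup_subvol` and all paths are hypothetical, not Fedora defaults.

```shell
# backup_subvol SRC SNAPDIR DEST [NAME]
#   SRC      subvolume to back up (e.g. /home)
#   SNAPDIR  directory on the same BTRFS filesystem that holds snapshots
#   DEST     mount point of the BTRFS-formatted USB drive
#   NAME     snapshot name; defaults to today's date
backup_subvol() {
    src=$1 snapdir=$2 dest=$3
    name=${4:-$(date +%Y-%m-%d)}
    # send only works on read-only snapshots, hence -r
    btrfs subvolume snapshot -r "$src" "$snapdir/$name" || return 1
    # Stream the snapshot to the filesystem on the USB drive
    btrfs send "$snapdir/$name" | btrfs receive "$dest"
}

# Usage (as root, not run here; paths are illustrative):
#   backup_subvol /home /snapshots /mnt/usb
```

Later runs can pass an earlier snapshot that exists on both sides to `btrfs send -p` so that only the differences are transferred.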

But others would be about the things I and others do to manage "disposable" systems: managing data in /home in git, using image systems like rpm-ostree for the base OS to replicate well known state instead of letting their package database be a "pet", storing development environment as a container image etc. Those work on any Unix filesystem without imposing any runtime cost. And that’s what I think most people provisioning new systems in 2020 should be doing.


19 comments

  1. The same argument could be made about crap Flatpak: not many like it or will use it. ext4 was great in its time; BTRFS has its advantages over a dying filesystem. Every filesystem gets fixes with each kernel release. No one is forcing you to use Fedora, Colin. Use CentOS or use another filesystem in Fedora.

    • There’s not much constructive I see in your comment here… but I approved it anyways since it’s not outright spam at least!

      As far as “use CentOS” – I am pretty involved in Fedora and what it is (as a maintainer of e.g. Fedora CoreOS and several packages), and appreciate easily tracking newer upstream versions of a lot of things for my workstation as a general rule, and helping to build and sustain the upstream for RHEL and other linked FOSS communities.

      For “use another filesystem” – sure, the same argument is easily made the other way, but the impact here goes beyond just me. I raised a number of technical points in my post that I am hoping merit more of a constructive response and debate from others!

  2. The benefits for workstations with btrfs are enormous, they’re just perhaps different from what _you_ personally want.

    For example, reflinks make local copies instantaneous. Send/receive makes backups magic. Data integrity features make us safer. Snapshots probably help immensely, even though I don’t use them on my workstation. And this helps get an important filesystem into more hands, which might help it get more mature!

    The most important thing is that new FS features can enable software developers to come up with new things to build. So as a result of this being the default in Fedora, it might encourage some hacker to build a revolutionary (or at least decent) new feature in somewhere. It might make GNOME better in some way.

    Of course there are downsides. LUKS+Btrfs is very buggy in Anaconda. (Isn’t anaconda always buggy though?) But hopefully making this the default will help improve all of this.

    I’ve been running btrfs as a “pet” workstation for over 6 years I think, and I’ve had no issues.

    I know that you know most of this, but since your article makes the opposite case, I figured I’d offer my reasoning why I support it being the default.

    Cheers!
    James

    • “Data integrity features make us safer.”

      Sorry, but that’s not how things are really working.
      Data integrity is useful only when you have a RAID configuration, because Btrfs/ZFS/MD-RAID are able to keep data and metadata consistency by constantly analyzing the blocks within the replicas.
      In a single disk configuration (the majority of desktop/workstation configurations are just like this) data integrity is almost non-existent. I said “almost” because Btrfs would be still able to identify a data corruption, yet it couldn’t do anything to prevent or fix it since there is no replica to use to fix corrupted blocks. So in the end, in this scenario it is just like using ext4 or XFS.

      • It cannot *fix* data corruption automatically without raid, but it does prevent it from getting into your backups by returning EIO rather than bad data. With ext4, you’ll just wind up overwriting your good backups with corrupted backups. With btrfs, that won’t happen. The data integrity feature admittedly does you no good if you don’t keep any backups at all, but the lack of a default backup tool in Fedora is a separate problem we need to work on. Ubuntu nags if you don’t have backups enabled in Deja Dup; we aren’t going to use Deja Dup, but should put together something similar to protect our users’ data.

  3. “However, I am already noticing the Firefox UI periodically lock up for seconds at a time, which wasn’t happening before.”

    Ah so I’m not alone with that one. Sadly also happens to me, on a non-BTRFS system. Don’t think that’s related to the FS per se.

    • Same here, I started having those Firefox lockups right after installing Fedora 32, this is a thinkpad laptop with default EXT4. So definitively not btrfs related (which doesn’t mean it cannot be used to illustrate the author’s point).

      Btw, if anyone knows where the bug is filed, I’d like to subscribe to it.

  4. Hi Colin,
    thank you for sharing your point of view.

    That said, to be honest I think the BTRFS/ZFS approach is overkill in a traditional server environment for ROOT/DATA filesystems.
    You said (sorry I don’t know how to quote here 😦 ):
    “For people who don’t have much in configuration management – the server or desktop system that has years of individually built up changes (whether from people doing things manually over ssh or interactively via a GUI like Cockpit, being able to take a filesystem snapshot of things is an extremely compelling feature.”

    But do we really need it? If we’re talking about servers, they are most probably running on a hypervisor nowadays.
    If we need periodic snapshots, wouldn’t it be better to perform those VM-side, in order to have them separated from the same disk of the OS? Yes, you could technically mount a different subvolume/disk and perform a btrfs sub snap / /path/of/dest but I still think that performing it VM-side is more useful, since we’re going to do that *anyway* for patching purposes.
    The same could be said about physical servers: who cares about btrfs snapshots when data are stored in LUNs with more reliable storage-based replications? Also, even there, root file systems are backed up by third party backup solutions (we’re using Acronis) or, if they are booting from SAN, they are replicated storage-side.
    I would never trust a file system as much as I can trust my storage infrastructure.

    For workstation purposes, I think that Fedora/Red Hat CoreOS approach (OSTree+pivot) is the way to follow. When you have regular backups for your data and an underlying immutable system that can download an OS image from a secure registry and can pivot it to being the new root, I think that the need for snapshots is decreased by an order of magnitude.

    The only use case I see for ZFS is actually being the storage platform, for SAN and NAS storage.

    • I think it’s generally an anti-pattern to entangle backups of the operating system with data. They’re distinct things – but in your post you argue it both ways. You suggest backing up whole VM images, but then point out (correctly) that the ostree model provides the benefit of “snapshots” for the OS side only, leaving you to take care of your data how you like.

      So I would avoid VM snapshots in favor of backing up just *your data* – whether that’s an agent on the VM or (much more commonly as you note) using separate network-attached storage.

      • “I think it’s generally an anti-pattern to entangle backups of the operating system with data.”
        I absolutely agree.
        In fact, for physical servers we use:
        1) Acronis for root file systems (only when our physical servers boot from disk, otherwise we just replicate the LUN storage-side)
        2) storage replication from data
        3) TSM for data backup

        For virtual servers we use:
        1) VM snapshots for the OS (data vdisks are excluded)
        2) TSM for data backup

        Sorry if I was not clear. English is not my first language.

        “You suggest backing up whole VM images, but then point out (correctly) that the ostree model provides the benefit of “snapshots” for the OS side only, leaving you to take care of your data how you like.”

        Well, but actually you can exclude data drives from VM snapshots, if you need a separate management. On vmware (I don’t know that much about oVirt so I can only talk for vmware) you have to set the drive in “Independent” mode.
        We’re using it for our database instances on our VMs, since we’re going to manage them separately (we use Oracle’s RMAN+IBM Tivoli). We set the independent mode for every drive that is part of the data VG.

  5. “I’ve seen some Internet discussion about this read the proposed change as applying beyond Fedora Workstation, and that’s wrong.”

    To clarify: the proposal is for desktop variants, so it does include more than Workstation (e.g. KDE and LXDE spins). You’re correct, though, that it does not include Fedora CoreOS, IoT, Server, etc.

  6. I’m not an expert on filesystems, but a few years ago I tried Btrfs and lost all of my data due to hard-to-track issues caused by it. And I’m never gonna look back at it…

  7. ‘nodatacow’ means data overwrites similar to ext4/XFS behavior; and it means no checksums and no compression. You can still make reflink copies and snapshots. You can still btrfs send/receive snapshots containing these files. I’m not sure what “lot of features” you think are somehow disabled, or why it’s a negative to have the ability to selectively disable COW, per subvolume, per directory, or per file as an optimization if the use case prefers it.

    By the way, you might switch to mq-deadline if you’re using bfq, see if that helps the firefox issue. I’m using NVMe and tracking a few thousand extents is not a big deal, my Fedora on Btrfs is still faster than when I reboot Windows 10. There might be a good reason for enabling autodefrag for this use case on rotational devices by default, perhaps even on SSDs. Autodefrag is intended specifically for these kinds of desktop/laptop use case where the insides of files are frequently overwritten.

    Facebook is using Btrfs in millions of servers many millions of containers, based on a send/receive workflow. I very much doubt at this scale it makes any sense for them to choose a file system that suddenly makes that many systems “pets”.

  8. > ‘nodatacow’ means data overwrites similar to ext4/XFS behavior; and it means no checksums and no compression. You can still make reflink copies and snapshots. You can still btrfs send/receive snapshots containing these files. I’m not sure what “lot of features” you think are somehow disabled,

    Thanks, that’s useful. I had thought it excluded snapshots. That said I don’t think it’s just “an optimization” – the performance hit is quite large as you know.

    > Facebook is using Btrfs in millions of servers many millions of containers, based on a send/receive workflow. I very much doubt at this scale it makes any sense for them to choose a file system that suddenly makes that many systems “pets”.

    This is an interesting sub-thread. Yes, one can use BTRFS for non-pet use cases too. But you’d have to try *really* hard to convince me (and the rest of the world) that a home-grown BTRFS send-receive for containers is *more* compelling than using industry standard OCI/Docker unpacking onto overlayfs etc. Again: sure you can do it, but it isn’t any better than the alternatives.

    • If you think the “datacow” stuff is causing issues with firefox, just chattr it so it only fixes ~/.mozilla/ and get that fix upstream. Then you have the best of both worlds.

  9. Pingback: Fedora 33 mit Btrfs als Standard-Dateisystem | linuxnews.de

  10. I think even in the Linux world most people don’t do a full reprovisioning of their workstations on a regular basis. I think deploying Fedora with tools like snapper that come with openSUSE could have a real advantage for a lot of users when it comes to file recovery, and the ability to roll back updates, etc.

  11. ~@stitny$ find ~/.mozilla/ -type f -exec filefrag {} \; | grep -Ee '[0-9][0-9][0-9]+ extents found'
    /home/matej/.mozilla/firefox/6rek90q9.default-1561790693943/storage/default/https+++riot.im^userContextId=2/idb/1680441260lsog.sqlite: 1023 extents found
    /home/matej/.mozilla/firefox/6rek90q9.default-1561790693943/weave/bookmarks.sqlite: 158 extents found
    ~@stitny$ mount|grep home
    /dev/mapper/home on /home type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
    ~@stitny$

    Meaning I have very similar fragmentation on XFS.

