“Immutable” → reprovisionable, anti-hysteresis

This is going to be a longer blog entry, but here’s a TL;DR:

I propose that instead of "immutable" or "read-only" when talking about operating systems (such as Fedora CoreOS, Google COOS, Flatcar etc.), we use these terms:

  • "fully managed": The system does not have "unmanaged state" – e.g. an admin interactively doing ssh and making changes not recorded declaratively somewhere else
  • "image based": Traditional package managers end up with a lot of "hidden state" (related to above); image based updates avoid that
  • "reprovisionable" and not a "pet": I don’t like the industry "pets vs cattle" term, and I think "reprovisionable" is both nicer and more descriptive
  • "Has anti-hysteresis properties": (Yes I know this is an awkward term) See https://en.wikipedia.org/wiki/Hysteresis – I’ll talk more about this later

(Terminology note: in this article I will also use the abbreviation "pkgmgrs" for "traditional package managers like apt/yum". Systems like NixOS and some aspects of swupd from Clear Linux improve parts of what I’m talking about, but this article is already really long and a detailed comparison including those really deserves a separate post.)

Why not "immutable"/"read-only"?

Because it’s very misleading. These systems as a whole are not immutable, or read-only, or stateless – there are writable, persistent data areas. And more importantly, those writable data areas allow persistently storing privileged code. They have to, because these OSes need to support:

  • the user being root on their own computer
  • In place OS updates

(What about systems that don’t support "in place" updates? Yes, there are people/organizations who e.g. build a new cloud image for every change, and often don’t even enable ssh or any persistent writable state for the OS. This is fine, but one problem is that it doesn’t generally apply outside of cloud/IaaS environments, e.g. on bare metal machines, and it can make upgrades for small changes disproportionately expensive.)

Back to operating systems with state that some people call "immutable":

But /usr is read-only!

Yes. And this does have some security benefits, e.g. this runc vulnerability isn’t exploitable.

But in order for the operating system to be updated in place, there must be some writable area to add new OS content – so it’s not immutable. The details of this vary; a number of "image based" operating systems use dual partitions, while OSTree is based on hardlinking with a "hidden" writable data store.

The real reason to have a read-only /usr is to make clear that the content of that directory (the operating system binaries) is "fully managed" or "owned" by the OS creator – you shouldn’t try to overwrite or replace parts of it because those changes could be overwritten by a future update.

And this "changes in /usr being overwritten" is a real existing problem with traditional package-manager systems (pkgmgrs). For example, a while ago I was looking at Keylime and came across this bit in the installer. That change would be silently overwritten by the next yum/apt update, so the system administrator experience would be:

  • Provision system
  • Install things (including keylime)
  • ⌛ Time passes
  • Apply OS updates (not on by default), then keylime breaks for a non-obvious reason

The more correct thing would be for that playbook to write a systemd drop-in in /etc that overrides just ExecStart=, although even doing that is fragile, and what’d be best here is to make this an explicitly configurable option for tpm2-abrmd in a config file in /etc.
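
To make the drop-in idea concrete, here is a minimal sketch in Python that writes one into a scratch directory standing in for /etc. The helper name, the 50-override.conf file name and the placeholder command line are my own assumptions for illustration, not anything Keylime or tpm2-abrmd ships:

```python
import tempfile
from pathlib import Path


def write_execstart_dropin(etc_root: Path, unit: str, exec_start: str) -> Path:
    """Write a drop-in that overrides only ExecStart= for `unit`.

    The file lives under <etc_root>/systemd/system/<unit>.d/, so the unit
    file shipped in /usr is never modified.
    """
    dropin_dir = etc_root / "systemd" / "system" / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    dropin = dropin_dir / "50-override.conf"  # hypothetical file name
    # The empty "ExecStart=" resets the command list inherited from the
    # packaged unit, so the next line replaces it instead of adding to it.
    dropin.write_text(f"[Service]\nExecStart=\nExecStart={exec_start}\n")
    return dropin


if __name__ == "__main__":
    # A scratch directory stands in for /etc so the sketch is safe to run.
    with tempfile.TemporaryDirectory() as scratch:
        path = write_execstart_dropin(
            Path(scratch),
            "tpm2-abrmd.service",
            "/usr/sbin/tpm2-abrmd <your-options>",  # placeholder command line
        )
        print(path.read_text())
```

Because the override lives in /etc, an OS update that replaces the unit file in /usr leaves it alone.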

The overall point is that the reason /usr is read-only is primarily to enforce that user configuration is cleanly separate from the OS content – which becomes particularly important when OS updates are automatic by default, as they are in Fedora CoreOS.

I think having automatic updates on by default fundamentally changes the perception of responsibility around updates; if I’m a system administrator and I typed apt/yum update and things broke, it’s my fault, but if automatic updates are on by default and I’m doing something else and the machine just falls over – it’s the OS vendor’s fault. Linking these two together: Since Fedora CoreOS has automatic updates on, we really need to be clear what’s our responsibility and what’s yours.

Now, this isn’t a new problem, and most people maintaining systems know not to do the kinds of things that Keylime Ansible playbook is doing. But it’s an extremely easy mistake to make without strong discipline when /usr is sitting there writable by any process that runs as root. I’ve seen many, many examples of this.

Nothing actually stops traditional package managers from mounting /usr read-only by default – they could do the equivalent of unshare -m /bin/sh -c 'mount -o remount,rw /usr && apt update' internally. But the challenges there grow into adjusting the rest of the filesystem layout to handle a read-only /usr, such as how OSTree suggests moving /usr/local to /var/usrlocal etc.

Image based updates

Usually instead of talking about an "immutable" system that allows in place updates, it’d be more useful and accurate to say "image based".

And this gets into another huge difference between traditional package managers and image based systems: The amount of "internal state".

The way most package managers work is when you type $pkgmgr install foo, the fact that you want foo installed is recorded by adding it to the database. But the package manager database also includes a whole set of "base packages" that (usually) you didn’t choose. Those "base packages" might come from a base container image when you podman/docker pull; for cloud images they come from the default image; and on physical systems they often come from a distribution-specific default list embedded in or downloaded by the ISO or equivalent.

A problem with this model then is "drift" – if the distribution decides to add a package to the base set, you (usually) don’t get it when applying in place updates, since most package managers just update the set of packages you already have. One solution to this is metapackages, but if not everything in the base is covered you still have drift that can be hard to notice over time.
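
A toy model may help show what I mean by drift. This is only a sketch: the package names and the two "base sets" are made up, and a real package manager tracks versions and dependencies that I ignore here:

```python
def fresh_install(base_set: set[str], chosen: set[str]) -> set[str]:
    """A newly provisioned machine gets the current base set plus user choices."""
    return base_set | chosen


def in_place_update(installed: set[str]) -> set[str]:
    """A typical pkgmgr update refreshes versions but keeps the same package names."""
    return set(installed)


# Hypothetical base sets for two releases of a distribution.
BASE_V1 = {"kernel", "bash", "coreutils"}
BASE_V2 = BASE_V1 | {"chrony"}  # the distribution adds a package to the base

old_machine = in_place_update(fresh_install(BASE_V1, {"tmux"}))
new_machine = fresh_install(BASE_V2, {"tmux"})

# Same release, same user choices, different result: that difference is drift.
drift = new_machine - old_machine
print(sorted(drift))  # prints ['chrony']
```

Both machines are "up to date" and have the same user-chosen packages, yet they differ, and the only way to tell why is to know their history.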

I think for users of many pkgmgrs this "initial state" is hard to disentangle from things you typically do care about like the packages you chose to install. There are e.g. the apt-mark showmanual and dnf history userinstalled commands for this.

And…trying that out by pulling the docker.io/debian:stable image, it claims:

# apt-mark showmanual
iproute2
iputils-ping
#

And that’s the first command I ran in the image! Clearly a bug somewhere. For the fedora:32 base image it lists a bunch of packages that correspond to the bits in the base kickstart – but that’s not something I as the user wrote.

By analogy with /usr vs /etc – this is like mixing local configuration in /usr.

This problem extends beyond the "user installed" database: traditional package managers aren’t aware of the "base bootimage" which operates on a separate infrastructure layer. apt has no idea about the OpenStack image/AMI/qcow2 or whatever that formed its initial state, nor is it aware of the OCI/docker container initial image (and conversely, e.g. podman/docker have no idea that yum/apt etc. are running inside).

So over time, the state of the system with traditional pkgmgrs is a function of many things:

  • Which packages you chose to install (obviously)
  • The set of packages from the initial "bootimage" or container image
  • More subtle things like which packages are in the "user installed" database
  • Even more subtle things can happen when weak dependencies like Recommends change in upstream packages
  • The package manager version: RHEL8 yum has autoremove on by default, RHEL7 and older yum doesn’t

One solution to this type of "drift" is to not use packages at all (pure "base OS" + "apps/containers") like Google COOS, or to group things at a higher level (Clear Linux is more in this bucket).

I’m pretty happy though with the design we came up with for rpm-ostree used by Fedora CoreOS/Silverblue/IoT; there is a clear "base commit" that comes in OSTree format, and you can add packages on top – recasting RPMs as "operating system extensions" (see also this OpenShift enhancement).

For rpm-ostree it’s really simple – by default it operates in pure ostree mode, so if you don’t layer/override any packages you are exactly replicating an ostree commit – and that’s it! You don’t need to think about packages by default.

Particularly for Fedora CoreOS, there is almost nothing in the "bootimage" (ISO, AMI equivalent) that isn’t part of the ostree commit.

In other words, "state of installed software" is a function of (effectively) one thing by default:

  • The ostree commit

It’s even stronger than that really, it’s not just "same packages" it’s "bit for bit identical /usr filesystem". However, there is one important note: /boot does come from the bootimage, see this issue.

Bootloader aside, effectively all of the OS state you care about then does not depend on which bootimage you happened to use to install initially. When OSTree performs an update, it does not matter what the "previous" commit was – the old and new implicitly share files via the hardlink store, but updates always involve a "fresh checkout" of the new commit. Every upgrade is like a fresh OS install of that version with your configuration (/etc) and state (/var) re-applied.
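
Here is a small sketch of that property. The Deployment class and the commit names are hypothetical, and I simplify /etc to a plain carry-over, whereas real OSTree merges your /etc changes against the new defaults:

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    usr_commit: str                                     # checksum naming the whole /usr tree
    etc: dict[str, str] = field(default_factory=dict)   # local configuration
    var: dict[str, str] = field(default_factory=dict)   # local state


def upgrade(current: Deployment, new_commit: str) -> Deployment:
    """Model an update as a fresh checkout of new_commit.

    Nothing is read from current.usr_commit; only /etc and /var carry over.
    """
    return Deployment(usr_commit=new_commit, etc=dict(current.etc), var=dict(current.var))


# Two machines installed from different (hypothetical) bootimages...
a = Deployment("commit-2020-06", etc={"hostname": "node"})
b = Deployment("commit-2020-08", etc={"hostname": "node"})

# ...converge to the same state once both are on the same commit.
print(upgrade(a, "commit-2020-09") == upgrade(b, "commit-2020-09"))  # prints True
```

The point is that upgrade() never looks at the previous commit, so the install history cannot leak into the result.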

With rpm-ostree being a hybrid system, you can choose to engage package layering (or overrides). But the system very clearly highlights that list; note a major simplification is combining the "packages you installed" and "user installed" lists. The rpm-ostree model is very simple: you have a "base commit/image" and your extensions. For example:

$ rpm-ostree status -b
State: idle
BootedDeployment:
● ostree://fedora/32/x86_64/silverblue
                   Version: 32.2 (2020-08-22T17:28:53Z)
                BaseCommit: 080312021f34c7763089ff12fcd2964647e0f55ac3981f869b56d232a33990f6
           LayeredPackages: fish libvirt tmux virt-manager

An important but subtle detail in achieving this simplification: by default, rpm-ostree doesn’t allow marking a base package as user installed. Generally the idea is that removing user-interesting packages from the base image is something you shouldn’t do.

rpm-ostree goes to some lengths internally to make this split happen; the libdnf/rpm layers don’t have any model of "base image" because everything’s a package to them.

Has anti-hysteresis properties

I know "has anti-hysteresis properties" is an awkward phrase (and I’m happy to hear alternatives) but I think hysteresis is a great term that we should start using in computing. Today it seems to mostly be used in the sciences but I propose adopting it – this in the spirit of making computer science more like a real science.

Let’s take a look specifically at elastic hysteresis because it’s easy to understand and even try at home.

Basically, a rubber band has "hysteresis" ("hidden state"/"memory") which comes from how much it was stretched in the past. And this state is basically impossible to see by just looking at the rubber band. For a related example with rubber, see the two balloon experiment.

To tie together the previous section on package managers with this:

Systems managed by traditional package managers (apt/yum/etc) have a lot of effective hysteresis. I think even many experienced system administrators would have trouble confidently and precisely explaining how the multiple things listed above (the container or IaaS base image, package manager user installed database, etc.) all interact in forming the final state of the system over time as in-place upgrades are applied.

Configuration management systems and hysteresis

This "hysteresis" problem occurs not just in package managers but also many configuration management systems (puppet/ansible/etc).

A simple example I’ve seen happen is where the system administrator writes a playbook (or equivalent) that does e.g.:

- name: Allow nopasswd for wheel
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%wheel ALL='
    line: '%wheel ALL=(ALL) NOPASSWD: ALL'

Then later, say the organization wants to change to use a separate group instead of wheel, say admin or whatever.

If the playbook is changed in git to do:

- group:
    name: admin
    state: present
- name: Allow nopasswd for admins
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%admin ALL='
    line: '%admin ALL=(ALL) NOPASSWD: ALL'

The previous change to modify wheel in /etc/sudoers will silently persist (until the system is reprovisioned). And that could become a security problem: members of wheel keep passwordless sudo even though the playbook no longer says so.
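
The effect can be reproduced with a rough Python model of what lineinfile does with state: present (replace the last line matching the regexp, otherwise append). This is my simplification rather than Ansible's actual code, and the one-line starting file is hypothetical:

```python
import re


def lineinfile(lines: list[str], regexp: str, line: str) -> list[str]:
    """Rough model of lineinfile with state=present.

    If any line matches regexp, the last match is replaced by `line`;
    otherwise `line` is appended.
    """
    pattern = re.compile(regexp)
    result = list(lines)
    matches = [i for i, text in enumerate(result) if pattern.search(text)]
    if matches:
        result[matches[-1]] = line
    else:
        result.append(line)
    return result


def playbook_v1(sudoers: list[str]) -> list[str]:
    return lineinfile(sudoers, r"^%wheel ALL=", "%wheel ALL=(ALL) NOPASSWD: ALL")


def playbook_v2(sudoers: list[str]) -> list[str]:
    return lineinfile(sudoers, r"^%admin ALL=", "%admin ALL=(ALL) NOPASSWD: ALL")


pristine = ["root ALL=(ALL) ALL"]  # hypothetical starting /etc/sudoers

long_lived = playbook_v2(playbook_v1(pristine))  # saw both versions over time
reprovisioned = playbook_v2(pristine)            # only ever saw the new version

print(long_lived == reprovisioned)                     # prints False
print("%wheel ALL=(ALL) NOPASSWD: ALL" in long_lived)  # prints True
```

Two machines running the same current playbook end up with different sudoers files, depending only on what was applied to them in the past.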

In most of these configuration management systems, in some cases the admin may need to explicitly add a change which reverts a prior change, and then makes the new change. But not all of the time – some (most) changes don’t need this.

It’s an easy mistake to make when writing effectively arbitrary code to change files in persistent filesystems.

Hence, configuration management systems are subject to hysteresis too, and I think many of them could do better in warning users about this, and pushing for better practices. For example, the playbook would be more "anti-hysteresis" if it wrote to /etc/sudoers.d/mycustom.conf which gets replaced entirely, though /etc/sudoers.d is only supported by relatively modern sudo I think.
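
As a sketch of the "replace the whole file" approach, the following Python renders the complete drop-in from the declared groups and overwrites it each time, roughly what a copy or template style task would do. The function names are mine, a scratch directory stands in for /etc/sudoers.d, and I use the file name mycustom without an extension because the sudo manual says files in that directory whose names contain a '.' are skipped:

```python
import tempfile
from pathlib import Path


def render_dropin(nopasswd_groups: list[str]) -> str:
    """Render the complete file contents from the declared groups alone."""
    return "".join(
        f"%{group} ALL=(ALL) NOPASSWD: ALL\n" for group in sorted(nopasswd_groups)
    )


def apply(dropin_dir: Path, nopasswd_groups: list[str]) -> Path:
    # Overwrite the whole file; nothing from the previous version survives.
    target = dropin_dir / "mycustom"
    target.write_text(render_dropin(nopasswd_groups))
    return target


with tempfile.TemporaryDirectory() as scratch:  # stands in for /etc/sudoers.d
    apply(Path(scratch), ["wheel"])             # first version of the playbook
    target = apply(Path(scratch), ["admin"])    # later version
    final_contents = target.read_text()

print(final_contents, end="")  # prints %admin ALL=(ALL) NOPASSWD: ALL
```

Since the file is a pure function of the current input, the old wheel rule disappears without anyone having to remember to revert it.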

Kubernetes is fairly opinionated in having code in container images you pull (equivalent of /usr), and storing configuration in a configmap (which would get projected into environment variables or files in /etc). When you update a deployment, all state in the (sadly writable by default) pod container filesystem is thrown away, and there’s also no leakage from any previous version of a configmap. So we could say that the Kubernetes approach to applications has strong "anti-hysteresis properties".

The OpenShift Machine Config Operator defaults to anti-hysteresis

Tying together the Kubernetes and operating system threads: in OpenShift 4, the machine-config-operator allows you to write config files and systemd units into the operating system /etc by using kubectl/oc. (The original goal of etcd was in fact to do this, then Kubernetes happened and the focus shifted to that layer. In OpenShift 4 we are meeting that original goal of storing the Unix /etc in etcd via the MCO.)

The reason I claim the MCO has "anti-hysteresis" is it keeps track of the old and new system states reliably and is able to diff them. For example, if you write a config file for chrony to set the timeserver, then later kubectl delete machineconfig/my-chrony-config since you’re fine with the default, the MCO will notice that the old config wrote /etc/chrony.conf and the new one doesn’t, and it will correctly revert the file back to the default.
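
The idea of diffing old and new declared state can be sketched in a few lines of Python. To be clear, this is an illustration of the concept and not the MCO's implementation; the function, the file contents and the time server name are all made up:

```python
def reconcile(disk, defaults, old_config, new_config):
    """Return the new contents of /etc given the old and new declared files."""
    result = dict(disk)
    # Files the old config managed but the new one does not: undo them.
    for path in old_config.keys() - new_config.keys():
        if path in defaults:
            result[path] = defaults[path]  # revert to the OS default
        else:
            result.pop(path, None)         # no default exists, so remove it
    result.update(new_config)
    return result


defaults = {"/etc/chrony.conf": "pool default.example.org iburst\n"}
custom = {"/etc/chrony.conf": "server time.example.com iburst\n"}

# Apply the custom config, then delete it again.
disk = reconcile(dict(defaults), defaults, old_config={}, new_config=custom)
disk = reconcile(disk, defaults, old_config=custom, new_config={})

print(disk == defaults)  # prints True
```

The key ingredient is that the tool remembers what the previous configuration declared, so removing a declaration is an event it can act on.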

Just like OSTree has a checksum describing the state of /usr, the MCO maintains a checksum for its state and when you look at a node, you can say its configuration is e.g. rendered-master-<checksum>. If a system can describe its state with a checksum, that implies it has strong anti-hysteresis properties.
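
To illustrate why a checksum is such a strong statement, here is a sketch that names a configuration by hashing its canonical form. The hashing scheme (SHA-256 over sorted JSON) and the file contents are my own assumptions for the example, not how the MCO actually computes its rendered names:

```python
import hashlib
import json


def rendered_name(pool: str, files: dict[str, str]) -> str:
    """Name a configuration by the checksum of its canonical serialization."""
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    return f"rendered-{pool}-{hashlib.sha256(canonical).hexdigest()}"


a = rendered_name("master", {"/etc/a": "1", "/etc/b": "2"})
b = rendered_name("master", {"/etc/b": "2", "/etc/a": "1"})  # same files, other order
c = rendered_name("master", {"/etc/a": "1", "/etc/b": "changed"})

print(a == b, a == c)  # prints True False
```

The name depends only on the content, not on the order or the history by which that content was reached, which is exactly the property a system with hysteresis cannot offer.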

Now, there are holes in this model. If for example instead of writing a file directly, you create a systemd unit which does e.g. ExecStart=/bin/sh -c 'echo somedata > /etc/someotherfile', and you later delete that unit – the file will persist. The reason why relates to this FAQ.

A general pattern here is that any place you have arbitrary code that changes over time writing to persistent files, you’re at risk of hysteresis (or "unmanaged state").

Reprovisionable

OK, so systems with anti-hysteresis properties are good. But in practice, I think there’s always going to be that small amount of "unmanaged state" that sneaks in even in organizations with strong discipline. For example, a system administrator trying to debug one node and using ssh to edit a file directly to increase the debug level of a service, and then later that causes a problem by flooding the log system or causing more I/O to the local filesystem and increasing latency for other services.

And this problem isn’t just at the operating system layer; at the IaaS/CaaS layer it’s easy to have VMs or containers that were created manually to debug something and then "leak" unless actively removed.

In an IaaS deployment there are a wide variety of objects in general (storage buckets, SaaS etc.) and equally many tools to deal with leaks at that level; usually this boils down to a "resource tagging" approach. (One thing I think is nice about GCP over e.g. AWS is the "project" approach, specifically this bit: "This model can also be useful for testing purposes: once you’re done with a project, you can delete the project, and all of the resources created by that project will be deleted as well.")

At the operating system level (and at the IaaS level if you can too), I think a good way to deal with this is to periodically reprovision, e.g. once a month (if you can do faster, great) on a rolling basis. In OpenShift 4 for example with the machine-api-operator that would just be a small amount of code (a custom controller running as a pod) to periodically kubectl delete machine/<somemachine> based on whatever criteria you want – the platform will handle the rest, spinning up a new one to take its place. Currently this only applies to workers but I hope we can cover the control plane in future releases. A neat thing about this is that the IaaS layer (virtual machines) is just Kubernetes custom resources that are managed via the cluster.
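
The selection logic for such a controller could look roughly like the sketch below. It only decides which machines are due; it does not talk to any Kubernetes API, and the function name, the 30 day default and the machine names are assumptions for the example:

```python
from datetime import datetime, timedelta, timezone


def pick_for_reprovision(created, now, max_age=timedelta(days=30), batch_size=1):
    """Return up to batch_size machine names older than max_age, oldest first."""
    expired = sorted(
        (created_at, name)
        for name, created_at in created.items()
        if now - created_at > max_age
    )
    return [name for _, name in expired[:batch_size]]


now = datetime(2020, 9, 1, tzinfo=timezone.utc)
created = {  # hypothetical machine names and creation times
    "worker-a": datetime(2020, 7, 1, tzinfo=timezone.utc),
    "worker-b": datetime(2020, 7, 20, tzinfo=timezone.utc),
    "worker-c": datetime(2020, 8, 25, tzinfo=timezone.utc),
}

print(pick_for_reprovision(created, now))  # prints ['worker-a']
```

Keeping batch_size small is what makes this "rolling": a real controller would delete the returned machine, wait for its replacement to become ready, and only then pick the next one.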

Conclusion: We want reprovisionable, anti-hysteresis systems

There are a whole lot of current terms for what I’ve covered above: "gitops", "managed configuration", "cattle", "stateless", "immutable infrastructure" etc. I’m suggesting the goal is: reprovisionable infrastructure with anti-hysteresis properties. But, I’d also be happy if we used "reprovisionable" instead of "cattle", and also if we introduced the term "anti-hysteresis" instead of "immutable" (where applicable).


4 comments

  1. Small detail:
    > the playbook would be more “anti-hysteresis” if it wrote to /etc/sudoers.d/mycustom.conf

    It won’t work. In sudo manual you will find this little gem:

    > sudo will […] read each file in /etc/sudoers.d, skipping file names that end in ‘~’ or contain a ‘.’ character

    So “.conf” needs to go away. I got burned by this in the past. Thank you, sudo, for such non-intuitive behavior.

  2. I’d have expected a lot more manually installed packages on the Debian system. Manual vs automatic is not which packages you installed since the initial install – manual forms the root set of what not to remove. In an initial system, it needs to contain all non-essential (as Essential packages are always reinstalled by apt) packages that don’t have a dependency on them. In Ubuntu, we build the images with all packages manually, and then mark everything reachable from a metapackage as automatically installed, but the meta packages would be manual.

    It’s not about what you have manually installed per se, but rather what you care about vs something that can be removed automatically

  3. This is a super interesting topic!

    Re. Ansible, I think the declarative idempotent model it’s going for does make concrete improvements and is better at tackling unmanaged state than many ad-hoc approaches. So it at least tries to go into the right direction. But definitely, there’s a lot that remains on the table as you mentioned. One random idea is that in theory Ansible could keep on the host somewhere a ledger of things it did, so that it could support *reverting* actions which are no longer described in the playbook; and the reason it can do that is because of its declarative model (modulo escape-hatch modules like `command`, so it’d have to be a “strict declarative mode” or something).

    This reminded me of when we used jenkins-job-builder. The `update` command (which actually creates the Jenkins jobs from your YAML) had this `--delete-old` switch to remove all the jobs which are no longer described in the YAML. So then you could be sure that the state of Jenkins (more or less) matched your input. And there’s `rsync`’s `--delete` flags too of course. And Ignition’s and rpm-ostree’s declarative model, etc…

    I think where this leads me is that another way to look at this “anti-hysteresis” property is whether your workflow is “wholly declarative”, i.e. the whole system is described by the inputs and the process by which a tool translates those inputs into state mutation lacks/minimizes [side effects](https://en.wikipedia.org/wiki/Side_effect_(computer_science)). Side effects are what leads to unmanaged state, which leads to hysteresis/history dependence.

    • It’s important to emphasize this isn’t *just* Ansible – e.g. I believe Puppet also has this problem with its basic `file` resource.

      There was a thread on Twitter that mentioned mgmtconfig has a “reversibility” property that aims to address this: https://twitter.com/purpleidea/status/1297348918959919107 – your “ledger” concept seems similar.

      The MCO special cases this for `/etc` by keeping track of the previous file (which ostree already has but the MCO wants to handle traditional RHEL too).

      But it seems hard to generalize this without a system that can do “reset back to default state, apply these changes” (i.e. how ostree does updates including /etc modifications). And config management systems generally aren’t in a position to do that.

      > I think where this leads me is that another way to look at this “anti-hysteresis” property is whether your workflow is “wholly declarative”, i.e. the whole system is described by the inputs and the process by which a tool translates those inputs into state mutation lacks/minimizes [side effects](https://en.wikipedia.org/wiki/Side_effect_(computer_science)).

      You are right that the term “declarative” is commonly used today for this “anti-hysteresis” property. For example NixOS talks about that. I think hysteresis is a better term basically because “side effects” are really hard to avoid in general, and also a lot of systems that *look* declarative are still subject to this. For example, systemd unit files look very declarative, but mutating them in `/usr` is a problem. Writing to `/etc` still seems like a “side effect” right? It’s just that by doing it that way, we don’t create a dependency on time/ordering (os update after writing file is a problem).

      So I still think “anti-hysteresis” better captures the problem domain than “declarative” for those reasons.

