systemd is really well designed

One of the things I think has generally worked well about “Linux” and the ecosystem on top of it has been the variety of userspace.  There’s obviously some pointless things, but also some genuine innovation.  It works well when upstream projects are structured in a way that they can be mixed and matched.

For Fedora CoreOS we are combining two technologies; Ignition and rpm-ostree.  Previously they were used independently (Ignition with a ChomeOS style A/B updater) and rpm-ostree with the traditional Fedora-and-derivatives setup of Kickstart for bare metal, and cloud-init for clouds.

Putting the two together has been working well so far, but I’ve recently been working on support for root filesystem reprovisioning which is where the two projects intersect strongly.  This has meant a lot of time writing code in the initramfs.

And the topic of this blog is “systemd is well designed” because the design for systemd in the initramfs is very flexible and also well documented.  We’re continuing to support Ignition independent of OSTree, as well as OSTree independent of Ignition, while also having both of them work together.  Further, the projects are written in different languages; Ignition is Go, OSTree is C, we have the usual (unfortunate) mix of shell script in there, and it’s likely we’ll add some Rust soon too.

This is where systemd excels; we can plug these things together in a coordinated fashion by writing unit files with careful dependencies.  They can communicate however they want; in practice, writing files in /run is a common pattern.

Also worth noting is we’re using dracut, which is itself independent of systemd, designed to just implement the systemd boot sequence – our units plug into the “custom initrd services” section.  And it all Just Worked.  The systemd initramfs boot sequence (and the boot sequence in general) was designed long before either OSTree or Ignition were created, but it’s stood the test of time.

OpenShift 4: Streamlining RHEL as Kubernetes-native OS

Been a while since I’ve blogged here, going to try to do so more often!  For quite a while now in the CoreOS group at Red Hat I’ve been part of a team working to create RHEL CoreOS, the cluster-managed operating system that forms a base of the just-released OpenShift 4.

With OpenShift 4 and RHEL CoreOS, we have created a project called machine-config-operator – but I like to think of it as the “RHEL CoreOS operator”.  This is a fusion of technologies that came from the CoreOS acquisition (Container Linux, Tectonic) along with parts of RHEL Atomic Host, but with a lot of brand new code as well.

What the MCO (machine-config-operator) does is pair with RHEL CoreOS to manage operating system updates as well as configuration in a way that makes the OS feel like a Kubernetes component.

This is a radically different approach than the OpenShift 3.x days, where the mental model was to provision + configure the OS (and container runtime), then provision a cluster on top.   With OpenShift 4 using RHCOS and the MCO, the cluster controls the OS.

If you haven’t yet, I encourage you to dive right in and play around with some of the example commands from the docs as well as examples from the upstream repository.  There is also my 2019 talk (slightly dated now).

The release of 4.1 of course is just a beginning – there’s a whole lot more to do to bridge the worlds of the “traditional” operating system and Kubernetes/OpenShift.  For example, in git master of the MCO (for the next release after 4.1) we landed support for kernel arguments.  I think it’s quite cool to be able to e.g. oc edit machineconfig/50-nosmt, change the KernelArguments field in the MachineConfig CRD, add e.g. nosmt (or any other karg) and watch that change incrementally roll out across the cluster, reconciling the OS state just like any other Kubernetes object.

The links above have lots more detail for those interested in learning more – I’ll just link again operating system updates as I think that one is particularly interesting.

This release of OpenShift 4.1 is laying a powerful new foundation for things to come, and I’m really proud of what the teams have accomplished!


On dm-verity and operating systems

TL;DR: I posit that dm-verity is most useful if one is making a true fixed purpose device that has extremely limited configuration. If one allows installing (unprivileged) software, the protection is weaker. And if it’s an intentional design feature of the OS to allow persistently installing privileged software, the value of dm-verity plummets significantly.

I am one of the upstream maintainers of the ostree project which is comparable with projects that do A/B style partition updates for operating systems, although it’s implemented at the filesystem and not the block level. There’s a a bit more on related projects here.

We got a request to investigate dm-verity, and I wrote down some preliminary thoughts. However, since then I spent a while thinking about it, and the benefits/drawbacks of dm-verity.

As I mention in the TL;DR section, I’m going to claim that dm-verity is best when the machine/device has limited configuration (config files should not be arbitrary code) and no ability to install software. For example, take a “WiFi camera”. These types of devices are obviously in the news for security issues.

What can dm-verity do for appliance-type systems?

Let’s say the device’s web interface has a flaw that allows an attacker on the local network to gain code execution; for example, command injection. However, the device manufacturer has properly implemented dm-verity, and every persistent mount point is read-only and verity protected. This is a significant barrier to the attacker maintaining persistence across a reboot. Concretely, one could unplug the camera, plug it into a secure network, allow it to download an OS update fixing the vulnerability, and have some confidence the exploit hasn’t persisted.

However, even that said, there are limits to the value here. dm-verity does not help you with the attacker monitoring the camera and spying on you; for example if it has a view of one of your offices, it could be recording your typed passwords. Attackers could use it to launch attacks on other devices on the network until it’s been rebooted. This article gives an example of nation-state level malware that lived “in the network”, not persistently on disk; in order to remove it the organization had to reboot everything at the same time.

Dm-verity on non-appliance systems

The ChromeOS trusted boot design docs have a section titled “Known weaknesses of verified boot”:

While verified boot can ensure that the system image (i.e. firmware, kernel, root file system) are protected against tampering by attackers, it can’t protect data that must inherently be modifiable by a running system. This includes user data, but also system-wide state such as system configuration (network, time zone, keyboard layout, etc.)…

One very interesting thing here is the fundamental difference between the original ChromeOS design (a device to just run Chrome i.e. web pages, no 3rd party non-browser software at all) and Android, which is obviously all about “apps”. Modern Android does use dm-verity; as I understand it the ChomeOS and Android projects are trying to merge some technologies, which includes the OS update mechanism.

On Android, apps are “unprivileged” or non root software, without Linux capabilities. But from a user perspective of course, applications can do quite a lot; similar to the WiFi camera case, attackers are likely perfectly happy injecting “unprivileged” Android applications that can monitor your location, microphone etc. Besides the well-known issues with Android devices not receiving security updates, there is a good example of a privilege escalation issue in Android called Cloak And Dagger; applications can exploit the accessibility framework to escalate their privileges, including full keystroke recording.

That said, persisting in an application does increase the chance an attacker could be detected. And if one suspects an Android device is compromised, dm-verity does provide value in that one can do a factory reset, and a bit like the WiFi camera scenario, do an OS update (before reinstalling apps), and have some confidence that the malware hasn’t persisted.

Dm-verity on full general purpose systems

A fully general purpose operating system needs to allow the installation of privileged code as well. An example of an OS that uses dm-verity and allows 3rd party code to execute with full (i.e. Linux CAP_SYS_ADMIN privileges) is CoreOS (yes, I know they renamed it to “Container Linux” but sorry, I think that’s silly, I’m going to keep calling it “CoreOS” 😃 ).

Installing a tool like Kubernetes on top of CoreOS requires it to be fully privileged to do its job (specifically the kubelet). Having a mechanism to install privileged software persistently means that same mechanism can be used by malware. While it’s true the malware doesn’t need to live in the /usr directory, unlike the non-configurable camera scenario, a software update and reboot isn’t going to fix things.

Also on CoreOS, attackers can write fully privileged unit files in /etc/systemd/system/, or the classic Unix things writing /root/.bashrc. These are all places where malware can persist across reboots. dm-verity in theory does make detection easier – but most system administrators are going to find it easier to simply re-provision their systems, and not look carefully at all of the files in /etc.

Ostree-style flexibility vs fixed block devices

Now let’s examine what an ostree-based system like Fedora Atomic Host does to help with preventing these types of hacks? Unfortunately, the answer is nothing! Atomic Host systems are equally general purpose. Since you can e.g. configure the system to set a HTTP proxy, and attacker could create a systemd unit file that runs ExecStart=curl | sh. Further, the OS data in /sysroot/ostree/repo isn’t verity protected; it’s just data in a filesystem, just like RPMs/debs etc. And for that matter, just like Docker overlay2 container files.

Why not implement dm-verity anyways? The answer is that I think it’s more valuable to have 3rd party software installation more tightly integrated with the host. We’re working on system containers for Kubernetes for example – these system containers have part of their configuration on the host, and configuration files down the line are going to be tracked by RPM. And outside of the container space, rpm-ostree supports “package layering”, which brings the best features of image update systems with the flexibility of package systems. You can use package layering to install privileged software like PAM modules, kernel drivers and the like. We recently landed the first experimental support for live system updates. This would be technically much harder if we operated at the block level, which dm-verity would force us into. Not to mention deep questions around signing of the bootable hash.

Package layering is crucial to provide flexiblity for “small scale” or “pet” machines. rpm-ostree allows you to use yum/apt/zypper style workflows,, and still get the benefits of image-like approaches. Such as known-good “base image”, transactional updates and “offline” updates. For example, with rpm-ostree you can uninstall your layered packages, and this will return the system to exactly to the “base image” in /usr.

Some people I’ve talked to about package layering don’t like the idea of still doing package installs per-machine. This is often the “large scale identical machine” cases – racks of identical servers (or at least ones that can use the same OS image), and “corporate standard build” laptops. In the large scale server case, organizations would prefer doing a “custom compose”, baking in their configuration to the images.

This goes back to a potential dm-verity scenario; in this model, we’d really want /etc to be immutable at runtime. Traditional files that need to be modified at runtime under /etc like /etc/resolv.conf would be a symlink into /run. Other “persistence vectors” like /usr/local and /root would need to be verity-protected too. The only writable, persistent filesystem should be /var. We’d also need to audit the operating system to make sure that no code can live in /var. A quick inspection shows there’d be work to do here; for example, I suspect /var/cache/ldconfig/aux-cache is used by the dynamic linker. There’s also /var/lib/alternatives. Hm, I notice my workstation has/var/spool/at – a cron job would be an excellent persistence vector too.

This sounds relatively doable. Get rid of things like at (Fedora Atomic Host already doesn’t have either the legacy cron or at – we suggest people use systemd timers). Moving that type of configuration underneath either /etc or /usr, which is what the “systemd config model” does, and those directories are read-only at runtime.

But going back to the high level – for general purpose operating systems, I’d take the flexibility of rpm-ostree’s dynamic package layering over having dm-verity for just a subset of privileged code. Being able to seamlessly install utilities on the host is very useful. We’ve even landed some recent work on replacing parts of the “base image”. I don’t want to build a new OS image every time I wanted to test a new version of docker or systemd, at least in a dev/test cycle.

I think there’s a spectrum here – with the “ostree model” enforcing read-only constraints around /usr, we are supporting iteration towards the more locked down “verity appliance” style devices. I know there are both ostree (and rpm-ostree) users today who are willing to drop some of the flexibility for increased security. If you’re one of those, please do follow the upstream issues linked above!

Concretely, you could build a tool that takes a kickstart configuration (your requested partitioning, time zone, etc.) plus generic %post style configuration (extra PAM tweaks, Docker registries), plus layered packages, plus container images, (and container runtime configuration?) and put all of that into a disk image, signed with dm-verity.

A challenge here is a lot of organizations are going to want branching. If one wants to update to a new version of Kubernetes/OpenShift, that would require a new image build. Organizations are going to want multiple active versions, to try out new OS builds in staging. Changing any configuration file that lives in /etc would also be a new image build. There are clearly files in /etc where a “heavyweight” change process could make sense; for example, the CA trust roots in /etc/pki.

Back to my original thesis, the dm-verity approach is best for IoT/appliance devices with truly limited configuration. As soon as you have any persistent place to write configuration/code that isn’t verity protected, its value drops.

Peer review, FOSS, and packaging/containers etc

Lately whenever I give a presentation, I often at least briefly mention one of my primary motivations for doing what I do:  I really like working in global community of people on Free Software.

A concrete artifact of that work is the code landing in git repositories.  But I believe it’s not just about landing code – peer review is a fundamental ingredient.

Many projects of course start out as just one person scratching an itch or having fun.  And it’s completely fine for many to stay that way.  But once a project reaches a certain level of maturity and widespread usage, I think it’s generally best for the original author to “step down” and become a peer.  That’s what I’ve now done for the OSTree project.

In other words, landing code in git master for a mature project should require at least one other person to look at it.  This may sound obvious, but you’d be surprised…there are some very critical projects that don’t have much the way of peer review.

To call out probably the most egregious example, the bash shell.  I’m deliberately linking to their “git log” because it violates all modern standards for git commit messages.  Now,  I don’t want to overly fault Chet for the years and years he’s put into maintaining the Bash project on his own time.  His contribution to Free Software is great and deserves recognition and applause.  But I believe that getting code into bash should involve more than just him replying to a mail message and running git push.  Bash isn’t the only example of this in what I would call the “Linux distribution core”.

Another major area where there are gaps are the “language ecosystems like Node.js, Rust’s cargo, Python’s pip etc.  Many projects on there are “one person scratching an itch” that other people mostly just consume.

There’s no magical solution to this – but in e.g. the language ecosystem case, if you happen to maintain a library which depends on another one, maybe consider spending a bit of your time looking at open pull requests and jumping in with review?

A vast topic related to this is “who is qualified to review” and “how intensively do I review”, but I think some qualified people are too timid about this – basically it’s much better to have a lightweight but shallow process than none at all.

Now finally, I included “packaging” in the title of this blog, so how does that relate?  It’s pretty simple, I also claim that most people doing what is today known as “packaging” should sign up to participate in upstream peer review.  Things like build fixes should go upstream rather than being kept downstream.  And if upstream doesn’t have peer review, reconsider packaging it – or help ensure peer review happens upstream!



Github, accounts, and ease of contribution

At the moment we’re making plans to move OSTree to Github (from GNOME), and while there are a few reasons for this, one thing I want to talk about is the “account problem” and specifically how it relates to free and open source software.

The “account problem” is simply that requiring users to create them is a barrier to contribution.   It’s problematic to require people to have a Sourceforge account, a GNOME account, a Github account, an Apache Bugzilla account, a Fedora/CentOS account, etc.  People who are committed to making a larger contribution can obviously easily overcome this, but for smaller contributions it hurts.

Particularly for projects like GNOME that have distinct accounts for bugzilla and commit.  Having to create an account just to file a bug is bad.  Yes, there’s OpenID, but still.

I’ll note at this point that software freedom is quite important to me, and the fact that Github is proprietary software is a problem.  But – making it easy for people to contribute to Free Software is also a major benefit.

I wonder how things would have turned out if Sourceforge had been…well, let’s say “less crappy”.  Anyways, now we have Github.

And when we move OSTree, I’d like to avoid becoming too dependent on it.  Particularly for things that aren’t actually git, like the issue tracker. Hopefully if GNOME doesn’t disagree, we’ll maintain our mailing list and bugzilla there so that people who prefer that can use it.

But allowing people to create Github PRs easily is really critical in my mind.  (On this topic, we are also planning to use the Homu bot, which I really like)


Thoughts on unikernels/rump kernels

I spend most of my time working on Project Atomic to further Linux containers deriving from a traditional upstream Linux distribution model, but the space of software delivery/runtime mechanisms is vast, and in particular, I have thought Unikernels were an interesting development.   While I do like writing C, the thought of an OS/library in a high level language is an interesting one (particularly interesting to me for a long time is how garbage collection could be better if integrated with the OS).

That was before Docker, Inc. acquired a unikernel company – now, I’m certainly curious where they’re going to go with it.

My thoughts before this were that the Unikernel model might make sense in the scenario where you have a “large” application and your sole deployment target is required to be virtualized (e.g. AWS, GCE, etc.).

In this case, it’s not really possible to share anything between virtual machines directly (modulo KSM and similar ad-hoc techniques which cost CPU and aren’t always predictable) – and so because you can’t share anything between these apps, it could gain you efficiency to dump the parts of the OS and userspace that you aren’t using in that VM, which could be a lot.

But, if you have any smaller microservice applications, it seems to me that having a shared kernel and userspace (as we provide with the Project Atomic and OpenShift 3 models) is going to be a lot more efficient than doing a VM-per-microservice, even if your VMs are unikernels.

And even with the “large app only for virt” scenario, what about debugging?  Ah yes, I just found a blog from Bryan Cantrill on this topic, and I have to say I agree.

Still though, there’s lots of middle ground here.  We can do far better at helping application authors to produce smaller apps (and host images) than we are with Docker normally right now, for example.

New Atomic Host verb: rpm-ostree deploy

TL;DR: We’ve improved the host version management in Fedora Atomic Host, and you can now use atomic host deploy $version to atomically switch to a well-known version.

Longer version:

The awesome Cockpit project has been working on a UI for managing Atomic Host/OSTree updates. See this page for some background on their design.

If you download the most recent Fedora Atomic Host release, then atomic host upgrade, you’ll get a new rpm-ostree release which in turn has a new “deploy” verb. This was created to help implement the above Cockpit design; it’s a command line talking to code equivalent to what the Cockpit UI pull request will use.

This is noteworthy for several reasons. First, it really unlocks the “server side history” aspect of OSTree for the host tree. This is similar to tagged builds in a Docker repository for a container.

In order to explain this, one needs to understand that currently in Fedora, there is at most one content release per day. This is true of the traditional single “big repository of RPMs”, and also the OSTree commits derived from that used for Atomic Host.

OSTree has support for a metadata key per commit called ostree.version which is what you see when you type atomic host status. At present, we’re implementing a model where the version numbers are of the form “$major.$increment”, and at the time of this writing the version is 23.33, or 33 commits from release.

With that background out of the way, the interesting thing about the new rpm-ostree deploy (mapped via atomic host deploy) command is it allows you atomically switch one or more in a cluster of machines to a pre-determined version you have tested and validated.

For example, if you’re trying the current Fedora Atomic Host build, you can invoke:

# atomic host deploy 23.32


Transaction complete; bootconfig swap: no deployment count change: 0
Freed objects: 2.1 kB
krb5-libs 1.14-3.fc23 -> 1.14-2.fc23
lua 5.3.2-2.fc23 -> 5.3.0-4.fc23
Run "systemctl reboot" to start a reboot

If you contrast this with the traditional yum update or atomic host upgrade – these commands will both by default pick the latest versions of the components. If the OS vendor is providing updates while you’re in the middle of an upgrade, you could get hosts with a mix of updated or not, with changes you haven’t validated.

Now of course, there are several projects which help in implementing versioning on top of the OS vendor content. The Pulp project is an example which allows importing upstream RPM (or other) content, and managing well-known snapshots of it. Then you configure your client machines to pull from those immutable snapshots, rather than directly from upstream.

Doing this sort of downstream repository management makes a lot of sense for greater than small scales – among other things one often wants local mirroring as well. But even with a local versioned content mirror, it can be very convenient to have the intelligence to traverse the repository history built into the client. It also helps the repository management case as it can reuse the upstream versions, rather than trying to synthesize them downstream.

There’s a lot more work to do on top of this of course. I just posted a proposal for reworking the commit stream which I think would make this nicer. And the above linked Cockpit pull request will be very cool to see!