LLMs and core software: human driven

It’s clear LLMs are one of the biggest changes in technology ever. The rate of progress is astounding: recently due to a configuration mistake I accidentally used Claude Sonnet 3.5 (released ~2 years ago) instead of Opus 4.6 for a task and looked at the output and thought “what is this garbage”?

But daily now: Opus 4.6 is able to generate reasonable PoC level Rust code for complex tasks for me. It’s not perfect – it’s a combination of exhausting and exhilarating to find the 10% absolutely bonkers/broken code that still makes it past subagents.

So yes I use LLMs every day, but I will be clear: if I could push a button to “un-invent” them I absolutely would because I think the long term issues in larger society (not being able to trust any media, and many of the things from Dario’s recent blog etc.) will outweigh the benefits.

But since we can’t un-invent them: here’s my opinion on how they should be used. As a baseline, I agree with a lot from this doc from Oxide about LLMs. What I want to talk about is especially around some of the norms/tools that I see as important for LLM use, following principles similar to those.

On framing: there’s “core” software vs “bespoke”. An entirely new capability of course is for e.g. a nontechnical restaurant owner to use an LLM to generate (“vibe code”) a website (excepting hopefully online orderings and payments!). I’m not overly concerned about this.

Whereas “core” software is what organizations/businesses provide/maintain for others. I work for a company (Red Hat) that produces a lot of this. I am sure no one would want to run for real an operating system, cluster filesystem, web browser, monitoring system etc. that was primarily “vibe coded”.

And while I respect people and groups that are trying to entirely ban LLM use, I don’t think that’s viable for at least my space.

Hence the subject of this blog is my perspective on how LLMs should be used for “core” software: not vibe coding, but using LLMs responsibly and intelligently – and always under human control and review.

Agents should amplify and be controlled by humans

I think most of the industry would agree we can’t give responsibility to LLMs. That means they must be overseen by humans. If they’re overseen by a human, then I think they should be amplifying what that human thinks/does as a baseline – intersected with the constraints of the task of course.

On “amplification”: Everyone using a LLM to generate content should inject their own system prompt (e.g. AGENTS.md) or equivalent. Here’s mine – notice I turn off all the emoji etc. and try hard to tune down bulleted lists because that’s not my style. This is a truly baseline thing to do.

Now most LLM generated content targeted for core software is still going to need review, but just ensuring that the baseline matches what the human does helps ensure alignment.

Pull request reviews

Let’s focus on a very classic problem: pull request reviews. Many projects have wired up a flow such that when a PR comes in, it gets reviewed by a model automatically. Many projects and tools pitch this. We use one on some of my projects.

But I want to get away from this because in my experience these reviews are a combination of:

Extremely insightful and correct things (there’s some amazing fine-tuning and tool use that must have happened to find some issues pointed out by some of these)
Annoying nitpicks that no one cares about (not handling spaces in a filename in a shell script used for tests)
Broken stuff like getting confused by things that happened after its training cutoff (e.g. Gemini especially seems to get confused by referencing the current date, and also is unaware of newer Rust features, etc)

In practice, we just want the first of course.

How I think it should work:

A pull request comes in
It gets auto-assigned to a human on the team for review
A human contributing to that project is running their own agents (wherever: could be local or in the cloud) using their own configuration (but of course still honoring the project’s default development setup and the project’s AGENTS.md etc)
A new containerized/sandboxed agent may be spawned automatically, or perhaps the human needs to click a button to do so – or perhaps the human sees the PR come in and thinks “this one needs a deeper review, didn’t we hit a perf issue with the database before?” and adds that to a prompt for the agent.
The agent prepares a draft review that only the human can see.
The human reviews/edits the draft PR review, and has the opportunity to remove confabulations, add their own content etc. And to send the agent back to look more closely at some code (i.e. this part can be a loop)
When the human is happy they click the “submit review” button.
Goal: it is 100% clear what parts are LLM generated vs human generated for the reader.

I wrote this agent skill to try to make this work well, and if you search you can see it in action in a few places, though I haven’t truly tried to scale this up.

I think the above matches the vision of LLMs amplifying humans.

Code Generation

There’s no doubt that LLMs can be amazing code generators, and I use them every day for that. But for any “core” software I work on, I absolutely review all of the output – not just superficially, and changes to core algorithms very closely.

At least in my experience the reality is still there’s that percentage of the time when the agent decided to reimplement base64 encoding for no reason, or disable the tests claiming “the environment didn’t support it” etc.

And to me it’s still a baseline for “core” software to require another human review to merge (per above!) with their own customized LLM assisting them (ideally a different model, etc).

FOSS vs closed

Of course, my position here is biased a bit by working on FOSS – I still very much believe in that, and working in a FOSS context can be quite different than working in a “closed environment” where a company/organization may reasonably want to (and be able to) apply uniform rules across a codebase.

While for sure LLMs allow organizations to create their own Linux kernel filesystems or bespoke Kubernetes forks or virtual machine runtime or whatever – it’s not clear to me that it is a good idea for most to do so. I think shared (FOSS) infrastructure that is productized by various companies, provided as a service and maintained by human experts in that problem domain still makes sense. And how we develop that matters a lot.

Thoughts on agentic AI coding as of Oct 2025

Sandboxed, reviewed parallel agents make sense

For coding and software engineering, I’ve used and experimented with various frontends (FOSS and proprietary) to multiple foundation models (mostly proprietary) trying to keep up with the state of the art. I’ve come to strongly believe in a few things:

Agentic AI for coding needs strongly sandboxed, reproducible environments
It makes sense to run multiple agents at once
AI output definitely needs human review

Why human review is necessary

Prompt injection is a serious risk at scale

All AI is at risk of prompt injection to some degree, but it’s particularly dangerous with agentic coding. All the state of the art today knows how to do is mitigate it at best. I don’t think it’s a reason to avoid AI, but it’s one of the top reasons to use AI thoughtfully and carefully for products that have any level of criticality.

OpenAI’s Codex documentation has a simple and good example of this.

Disabling the tests and claiming success

Beyond that, I’ve experienced multiple times different models happily disabling the tests or adding a println!("TODO add testing here") and claim success. At least this one is easier to mitigate with a second agent doing code review before it gets to human review.

Sandboxing

The “can I do X” prompting model that various interfaces default to is seriously flawed. Anthropic has a recent blog post on Claude Code changes in this area.

My take here is that sandboxing is only part of the problem; the other part is ensuring the agent has a reproducible environment, and especially one that can be run in IaaS environments. I think devcontainers are a good fit.

I don’t agree with the statement from Anthropic’s blog

without the overhead of spinning up and managing a container.

I don’t think this is overhead for most projects because Where it feels like it has overhead, we should be working to mitigate it.

Running code as separate login users

In fact, one thing I think we should popularize more on Linux is the concept of running multiple unprivileged login users. Personally for the tasks I work on, it often involves building containers or launching local VMs, and isolating that works really well with a full separate “user” identity. An experiment I did was basically useradd ai and running delegated tasks there instead. To log in I added %wheel ALL=NOPASSWD: /usr/bin/machinectl shell ai@ to /etc/sudoers.d/ai-login so that my regular human user could easily get a shell in the ai user’s context.

I haven’t truly “operationalized” this one as juggling separate git repository clones was a bit painful, but I think I could automate it more. I’m interested in hearing from folks who are doing something similar.

Parallel, IaaS-ready agents…with review

I’m today often running 2-3 agents in parallel on different tasks (with different levels of success, but that’s its own story).

It makes total sense to support delegating some of these agents to work off my local system and into cloud infrastructure.

In looking around in this space, there’s quite a lot of stuff. One of them is Ona (formerly Gitpod). I gave it a quick try and I like where they’re going, but more on this below.

Github Copilot can also do something similar to this, but what I don’t like about it is that it pushes a model where all of one’s interaction is in the PR. That’s going to be seriously noisy for some repositories, and interaction with LLMs can feel too “personal” sometimes to have permanently recorded.

Credentials should be on demand and fine grained for tasks

To me a huge flaw with Ona and one shared with other things like Langchain Open-SWE is basically this:

Sorry but: no way I’m clicking OK on that button. I need a strong and clearly delineated barrier between tooling/AI agents acting “as me” and my ability to approve and push code or even do basic things like edit existing pull requests.

Github’s Copilot gets this more right because its bot runs as a distinct identity. I haven’t dug into what it’s authorized to do. I may play with it more, but I also want to use agents outside of Github and I also am not a fan of deepening dependence on a single proprietary forge either.

So I think a key thing agent frontends should help do here is in granting fine-grained ephemeral credentials for dedicated write access as an agent is working on a task. This “credential handling” should be a clearly distinct component. (This goes beyond just git forges of course but also other issue trackers or data sources that may be in context).

Conclusion

There’s so much out there on this, I can barely keep track while trying to do my real job. I’m sure I’m not alone – but I’m interested in other’s thoughts on this!

Why bootc doesn’t require “/usr merge”

The systemd docs talk about UsrMerge, and while bootc works nicely with this, it does not require it and never will. In this blog we’ll touch on the rationale for that a bit.

The first stumbling block is pretty simple: For many people shipping “/usr merge” systems, a a lot of backwards compatibility symlinks are required, like /bin → /usr/bin etc. Those symbolic links are pretty load bearing, and we really want them to also not just be sitting there as random mutable state.

This problem domain really scope creeps into “how does / (aka the root filesystem)” work?

There are multiple valid models; one that is viable for many use cases is where it’s ephemeral (i.e. a tmpfs) as encouraged by things like systemd-volatile-root. One thing I don’t like about that is that / is just sitting there mutable, given how important those symlinks are. It clashes a bit with things like wanting to ensure all read files are only from verity-protected paths and things like that. These things are closer to quibbles though, and I’m sure some folks are successfully shipping systems where they don’t have those compatibility symlinks at all.

The bigger problem though is all the things that never did “/usr move”, such as /opt. And for many things in there we actually really do want it to be read-only at runtime (and more generally, versioned with the operating system content).

Finally, /opt is just a symptom of a much larger issue that there’s no “/usr merge” requirement for building application containers (docker/podman/kube style) and a toplevel, explicit goal of bootc is to be compatible with that world.

It’s for these reasons that while historically the ostree project encouraged “/usr merge”, it never required it and in fact the default / is versioned with the operating system – defining /etc and /var as the places to put persistent machine local state.

The way bootc works by default is to continue that tradition, but as of recently we default to composefs which provides a strong and consistent story for immutability for everything under / (including /usr and /opt and arbitrary toplevels). There’s more about this in our filesystem docs.

In conclusion I think what we’re doing in bootc is basically more practical, and I hope it will make it easier for people to adopt image-based systems!

tar::Builder isn’t Send

I recently made a new project in Rust that is generating multiple bootable operating system disk image types from a "pristine" image with the goal of deduplicating storage.

At one point I decided to speed it up using rayon. Each thread here is basically taking a pristine base (read-only), doing some nontrivial computation and writing a new version derived from it. The code is using .par_iter().try_for_each(); here the rayon crate handles spinning up worker threads, etc.

That all worked fine.

Then later, due to some other constraints I realized it was better to support writing to stdout in addition. (This code needs to run in a container, and it’s easier to podman run --rm -i myimage --output stdout > newfile.iso instead of dealing with bind mounts.)

I came up with this:

enum OutputTarget<W: std::io::Write> {
    Stdout(W),
    Tar(tar::Builder<W>),
}

Basically if you’re only asking for one file, we output it directly. If you ask for multiple, we wrap them in a tarball.

But, it didn’t compile – there was an error message about tar::Builder not having the Send trait that pointed at the closure being passed to the rayon try_for_each(). I’ve been using Rust long enough that I understand Send and immediately realized the problem: multiple worker threads trying to concurrently write to the same tar stream just can’t work. (And the same is true for stdout, but the compiler can’t know for sure there’s only one thread in that case.)

But, I still wanted the parallelism from doing the actual file generation. Some refactoring to more cleanly split up "generate files" from "output files" would have been cleanest, and probably not hard.

But this project was still in the fast iteration/prototyping phase so I decided to just wrap the OutputTarget enum to be an Arc<Mutex<OutputTarget>> – and that compiled and worked fine. The worker threads still parallelize generation, then serialize output.

Other languages don’t do this

This project is one of those that honestly could have easily started in bash or Python too. Or Go. But those languages don’t have built-in concurrency protections.

Out of curiosity I just wrote a quick Python program to write to a tarfile from multiple threads. As expected, it silently generated a corrupted tarball with intermixed content. (At this point hopefully everyone knows basically to avoid threads in Python since they’re mostly useless due to the GIL, of course)

And also as expected, a lightly modified example of the code from the Go archive/tar example compiles fine, and generates corrupted output. Now this is a well known problem in Go given its heavy focus on concurrent goroutines, and to be fair go run -race does correctly find errors here. But there’s a bunch of tradeoffs involved there; the race detector is only probabilistic, you have to remember to use it in your CI tests, etc.

I’m really not saying anything here that hasn’t been said before of course. But this was my experience this week. And it’s experiences like this that remind me why I sunk so much time into learning Rust and using it for new projects.

Why I work on OpenShift and Fedora/RHEL

Every weekday for many years now I’ve woken up, dropped my kids off at school, then grabbed a coffee and sat down at my computer to work on OpenShift and Fedora+RHEL.

Doing this for so long, over time I’ve thought about and refined the why I do this, and I want to write it down so that I can refer to this in various places. Some of this is a more condensed/rephrased variant of this blog post.

I was inspired to be here originally (over 20 years ago) by the Free Software movement – one thing in particular I remember is seeing the Emacs start screen linking to the FSF website on our school’s Solaris workstations (In app advertisement worked!). Along with that, one thing I always found fascinating about software in general is the feeling of "the power of creation" – I can type something and make it happen.

Since then, software has become much, much more foundational to our society (in some cases, probably too much re: social media, etc). In particular here the rise of software-as-a-service and the public clouds. And while we say "public" which has connotations of "openness" – these are all very proprietary clouds.

For businesses (and individuals), the level of "lock in" that can occur with these clouds is profound. I work on OpenShift and RHEL because it helps blunt the power and lock-in of those clouds – and crucially we support exactly the same software for on-premise. (There’s really a whole blog post to write about the CoreOS model of having the exact same bits across multiple clouds/onpremise and making that work sanely). And our code is FOSS.

I pay extra attention to bugs for people who are doing Fedora CoreOS, OpenShift/RHEL on premises (particularly bare metal) both because the problem domain is interesting, but I personally want them to succeed. And more generally, computers are everywhere – and so is Linux. Outside the datacenter, from "edge devices" to workstations, I think Red Hat has been a powerful force to sustain FOSS and Linux.

Broadly speaking, a lot of the "why" here aligns with the broader Red Hat values – and that’s one of the reasons I’ve been at Red Hat so long (16 years this month). Initially I worried this would change with the IBM acquisition, but that hasn’t happened at all; I think Red Hat still has a lot of the same "soul" as the company I joined so long ago – although it’s grown many times that size! And despite that growth, it feels far from a "Big Tech" company. On that topic, we have a lot of interesting job openings!

To summarize: Computers are everywhere; FOSS is an important part of ensuring that users/businesses have control over their computers, not the other way around. And Red Hat has done (and I believe will continue to do) a good job of sustaining FOSS in the enterprise. So that’s why when I sit down with my coffee and turn on the computer, it’s to work at Red Hat.

Still on Github

Over 4 years ago now, I wrote about moving ostree to Github, and I wanted to add an update here. I still think it was the right move.

Free Software is important to me – but I think Github overall provides a lot more benefit to FOSS than harm from its mostly proprietary nature. Providing a zero-cost mostly reliable featureful platform (also with various zero-cost CI available) is a huge accelerant to all the FOSS projects that use it. And whenever I have to try to contribute a patch via email that has no CI checking I sometimes just want to throw up my hands and move on.

But for the people who don’t agree with me and think Free Software needs free tools – I say awesome. I am very glad you exist, and really there’s about 20% of me that also agrees. That part of me is happy when I come across projects hosted in e.g. Gitlab.com at least. It’s obviously good for there to be some diversity and competition, beyond the fact that Gitlab is at least at the core FOSS. I also hope at some point somehow pagure’s model of storing issues and PR comments in Git takes off too. Or maybe it’ll be something like Radicle.

Anyways, that’s really all there is to say – I continue to use Github for those reasons but I’m happy to see new tooling that might also win in the future. Or just cool developments in existing tools. My goal here is just to have these current thoughts written down so I can link to it in various places.

Committed to the integrity of your root filesystem

Quite a while ago I came across the SQLite testing page and was impressed (and since then it’s gotten even better). They’ve clearly invested a lot in it, and I think SQLite’s ubiquity is well deserved.

When I started the ostree project I had this in mind but…testing is hard. We have decent "unit test style" coverage since the start but that’s not very "real world". We’ve gone through a few test frameworks over the years. But to the point of this blog post: I finally had a chance to write some new testing code and I’m happy with how it turned out!

TL;DR: There’s a new "transactionality" test run on every PR that uses a mix of e.g. kill -9 ostree and reboot -ff while updates are running, and verifies that you either have the old or new system safely. (PRs: ostree#2048 and ostree#2127).

But along the way there were some interesting twists.

Test frameworks and rebooting

I mentioned we’d been through a few test frameworks. An important thing to me is that ostree is a distribution-independent project; it’s used by a variety of systems today. Ideally, our tests can be run in multiple frameworks used by different distributions. That works easily for our "unit tests" of course, same as it does for many other projects (make check style tests that are nondestructive and run as non-root).

But our OSTree tests want a "real" system (usually a VM), and further the most interesting tests need to be destructive. More than that, we need to support rebooting the system under test.

I’d known about the Debian autopkgtest specification for a while, and when I was looking at testing I re-evaluated it. There are some things that are very Debian-specific (how tests are defined in the metadata), but in particular I really liked how it supports reboots.

There’s a big tension in test systems like this – is the test logic primarily run on the "system under test", or is it on some external system which manages the target via e.g. ssh? We had lots of problems in our prior test frameworks was dealing with reboots with the latter style. Plus the latter style tends to strongly tie the test code to the test harness.

In the Fedora CoreOS group we use a system called "kola" which came from the original CoreOS project. It knows how to boot systems using Ignition in various clouds along with qemu. I added partial support for the Debian Autopkgtest specification to it (cosa#1528).

Avoiding shell script

A lot of the original ostree tests are in shell script. I keep finding myself writing shell even though I also keep being badly burned by it from time to time.

So another tangent along the way here: For writing new tests I’d resolved to use "not shell script". Python would be an obvious choice but…another large wrinkle here is that in CoreOS we don’t want interpreters in the base OS – they should run as containers (yes, a shell is obviously an interpreter too but…). So going the interpreted test route would drive us towards having our test framework run as a privileged container. I decided not to do this for a few reasons; the biggest is that makes it much harder to test the system as other processes see it.

My preferred language nowadays is Rust, and it generates static-except-libc binaries that we can just copy to the host. Further, fortuitously someone else created Rust bindings to ostree and I’d been wanting an excuse to use that for a while too! However…some things are just too verbose via API, and plus we want to test the CLI too. Invoking subprocesses via Rust std::process::Command is also very verbose. So I ended up creating a sh-inline crate for Rust that makes it ergonomic to include snippets of strict mode bash in the code. This snippet is a good example. I’d like to make this even more ergonomic too, but my proc-macro-fu isn’t there yet.

Actually writing the test

OK so all those prerequisites out of the way, the first thing I did was write the code to do the "try upgrading and while that’s running, kill -9 it". That went reasonably quickly and worked well, so I moved on to the more interesting case of adding reboot -ff (simulating immediate power loss) as another "interrupt strategy". This excercises the whole stack through the kernel, particularly interactions with the filesystem.

However, this required completely rewriting the control flow because here the "test harness" is also being forcibly killed. We don’t want to rely on persisting our state to the disk on the system. I ended up serializing the process state into AUTOPKGTEST_REBOOT_MARK, which gets stored in the harness and passed back when the process starts again. Effectively then the test code becomes a sort of coroutine with the harness.

Found problems

Depending on how you look at it, fortunately or unfortunately: none so far. One motivation for writing this test was to try to reproduce a bug a user filed that showed an error message from the boot loader configuration handling code. I haven’t managed to reproduce that yet. I did manually inject some faults in the code and verify that the test failed of course. And in the past I’ve of course done some manual testing to verify that ostree does what it says on the box for implementing transactional upgrades. But there’s clearly more to explore here.

Next steps

One thing I plan to explore next here is fault injection, probably with strace fault injection. This may also combine well with adding support for the harness to request explicit sleep() calls to widen the window on possible races. Plus so far while I’ve mentioned support for other distributions, this is only testing Fedora CoreOS in its default mode; e.g. we’re only validating xfs and not other Linux filesystems, etc.

Are we testing like SQLite yet?

Definitely not, but I’m happy that I made some progress closer to that goal! It was an interesting project and I’m looking forward to building more of it per above. Outside of OSTree, the goal of this blog was write down some of the "lessons learned" for others working in this space. For example, I hope some people working in the Linux-based OS testing space look at the Debian autopkgtest; it can be hard to come to consensus on test frameworks and standards, but there are at least some good ideas there. Also I think the mix of "Rust with some inline shell script" worked pretty well for these types of tests; particularly if the CLI outputs JSON, deserializing with Serde is great. Though taking the Rust compile time hit for tests is a downside.

But in the end, I can at least now say that every pull request to OSTree runs through a test suite that ensures it survives being forcibly terminated while an update is running. The integrity of your root filesystem is very important to me – it should be robust and image-like, but still a Linux system in the end. If this sounds good to you, I hope you check out one of the distributions that use it!

“Immutable” → reprovisionable, anti-hysteresis

This is going to be a longer blog entry, but here’s a TL;DR:

I propose that instead of "immutable" or "read-only" when talking about operating systems (such as Fedora CoreOS, Google COOS, Flatcar etc.), we use these terms:

"fully managed": The system does not have "unmanaged state" – e.g. an admin interactively doing ssh and making changes not recorded declaratively somewhere else
"image based": Traditional package managers end up with a lot of "hidden state" (related to above); image based updates avoid that
"reprovisionable" and not a "pet": I don’t like the industry "pets vs cattle" term, and I think "reprovisionable" is both nicer and more descriptive
"Has anti-hysteresis properties": (Yes I know this is an awkward term) See https://en.wikipedia.org/wiki/Hysteresis – I’ll talk more about this later

(Terminology note: In this article also I will use the abbreviation "pkgmgrs" for "traditional package managers like apt/yum". Systems like NixOS and some aspects of swupd from Clear Linux improve parts of what I’m talking about, but this article is already really long and a detailed comparison including those really deserves a separate post)

Why not "immutable"/"read-only"?

Because it’s very misleading. These system as a whole is not immutable, or read-only, or stateless – there are writable, persistent data areas. And more importantly, those writable data areas allow persistently storing privileged code. They have to because these OSes need to support:

the user being root on their own computer
In place OS updates

(What about systems that don’t support "in place" updates? Yes, there are people/organizations who e.g. build a new cloud image for every change, and often don’t even enable ssh or any persistent writable state for the OS. This is fine, but one problem is it doesn’t generally apply outside of cloud/IaaS environments on e.g. bare metal machines, and it can make upgrades for small changes very disproportionately expensive. )

Back to operating systems with state that some people call "immutable":

But /usr is read-only!

Yes. And this does have some security benefits, e.g. this runc vulnerability isn’t exploitable.

But in order for the operating system to be updated in place, there must be some writable area to add new OS content – so it’s not immutable. The details of this vary; a number of "image based" operating systems use dual partitions, OSTree is based on hardlinking with a "hidden" writable data store.

The real reason to have a read-only /usr is to make clear that the content of that directory (the operating system binaries) are "fully managed" or "owned" by the OS creator – you shouldn’t try to overwrite or replace parts of it because those changes could be overwritten by a future update.

And this "changes in /usr being overwritten" is a real existing problem with traditional package-manager systems (pkgmgrs). For example, a while ago I was looking at Keylime and came across this bit in the installer. That change would be silently overwritten by the next yum/apt update, so the system administrator experience would be:

Provision system
Install things (including keylime)
⌛ Time passes
Apply OS updates (not on by default), then keylime breaks for a not obvious reason

The more correct thing instead would be for that playbook to write a systemd drop in in /etc to override just ExecStart=, although even doing that is fragile and what’d be best here is to make this an explicitly configurable option for tpm2-abrmd in a config file in /etc.

The overall point is that the reason /usr read-only is primarily to enforce that user configuration is cleanly separate from the OS content – which becomes particularly important when OS updates are automatic by default, as they are in Fedora CoreOS.

I think having automatic updates on by default fundamentally changes the perception of responsibility around updates; if I’m a system administrator and I typed apt/yum update and things broke, it’s my fault, but if automatic updates are on by default and I’m doing something else and the machine just falls over – it’s the OS vendor’s fault. Linking these two together: Since Fedora CoreOS has automatic updates on, we really need to be clear what’s our responsibility and what’s yours.

Now, this isn’t a new problem, and most people maintaining systems know not to do the kinds of things that Keylime Ansible playbook is doing. But it’s an extremely easy mistake to make without strong discipline when /usr is sitting there writable by any process that runs as root. I’ve seen many, many examples of this.

Nothing actually stops traditional package managers from mounting /usr read-only by default – they could do the equivalent of unshare -m /bin/sh -c 'mount -o remount,rw /usr && apt update‘ internally. But the challenges there grow into adjusting the rest of the filesystem layout to handle a readonly /usr, such as how OSTree suggests moving /usr/local to /var/usrlocal etc.

Image based updates

Usually instead of talking about an "immutable" system that allows in place updates, it’d be more useful and accurate to say "image based".

And this gets into another huge difference between traditional package managers and image based systems: The amount of "internal state".

The way most package managers work is when you type $pkgmgr install foo, the fact that you want foo installed is recorded by adding it to the database. But the package manager database also includes a whole set of "base packages" that (usually) you didn’t choose. Those "base packages" might come from a base container when you podman/docker pull, for cloud images the default image, and physical systems they often come from a distribution-specific default list embedded/downloaded from the ISO or equivalent.

A problem with this model then is "drift" – by default if the distribution decides to add a package to the base set by default, you (usually) don’t get it by default when applying in place updates since most package managers just update the set of packages you have. One solution to this is metapackages, but if not everything in the base is covered you still have drift that can be hard to notice over time.

I think for users of many pkgmgrs this "initial state" is hard to disentangle from things you typically do care about like the packages you chose to install. There is e.g. apt-mark showmanual and dnf history userinstalled commands.

And…trying that out by pulling the docker.io/debian:stable image, it claims:

# apt-mark showmanual
iproute2
iputils-ping
#

And that’s the first command I ran in the image! Clearly a bug somewhere. For the fedora:32 base image it lists a bunch of packages that correspond to the bits in the base kickstart – but that’s not something I as the user wrote.

By analogy with /usr vs /etc – this is like mixing local configuration in /usr.

This problem extends beyond the "user installed" database: traditional package managers aren’t aware of the "base bootimage" which operates on a separate infrastructure layer. apt has no idea about the of the OpenStack image/AMI/qcow2 or whatever that formed its initial state, nor is it aware of the OCI/docker container initial image (and conversely, e.g. podman/docker have no idea that yum/apt etc. are running inside).

So over time, the state of the system with traditional pkgmgrs is a function of many things:

Which packages you chose to install (obviously)
The set of packages from the initial "bootimage" or container image
More subtle things like which packages are in the "user installed" database
Even more subtle things can happen when weak dependencies like Recommends change in upstream packages
The package manager version: RHEL8 yum has autoremove on by default, RHEL7 and older yum doesn’t

One solution to this type of "drift" is to not use packages at all (pure "base OS" + "apps/containers") like Google COOS, or to group things at a higher level (Clear Linux is more in this bucket).

I’m pretty happy though with the design we came up with for rpm-ostree used by Fedora CoreOS/Silverblue/IoT; there is a clear "base commit" that comes in OSTree format, and you can add packages on top – recasting RPMs as "operating system extensions" (see also this OpenShift enhancement).

For rpm-ostree it’s really simple – by default it operates in pure ostree mode by default, so if you don’t layer/override any packages you are exactly replicating an ostree commit – and that’s it! You don’t need to think about packages by default.

Particularly for Fedora CoreOS, there is almost nothing in the "bootimage" (ISO, AMI equivalent) that isn’t part of the ostree commit.

In other words, "state of installed software" is a function of (effectively) one thing by default:

The ostree commit

It’s even stronger than that really, it’s not just "same packages" it’s "bit for bit identical /usr filesystem". However, there is one important note: /boot does come from the bootimage, see this issue.

Bootloader aside, effectively all of the OS state you care about then does not depend on which bootimage you happend to use to install initially. When OSTree performs an update, it does not matter what the "previous" commit was – the old and new implicitly share files via the hardlink store, but updates always involve a "fresh checkout" of the new commit. Every upgrade is like a fresh OS install of that version with your configuration (/etc) and state (/var) re-applied.

With rpm-ostree being a hybrid system, you can choose to engage package layering (or overrides). But the system very clearly highlights that list; note a major simplification is combining the "packages you installed" and "user installed" lists. The rpm-ostree model is very simple: you have a "base commit/image" and your extensions. For example:

$ rpm-ostree status -b
State: idle
BootedDeployment:
● ostree://fedora/32/x86_64/silverblue
                   Version: 32.2 (2020-08-22T17:28:53Z)
                BaseCommit: 080312021f34c7763089ff12fcd2964647e0f55ac3981f869b56d232a33990f6
           LayeredPackages: fish libvirt tmux virt-manager

An important but subtle detail in achieving this simplification: by default, rpm-ostree doesn’t allow marking a base package as user installed. Generally the idea is that removing user-interesting packages from the base image is something you shouldn’t do.

rpm-ostree goes to some lengths internally to make this split happen; the libdnf/rpm layers don’t have any model of "base image" because everything’s a package to them.

Has anti-hysteresis properties

I know "has anti-hysteresis properties" is an awkward phrase (and I’m happy to hear alternatives) but I think hysteresis is a great term that we should start using in computing. Today it seems to mostly be used in the sciences but I propose adopting it – this in the spirit of making computer science more like a real science.

Let’s take a look specifically at elastic hysteresis because it’s easy to understand and even try at home.

Basically, rubber bands have "hysteresis" ("hidden state"/"memory") which comes from how much it was stretched in the past. And this state is basically impossible to see by just looking at the rubber band. For a related example with rubber, see the two balloon experiment.

To tie together the previous section on package managers with this:

Systems managed by traditional package managers (apt/yum/etc) have a lot of effective hysteresis. I think even many experienced system administrators would have trouble confidently and precisely explaining how the multiple things listed above (the container or IaaS base image, package manager user installed database, etc.) all interact in forming the final state of the system over time as in-place upgrades are applied.

Configuration management systems and hysteresis

This "hysteresis" problem occurs not just in package managers but also many configuration management systems (puppet/ansible/etc).

A simple example I’ve seen happen is where the system administrator writes a playbook (or equivalent) that does e.g.:

- name: Allow nopasswd for wheel
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%wheel ALL='
    line: '%wheel ALL=(ALL) NOPASSWD: ALL'

Then later, say the organization wants to change to use a separate group instead of wheel, say admins or whatever.

If the playbook is changed in git to do:

- group:
    name: admin
    state: present
- name: Allow nopasswd for admins
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%admin ALL='
    line: '%admin ALL=(ALL) NOPASSWD: ALL'

The previous change to modify wheel in /etc/sudoers will silently persist (until the system is reprovisioned). And that could become a security problem even in this case.

In most of these configuration management systems, in some cases the admin may need to explicitly add a change which reverts a prior change, and then makes the new change. But not all of the time – some (most) changes don’t need this.

It’s an easy mistake to make when writing effectively arbitrary code to change files in persistent filesystems.

Hence, configuration management systems are subject to hysteresis too, and I think many of them could do better in warning users about this, and pushing for better practices. For example, the playbook would be more "anti-hysteresis" if it wrote to /etc/sudoers.d/mycustom.conf which gets replaced entirely, though /etc/sudoers.d is only supported by relatively modern sudo I think.

Kubernetes is fairly opinionated in having code in container images you pull (equivalent of /usr), and storing configuration in a configmap (which would get projected into environment variables or files in /etc). When you update a deployment, all state in the (sadly writable by default) pod container filesystem is thrown away, and there’s also no leakage from any previous version of a configmap. So we could say that the Kubernetes approach to applications has strong "anti-hysteresis properties".

The OpenShift Machine Config Operator defaults to anti-hysteresis

Tying together the Kubernetes and operating system threads: in OpenShift 4, the machine-config-operator allows you to write config files and systemd units into the operating system /etc by using kubectl/oc. (The original goal of etcd was in fact to do this, then Kubernetes happened and the focus shifted to that layer. In OpenShift 4 we are meeting that original goal of storing the Unix /etc in etcd via the MCO.)

The reason I claim the MCO has "anti-hysteresis" is it keeps track of the old and new system states reliably and is able to diff them. For example, if you write a config file for chrony to set the timeserver, then later kubectl delete machineconfig/my-chrony-config since you’re fine with the default, the MCO will notice that the old config wrote /etc/chrony.conf and the new one doesn’t, and it will correctly revert the file back to the default.

Just like OSTree has a checksum describing the state of /usr, the MCO maintains a checksum for its state and when you look at a node, you can say its configuration is e.g. rendered-master-<checksum>. If a system can describe its state with a checksum, that implies it has strong anti-hysteresis properties.

Now, there are holes in this model. If for example instead of writing a file directly, you create a systemd unit which does e.g. ExecStart=/bin/echo somedata > /etc/someotherfile, and you later delete that unit – the file will persist. The reason why relates to this FAQ.

A general pattern here is that any place you have arbitrary code that changes over time writing to persistent files, you’re at risk of hysteresis (or "unmanaged state").

Reprovisionable

OK, so systems with anti-hysteresis properties are good. But in practice, I think there’s always going to be that small amount of "unmanaged state" that sneaks in even in organizations with strong discipline. For example, a system administrator trying to debug one node and using ssh to edit a file directly to increase the debug level of a service, and then later that causes a problem by flooding the log system or causing more I/O to the local filesystem and increasing latency for other services.

And this problem isn’t just at the operating system layer; at the IaaS/CaaS layer it’s easy to have VMs or containers that were created manually to debug something and then "leak" unless actively removed.

In an IaaS deployment there are a wide variety of objects in general (storage buckets, SaaS etc.) and equally many tools to deal with leaks at that level; usually this boils down to a "resource tagging" approach. (One thing I think is nice about GCP over e.g. AWS is the "project" approach, specifically this bit: "This model can also be useful for testing purposes: once you’re done with a project, you can delete the project, and all of the resources created by that project will be deleted as well.")

At the operating system level (and at the IaaS level if you can too), I think a good way to deal with this is to periodically reprovision, e.g. once a month (if you can do faster, great) on a rolling basis. In OpenShift 4 for example with the machine-api-operator that would just be a small amount of code (a custom controller running as a pod) to periodically kubectl delete machine/<somemachine> based on whatever criteria you want – the platform will handle the rest, spinning up a new one to take its place. Currently this only applies to workers but I hope we can cover the control plane in future releases. A neat thing about this is that the IaaS layer (virtual machines) are just Kubernetes custom resources that are managed via the cluster.

Conclusion: We want reprovisonable, anti-hysteresis systems

There a whole lot of current terms for what I’ve covered above, "gitops", "managed configuration", "cattle", stateless", "immutable infrastructure" etc. I’m suggesting the goal is: reprovisionable infrastructure with anti-hysteresis properties. But, I’d also be happy if we used "reprovisionable" instead of "cattle", and also if we introduced the term "anti-hysteresis" instead of "immutable" (where applicable).

On BTRFS

There’s been a lot of discussion on this proposed Fedora change for Workstation to use BTRFS.

First off, some background: I reprovision my workstation about every 2-3 months to avoid it becoming too much of a "pet". I took the opportunity for this reprovision to try out BTRFS again (it’d been years).

Executive summary

BTRFS should be an option, even an emphasized one. It probably shouldn’t be the default for Workstation, and shouldn’t be a default beyond that for server use cases (e.g. Fedora CoreOS).

Why are there multiple Linux filesystems?

There are multiple filesystems in the Linux kernel for good reasons. It’s basically impossible to optimize for all use cases at once, and there are fundamental tradeoffs to make. BTRFS in particular has a lot of features…and those features have costs. Not every use case needs those features, and the costs can be close to prohibitive for things like databases.

BTRFS is good for "pet" systems

There is this terminology in the industry of pets vs cattle – I once saw a talk that proposed "elephants vs ants" instead which is more appealing. Lately I tend to use "disposable" or "reprovisionable" for the second term.

I mentioned above I reprovision my workstation periodically, but it’s still somewhat of a "pet". I don’t have everything in config management yet (and probably never will); I change things often enough that it’s hard to commit to 100% discipline to record every change in git instead of just running a CLI or writing a file. But I have all the important stuff. (And I take backups of data separately of course.)

For people who don’t have much in configuration management – the server or desktop system that has years of individually built up changes (whether from people doing things manually over ssh or interactively via a GUI like Cockpit, being able to take a filesystem snapshot of things is an extremely compelling feature.

Another great BTRFS-style use case is storing data like your photos on a local drives instead of uploading them to the cloud, etc.

The BTRFS cost

Those features though come at a cost. And this back to the "pets" vs "disposable" systems and where the "source of truth" is. For users managing disposable systems, the source of truth isn’t the Unix filesystem – it’s most likely a form of GitOps. Or take the case of Kubernetes – it’s a cluster with the primary source being etcd.

And of course people are using storage systems like PostgreSQL or Ceph for data, or an object storage system.

The important thing to see here is that in these cases, the "source of truth" isn’t a single computer (a single Unix filesystem) – it’s a distributed cluster.

For all these databases, performance is absolutely critical. They don’t need the underlying filesystem to do much other than pass through writes to disk, because they are already managing things like duplication/checksumming/consistency at a higher level.

As most BTRFS users know (or have discovered the hard way) you really need to use nodatacow for these – effectively "turning off" a lot of BTRFS features.

Another example: virtual machine images which is an interesting one because the "pet" vs "disposable" discussion here becomes recursive – is the VM a pet or disposable, etc.

Not worth paying for reprovisionable systems

For people who manage "reprovisionable" systems, there’s usually not much value using BTRFS for things like operating system data or /etc (they can just blow it away and reprovision), and a clear cost where they need to either use nodatacow on the things that do matter (losing a lot of the BTRFS features for that data), or explicitly use e.g. xfs/ext4 for them, going back into a world of managing "mixed" storage.

In particular, I would strongly argue against defaulting to BTRFS for Fedora CoreOS because we are explicitly pushing people away from creating these types of "pet" systems.

To say this another way, I’ve seen some Internet discussion about this read the proposed change as applying beyond Fedora Workstation, and that’s wrong.

But if you e.g. want to use BTRFS anyways for Fedora CoreOS (perhaps using a separate subvolume for /var where persistent container data is stored) that would be mounted with nodatacow for things etcd that could make sense! We are quite close to finishing root filesystem reprovisioning in Ignition.

But a great option if you know you want/need it!

As I mentioned above, my workstation (FWIW a customized Silverblue-style system) is a seems like a nearly ideal use case for BTRFS. I’m not alone in that! I’m likely going to roll with it for a few months until the next reprovisioning time unless I hit some stumbling blocks.

However, I am already noticing the Firefox UI periodically lock up for seconds at a time, which wasn’t happening before. Since I happen to know Firefox uses SQLite (which like the other databases mentioned above, conflicts with btrfs), I tried this and yep:

walters@toolbox> find ~/.mozilla/ -type f -exec filefrag {} \; | grep -Ee '[0-9][0-9][0-9]+ extents found'
firefox/xxxx.default-release/storage/.../xxxx.sqlite: 1825 extents found
...

And that’s only a few days old! (I didn’t definitively tie the UI lockups to that, but I wouldn’t be surprised. I’d also hope Firefox isn’t writing to the database on the main thread, but I’m sure it’s hard for the UI to avoid blocking on some queries).

I just found this stackoverflow post with some useful tips around manually or automatically defragmenting but…it’s really difficult to say that all Fedora/Firefox users should need to discover this and make the difficult choice of whether they want BTRFS features or performance for individual files after the fact. Firefox upstream probably can’t unilaterally set the nodatacow option on their databases because some users might reasonably want consistent snapshots for their home directory. A lot of others though might use a separate backup system (or Firefox Sync) and much prefer performance, because they can just restore their browser state like bookmarks/history from backup if need be.

Random other aside: sqlite performance and f2fs

In a tangentially related "Linux filesystems are optimized for different things" thread, the f2fs filesystem mostly used by Android (AFAIK) has special APIs designed specifically for SQLite, because SQLite is so important to Android.

Conclusion

All Fedora variants are generic to a degree; I don’t think there will ever be just one Linux filesystem that’s the only sane choice. It makes total sense to have BTRFS as a prominent option for people creating desktops (and laptops and to a lesser degree servers).

The default however is an extremely consequential decision. It implies many years of dealing with the choice in later bug reports, etc. It really requires a true committment to that choice for the long term.

I’m not sure it makes sense to push even Linux workstation users towards a system that’s more "pet" oriented by default. How people create disposable systems (particularly for workstations) is a complex topic with a lot of tradeoffs; I’d love for the Fedora community to have more blog entries about this in the Magazine. One of those solutions might be e.g. using a BTRFS root and using send/receive to a USB drive for backups for example!

But others would be about the things I and others do to manage "disposable" systems: managing data in /home in git, using image systems like rpm-ostree for the base OS to replicate well known state instead of letting their package database be a "pet", storing development environment as a container image etc. Those work on any Unix filesystem without imposing any runtime cost. And that’s what I think most people provisioning new systems in 2020 should be doing.

Controlling ssh via kubectl

In a recent conversation, an OpenShift 4 administrator used the phrase “Because SSH is disabled in these AMIs…”. But that’s not the right way to think about things! With OpenShift 4, the goal is that all aspects of the operating system are controlled by the cluster and one can configure them in a uniform fashion “day 1” and “day 2” via kubectl/oc.

Let’s unpack this a bit more. First, with Fedora/RHEL CoreOS, we have a high degree of uniformity for the base OS across platforms – for example, we avoid including any platform specific agents, and we also try really hard to have the OS work the same way across platforms because while we have different disk images per platform (unavoidable), we have exactly one in-place update stream that applies across all of them.

The state of having SSH enabled or not does not vary across platforms or “AMIs” – it works exactly the same on a bare metal OpenShift 4 install as it does in AWS.

Second, we use Ignition and the machine config operator as a uniform way to configure things – so if you want to enable or disable SSH or configure keys – it again works exactly the same across platforms.

Specifically, if you provide SSH public keys to the installer configuration that basically ends up generating an Ignition config that applies on firstboot – the “AMI” here isn’t relevant.

This part of “configure the OS via Ignition” is common to Fedora and RHEL CoreOS. With OpenShift 4 we also have integrated “day 2” changes, so one can e.g. update SSH keys “day 2”.

To emphasize this, as an OpenShift 4 administrator, you manage your ssh keys via oc/kubectl – the same way you manage other aspects of the cluster, and the same way across platforms.