Project Atomic + Docker: A post-package world?

I recently was talking with a friend over lunch about Project Atomic and Docker, and he asked: are we entering a “post-package” world?

My short answer: No. The slightly longer answer is that we’re seeing an evolution of delivery coupled with a lot of innovation in management and orchestration.

Evolution of delivery

As part of Project Atomic, we’re evolving from the context of a “traditional” distribution, where distribution is a set of packages. The Project Atomic pattern is introducing two new higher order delivery vehicles: Docker, and rpm-ostree (also nicknamed via symlink atomic). The theme behind the name Atomic is that both of these technologies group software into indivisible units of management.

Let’s look at two artifacts from Project Atomic we’re working on in the Fedora distribution that are shipped in this way: the Atomic Cloud Image and the Docker Container Image. An essential fact to note is that both artifacts are composed of RPM packages.

For example, both the host system and container share a set of essentials such as the bash package. In fact, the idea is, at release they will have the same binary version. Both the Docker base image and the Atomic tree are reflecting the upstream RPM content. This is quite crucial for a distribution such as Fedora; both from the perspective of the maintainers as well as the downstream consumers. If you want to check whether an Atomic host or a container is affected by a security vulnerability, you can use the regular rpm -q, or any of the many higher order tools and frameworks built upon that core concept of an inventory of versioned component parts.

Runtime management power

So what’s the value, then, in boxing up the same old packages in new ways? For Docker base images (and derived images), there’s a massive increase in flexibility – it makes Linux containers very, very easy to use. A simple example is that the host system can now be decoupled from the applications; when the Fedora 22 release of Atomic comes out, your Fedora 21 base image containers can function effectively unchanged, except they will have a newer kernel. You can take advantage of newer hardware support in the kernel or other host features, and stage a migration of applications to newer base images.

It’s much easier now to take those same RPM packages for services and multi-instantiate. For example, you can have a Docker base image that contains a mariadb-server RPM, and instantiate multiple writable containers from that, each with their own copy of /etc/my.cnf.

Things get even more interesting with projects like geard, which make it easy to spin up and configure many containers across multiple host systems. This sort of orchestration is much more complex and expensive with virtual machines.

The continuing need for packages

Whenever a single organization starts to produce multiple products, there comes a very strong pull to define a common shared base. And the Project Atomic artifacts are not the only product of Fedora! It still needs to deliver traditional products, such as the Server and Workstation.

In the Server case, for example, there will obviously be a strong continuing demand for a virtualization host system, manifesting as projects such as oVirt. There’s also a case for a system capable of both virtualization and Linux containers.

And I think the Workstation case still makes sense. I use Linux and other Free Software on my desktop for real work – with Docker, it’s quite nice to be able to test server containers locally before pushing them. I can have the same Docker version on my workstation and servers, or decouple them. The distribution mechanism should continue to cover this. For that matter, the desktop system I’m writing from of course has virtual machines running Atomic, thus bringing four deliverables together.

One might ask: does it even make sense to do this many products within the context of a single organization? I think it does. There are a lot of powerful benefits to still receiving atomic host system updates and Docker base images from the same organization. A lot of userspace is shared, and it allows crossover for things like management tooling.

Now, one can of course find sub-groups within the (large) Docker community that are farther down the “post-package” spectrum, but I just don’t think it’s a realistic viewpoint. As an example, look at the upstream docker-registry README.md. While it tells you how to pull the binary Docker from the upstream registry, it also documents how to acquire the individual pieces and run it directly. And there’s real reasons for that, such as being able to build the registry from source code and improve it. To do that, you need the build dependencies, distinct from the runtime. Yes, the README uses pip instead of dpkg/rpm, but the concept is the same.

To be clear then, it’s not about Docker replacing packages: the realistic endpoint is blending the strengths of the two technologies. One example of that some people have been looking at is using Docker as a buildroot construction system for RPM packages.

Finally, on the OSTree side, things are quite a bit simpler. Conceptually, it’s just a way to compose packages on a server (instead of per client), providing each with atomic upgrades. Then rpm-ostree is a tool bridging the world of RPM and OSTree; it’s very much oriented around being a complement to RPM. The rpm-ostree tool also links to hawkey to allow it to inspect and operate on the RPM database inside the trees. More information about that here.

Getting involved

With Project Atomic, we’re not just introducing new software; we’re attempting to change how we deliver software, something deeply fundamental to a distribution project like Fedora. And furthermore, we’re changing how it’s deployed and managed, which impacts application authors and systems administrators. That said, I believe the benefits of Linux containers and Docker are very real.

Want to get involved? Jump in on the Project Atomic community, or see active SIGs and discussions in targeted distributions such as CentOS and the Fedora Cloud group which is hosting the Changes/Atomic Cloud Image. There’s plenty to do in infrastructure and release engineering. Check out fedora-dockerfiles for lots of example Dockerfiles, and try building your own apps. And don’t hesitate to ask questions!

GNOME West Coast Summit end

The West Coast Summit 2014 is over now, and I’m glad I was able to attend. There’s absolutely no substitute for getting a distributed group of people together for face to face conversations about their common interest in GNOME. Thanks to Endless Mobile for providing their office as a venue and sponsoring the event!

It was really great to see familiar faces like Germán, Giovanni, and Kristian (among many others!). Breakout sessions on topics like GNOME on Wayland and Gjs were very successful. It was cool to see GNOME on Wayland (well, it looked the same actually which was the goal 😉 ). Giovanni did an amazing amount of work on investigating the Spidermonkey GC. Christian wowed people with a demo of Builder. I worked on Continuous and OSTree. In particular, on the OSTree branch for static deltas, which should significantly speed up downloads.

OSTree: rigorous and reliable deployment

I sometimes describe OSTree as being even more rigorous than traditional dpkg/rpm type package systems. Now, there are some of you out there who probably can’t imagine how that’s possible. You found packaging so tedious and painful that you gave up, and you now write Go code (because Google wrote it, it must be good, right?) and you hack on your MacBook from a coffee shop, and when you’re ready scp your statically linked binary to staging and then to production. Maybe you don’t even have staging. It’s so simple! Look how fast it is!

If you are one of those people, just think about what happens when you forgot to “git push” for a while, or you had “origin” be a local mirror or something, and then you lose your MacBook, and now you have a big statically linked blob running in production to which you no longer have the source.

In contrast to this developer, packaging is pretty rigorous. Production build systems ensure that all the source to particular artifacts are tracked, have a distinct, clean, and (mostly) reproducible build environment. For example, that your build system isn’t downloading stuff from the Internet in the middle.

On the deployment side, with packaging you can always log into your server and see what is installed, with version numbers. There’s a lot of advantages to that over a developer deploying binaries with scp.

Knowing what’s running

So how is OSTree more rigorous than traditional packaging? It’s very simple – when you run ostree admin status – you are also getting a description of what is running, not merely installed. At the moment, the simple implementation of that is that you must reboot to have a change take effect. On the plus side, you have fully atomic upgrades. But – we definitely can do partial live upgrades, which is the subject of this post.

With dpkg/rpm and friends, there isn’t a reliable link between the package system and the init system (today, systemd or historically one of the sysvinit implementations). For example, whether or not a service gets restarted on package change is up to the packager of the daemon, and furthermore it’s just a shell script called out from a %post. There’s nothing in the system to audit whether or not the daemon has been successfully restarted, and how that relates to the package change.

Conversely, it’s a pretty sad state of affairs that systemd is totally unaware of packaging. Now most existing administrators understand this, and know the technological/organizational/political[1] reasons this is the case.

Whether or not a daemon got restarted is only one of the obvious ways in which installed and running become distinct. A much more common case is upgrading a shared library such as libc.so.6. If we understand that not all daemons or code may be instantly restarted, then we have a situation where the package system is recording merely what’s installed – an administrator later logging in and debugging a failure may have to reconstruct that this system was live upgraded via noting the (deleted) suffix on the shared library in /proc/pid/maps. If they know to look there of course…

So how do I plan to preserve the present property that OSTree has in that it describes what’s running and not merely installed? Let’s be honest, it’s a hard problem. But take a simple case – we are running a tree with checksum ac81df, and we live-apply a subset of the files from the new tree 59da1a as an overlay on top of the running tree. Then ostree admin status might say something like this:

* fedora-atomic ccc6ff1d1d6fdfcb7309700af8fec5de61511767b6ed43f77feb549f7afcaefb.0
    origin refspec: local:fedora-atomic/rawhide/x86_64/buildmaster/base/core
  Dynamic overlays:
    e211ced8ab16c6e7eba1730b58b7d9fc4f6c8d285be9fcec01ffff4dfccd5cbf.0
      /usr/bin/bash
      /lib64/libc.so.6

Here we’d be seeing the case where our new tree pulled in an updated bash binary, and a new libc. Furthermore, we can backreference from the (device, inode) pair in any running processes /proc/pid/maps to the originating tree – because it won’t be physically deleted as long as it’s still referenced. Also, for any live-upgrade system via OSTree, I plan for it to be fully aware of systemd, and carefully audit the return values from service restarts, correlating it with the state of the filesystem.

This is still a relatively simple case. Think about the situation where you upgrade two or three times, and do partial live updates from each of them. OSTree would carefully maintain the precise manner in which you upgraded – it would be reproducible by others. You’d be able to backreference from any code in memory to the originating tree, which contains the manifest of binary versions, which finally link to source code.

Now with yum history, one could theoretically reconstruct a lot of this, but again yum (really rpm) suffers from being so generic that the core operation of interacting with things like systemd is just a callout to an un-audited shell script. Your current desktop and servers are probably a messy blend what I call “partial live updates”.

[1] And when I say political, let’s imagine what happens when someone posts the first patch to show the package name from systemctl status…

Giving a name to your root filesystem

First, OSTree v2014.1 is out! Nothing earthshaking, but I’m happy with some of the fixes and features there.

One thing that’s absolutely fundamental about OSTree is that it forces one to name complete filesystem trees. While the system does not mandate any convention (they’re just strings), you have seen some examples in previous posts, like gnome-continuous/buildmaster/x86_64-devel-debug and fedostree/20/x86_64/base/minimal. Here the “OS” name starts first, and after that, you can choose whatever format you want. Now, traditional dpkg/rpm packages are names for partial filesystem trees (plus some metadata and scripts that run as root). When they’re assembled via a package manager onto the root partition of your drive, that collection is not normally named – what you have is an anonymous, and very often unique, custom set of packages.

There are of course efforts in various package systems and GNU/Linux distributions to attempt higher level management of software beyond “set of packages”. In Debian, metapackages are common. In Fedora, there is comps.

I could talk for quite a while about the management differences between the metapackage vs comps approaches, particularly after YumGroupsAsObjects. But suffice to say that I think both suffer badly from being glued on top of the “set of packages” model. In many cases they end up making things more complex, not less. Here is a blog entry that describes how Debian’s metapackages clash badly with another tool which tries to remove “unused” packages. From my observations in the Fedora context, comps groups are mainly used for initial system installation (in Anaconda) and early set up perhaps you do yum install @virtualization after installing a workstation.

How OSTree is less flexible, but more rigorous

With OSTree, you can say something like “I’m running fedostree/20/x86_64/base/minimal”. This is a name for a filesystem that was replicated from the rpm-ostree build server – and it is immutable. OSTree itself comes with no application mechanism, or even the ability to layer trees. So this is a far stronger and more rigorous description of the contents of your (visible) root filesystem.

For example, with the current rpm-ostree, if I remove a package from products.json, then it drops out of the filesystem tree composed on the server side, and thus will also disappear when clients upgrade. It’s really quite simple. The problem of removing old, unused packages is a messy subject in package systems like dpkg/rpm – it’s painful at the distribution level with things like Obsoletes, and if you are a downstream consumer of the distribution, if you installed a package at some point on your servers that you want no longer installed, your best bet is to use something like Puppet to assert that packages are removed.

Now, you still may be thinking “OSTree sounds cool, but I want to be able to install things!”.

Downstream tree construction and naming custom trees

I mentioned in the previous post that I plan to implement a feature like yum-ostree install strace, which would assemble a new filesystem tree from packages (just like rpm-ostree does the server side), and set it up for the next boot. But an interesting question arises – how should I name this filesystem tree? We could represent “install” by appending the string “+strace” to our current tree; so we might end up with a tree named “fedostree/20/x86_64/base/minimal+strace”. Now obviously this doesn’t scale really far – and perhaps leads us back towards wanting e.g. a “tracing-and-debugging-tools” metapackage (or comps group); if you care to install strace, why not also perf? With Fedora’s comps, it’s actually quite nice that we have a reserved symbol “@” and a distinct namespace from the normal package set. So we could synthesize a name like “fedostree/20/x86_64/base/minimal+@tracing”.

What I’m going for here really is that I’d like to cut down on the combinatoric complexity of packages by emphasizing layering over arbitrary additions and removals. This doesn’t mean that we need to completely restrict the system to layering – one could clearly implement yum-ostree remove X (for naming, append e.g. “-X” to the tree name). The lowest OSTree level lets you put whatever filesystem trees you want in it. But for many cases where people want to do this kind of thing, we can turn it into configuration instead of system manipulation. For example, using systemctl mask firewalld.service over yum remove firewalld. If something is supported via system configuration, we should prefer doing that rather than creating new filesystem trees; it’s more efficient and safe to replicate a pre-built tree that’s been tested and known to work, then add configuration.

Why OSTree requires “/usr/etc”

While I often lump “dpkg/rpm” together since they’re very similar architecturally, their implementation of configuration file management is a small but nontrivial difference. This is because dpkg allows interactivity, rpm does not.

With dpkg, in the case where a file is modified locally by the user (say /etc/httpd/conf/httpd.conf), dpkg will prompt the admin interactively. It only has at this time the modified config file, and the new file. Now, personally I think dpkg prompting in the middle of open heart surgery on your root file system is completely insane. You really must use screen or equivalent if doing remote administration. But being able to see what files changed is useful.

RPM is quite opposite. While it also does open heart surgery, you might see one or two messages related to config files go by in the output spam. At the end, it’s up to you to search for .rpmnew and such in /etc. Of course there does exist tooling on top to detect this after – I’m just talking about the out-of-the-box experience. Furthermore, allowing packagers to create a distinction between %config and %config(noreplace) makes the whole thing very inconsistent from a total system perspective. It’s hard to know what will happen on an upgrade unless you know beforehand what the packager chose.

The handling of /etc for OSTree took me a while of thought. The executive summary is that OSTree requires the existence of /usr/etc which is read-only defaults. Whenever you do an upgrade (more generally, switching trees), OSTree does a basic 3-way merge. It doesn’t attempt to understand the contents of files – if you have modified a config file in any way, that wins. Unlike the simplistic “split partitions” approaches out there, this does mean that you get new default config files, and also any config files that were removed in the new tree also vanish.

Unlike the rpm and dpkg model, OSTree is fully atomic. The /etc I’m talking about on upgrades is a separate copy from your running /etc. That allows you to do interactively run whatever scripts or programs you want to ensure the system is correctly set up for the next boot, with total safety.

Another cool feature enabled by the existence of /usr/etc is ostree admin config-diff. This is something I always wanted from Unix – it can give you, at any time, all of the files you modified in /etc. This makes it easier to replicate and reproduce system state after the fact. Your OS can have a “reset all configuration” button, and revert back to a pristine /etc at any time.

There are of course other systems out there you can layer on top of dpkg/rpm to improve the situation. etckeeper is a popular one. See also this proposal for /etc/defaults in Ubuntu. etckeeper suffers from being glued on after the fact. Furthermore, you don’t really need to keep a history of the upstream defaults in git – the main feature is having a 3 way merge. And /usr/etc is a simple way to implement it – and helps OSTree preserve the “feel” of the traditional Unix. You can just vi /etc/some/configfile.ini any time you want. The config state isn’t stored in a separate partition (e.g. CoreOS mounts theirs at /media/state). But OSTree still provides atomic upgrades as the many image-based upgrade systems out there do.

OSTree in action: “rpm-ostree” and switching trees

I’ve released rpm-ostree 2014.3. Up until now, the main public consumer of OSTree has been gnome-continuous, which has operated as a live research project in CI/CD. But I very intentionally from the start split out OSTree as a potentially sharable project with other build systems (such as dpkg/rpm).

Now that rpm-ostree is a functioning demo, it will help explain the vision and technology behind OSTree much better. Before I do that, let me talk about Continuous briefly. It has two trees that it ships, the refnames are gnome-continuous/buildmaster/x86_64-runtime and gnome-continuous/buildmaster/x86_64-devel-debug. I often just use the terms -runtime and -devel-debug for short since the other bits are usually implicit. -runtime is a basic install, comparable to what one might get from a default Debian or Fedora type live CD desktop install. -devel-debug is that plus *all* of the developer tools (gcc, automake), plus developer apps (glade, anjuta).

I suspect however that nearly all of the Continuous users (GNOME developers and testers) use -devel-debug. So-runtime is there basically just to demonstrate that we can have multiple trees.

With rpm-ostree though, the set of software in Fedora is much, much larger. And it can be assembled in many, many, many different ways. rpm-ostree only captures at the moment a very tiny subset of those; these refnames look like this:

fedostree/20/x86_64/base/minimal
fedostree/rawhide/x86_64/base/minimal

Note that the single OSTree repository contains both 20 and rawhide versions of the trees. Here are some more refname examples; I’ve ommitted “rawhide” from the list for brevity:

fedostree/20/x86_64/server/docker-io
fedostree/20/x86_64/server/freeipa-server
fedostree/20/x86_64/workstation/gnome/core
fedostree/20/x86_64/workstation/gnome/default

For the actual content of these, take a look at the current products.json which generates these trees. It might also be illuminating to look at some build logs from today.

As I’ve said in a few places, this “medium” level of flexibility puts OSTree between the ultimate flexibility of packages, and the “Here is our OS, enjoy!” model of ChromeOS. Up until now, in my discussions with people about OSTree, I can often see them mentally trying trying to slot OSTree into either the “yet another package system” box, or the “yet another image update system” box. But it’s certainly not a package system (rather it complements existing ones), and it’s definitely more flexible than Chromium Autoupdate (but less efficient in some ways). For more details, see OSTree/RelatedProjects.

Okay, so we see now that OSTree allows storing multiple trees (with history). I only very recently landed this commit to OSTree to make switching between trees easier on client machines. Let’s try it; say you’ve installed the “fedostree/20/x86_64/base/minimal” ref. Now you want to change to the tree containing @standard along with docker-io. With the latest OSTree, this is just

# ostree admin switch fedostree/20/x86_64/server/docker-io

Semantically, this is somewhat like doing yum install @standard docker-io – except the transition is fully atomic. Furthermore, we get an exact replication of the tree from the build server; if for example we were transitioning from say the fedostree/20/workstation/gnome/core tree, then we wouldn’t get a mix of GNOME packages and docker-io. Our running GNOME tree would be untouched, still on disk, still running – when we reboot, then we’d be in the new server/docker-io tree. If we then did from inside the booted docker-io tree:

# ostree admin switch fedostree/20/workstation/gnome/core

Nothing would be downloaded (unless the ref was updated), and we’d be back to GNOME (again, atomically). All along this process, our changes to /etc will have a 3 way merge applied, and our data in /var will be carried along. So like image-based updates, OSTree is reliable and safe. But it has a bit of the flexibility of packages – I think OSTree could scale easily to 100 or more trees (with history!) being shipped by a vendor such as Fedora.

And of course, I’m not suggesting removing the packages as raw material – you can still compute your own filesystem trees locally. It needs some work to be made efficient – I’ll talk about that in a future post.

Cancelling computation – GCancellable (or: SIGINT versus threads versus exceptions)

From previous entries, you may have noticed that I find “cross domain” issues interesting – like ones that span programming language runtimes and operating systems simultaneously. An excellent example of this is the generic concept of “cancelling” computation.

What do I mean by cancelling computation? Let’s first define “computation”. Most people using computers are interacting with an operating system, which runs a set of processes. Here, a process is an example of computation. But in more complex software, it often happens that a single process inside has multiple related independent “tasks” (these tasks could technically be structured as threads, or coroutines inside a single thread). It’s also in some domains quite common for an application to be composed of multiple processes, where child processes communicate back with the parent. The recent Linux addition of cgroups finally give a way for the operating system kernel to logically manage a group of multiple processes as one unit.

Now that we’ve looked at different kinds of “computation” – we can say “cancelling” that computation means that the computation stops before it would have ordinarily completed, and furthermore that only that computation stops. The latter part of this definition is important – I could of course cancel my browser process(s) by pulling the battery from my phone or halting the operating system kernel, but that’s not a really useful implementation of cancellation.

To make all of this less abstract, let’s talk about a simple case – you are logged into a Unix shell, and type “sleep 30”. While that’s running, your shell will be blocked. You can press Control-C, and this results in just that “sleep” process being “cancelled” by the definition above. Under the covers, Control-C results in a SIGINT being generated for that process, and because it didn’t install a handler for SIGINT, the default action defers to having the kernel exit the process.

If you’re using systemd for your init program, the operation “systemctl stop foo” will (by default) use the Linux kernel’s cgroup mechanism to stop all processes that were ever invoked from the original foo process.

GCancellable

The above two cases are pretty simple and reliable; they’re managed by the operating system kernel. But what about the case where a single process is composed of a set of logical “tasks”? This is where things get interesting, and where this blog post starts to relate to GNOME: we’re going to talk about the GCancellable.

If you’ve used modern GNOME APIs very much, you’ve almost certainly seen the GCancellable parameter on nearly every method that involves I/O, or more generally, an operation that one might want to cancel. Let’s look at a specific example, g_input_stream_read_async. Let’s further suppose that that GInputStream is the read half of a GSocketConnection – You are writing an application which reads data from the network, such as a Bittorrent client, a web browser, or a ssh client. You want to have a “[X] Stop” button, which just stops the data copying; it should not make your entire application’s UI disappear. Sounds obvious when stated this way, but the operating system kernel doesn’t help us here – userspace code needs to be involved.

On Linux kernel based systems, a GCancellable is really just a wrapper for an eventfd(), which is just an optimized version of a standard Unix pipe() for cases where you just want notification of events on a 64 bit counter, not a general byte stream.

Under the hood, when your application calls g_input_stream_read_async(stream, …, cancellable, …), GLib will ensure your application drops into a poll() both the socket fd and the eventfd, awaiting input. Calling g_cancellable_cancel will cleanly break your application out of the poll(), and manifest as a G_IO_ERROR_CANCELLED error.

I originally wanted to have a comparison with Java, but it would make this blog entry far too long. Suffice to say it’s been an evolutionary learning experience for them, moving away from threads for everything.

The GCancellable model has been very successful for GLib, and its consuming applications such as GNOME (both by language bindings and pure C/C++ applications) – the application programmer can write reliable cancellable asynchronous operations, and under the hood GLib makes efficient use of available operating system primitives.

Montréal summit result: GSubprocess

I always enjoy the annual GNOME (Boston|Montreal) Summit. It’s small enough but has enough core GNOME contributors show up that it can be a focused and productive event. I’m happy to have finally landed GSubprocess in GLib. The venerable g_spawn_async_with_pipes only gave you raw platform-specific file descriptors which are hard to work with. GSubprocess instead leverages the standard streams in GLib that provide nice asynchronous I/O.

Furthermore, spawning processes is something that’s easy to get wrong. In particular, constructing the argument vector via strings makes it very easy to end up with shell injection if some of the arguments like filenames end up being provided via a potentially hostile attacker. Here the venerable libc system() API and its clones in poorly engineered runtimes like PHP are at fault for making it easy to do the wrong thing.

Furthermore, system() is very inflexible; if you want to do something as simple as capture the output, you have to turn to other APIs like popen() which still share the flaws. This was only recently fixed in POSIX with the addition of posix_spawn.

Now, the Python subprocess module is an example of what I consider a well-designed API for spawning child processes. You’re encouraged to give the arguments as an array, not a string. It’s possible to redirect output with pipes, and its communicate() API makes it easy to both read/write and wait for process completion.

When dealing with nontrivial subprocess code, you quite typically need to handle multiple events. This is where Python is weak, and it gets to why I think GSubprocess is interesting; because it (like all the other classes in Gio) leverages GLib’s builtin mainloop, it’s not hard to do things like spawn multiple processes, handling input and output asynchronously, as well as handling whatever other events your application needs to process, such as terminal or X/Wayland input.

The intent of GSubprocess is to be a robust base on which higher level classes can be written, depending on application need. While most of the time it’s obviously better to use shared libraries, that’s not always possible. Having GSubprocess in GLib fills an obvious gap in the stack, and I’m glad we’ve finally landed it in GLib.

ostree v2013.6 released

I’ve been working on Free Software for a long time now, on many different projects over the years. In approximate order: Emacs, a variety of things in Debian (PowerPC port, GNOME packaging, GPG verification for apt, CDBS, SELinux porting), then Rhythmbox, then many years of Fedora and Red Hat Enterprise Linux, DBus maintenance, a side stint in web services, and of course my favorite project, GNOME. Which in turn includes working on a lot of infrastructure like systemd, accountsservice, gdm, polkit. A lot of these have been very fun to work on, and mostly useful contributions.

But I think the most valuable contribution to Free Software I will have made will be my most recent project, OSTree. I’ve just released version 2013.6.

The documentation provides an overview in more depth, but briefly here: it’s a tool for parallel installation and atomic upgrades
of general-purpose Linux-kernel based operating systems, and designed to integrate well with a systemd/GNU userspace. You can use it to safely upgrade client machines over plain HTTP, and longer term, underneath package systems (but above filesystem and block storage layer; use whatever you want there).

I have been obsessed with upgrades for a long time, on and off. (Side note: interesting that KSplice does now exist). How OS upgrades work affects everything in the system architecture, from the development process to end system reliablity. It’s a problem domain that spans client machines, cloud deployments, traditional servers, and embedded devices.

Atomic and safe upgrades

Over the past few years, my interest in the domain rekindled for several reasons. One is that I happened to stumble across NixOS. While obviously there are a lot of software deployment mechanisms and such are out there, if you filter by “has atomic and safe upgrades”, the list becomes quite small and most of what’s there is quite specialized. So I studied Nix carefully; the executive summary is that while they have cool ideas, rebuilding or redownloading the entire system for a glibc security update is deeply impractical. But their approach of having a symlink farm which is the target of a bootloader entry stuck in my mind.

Also in the last few years, Chromium OS appeared with its autoupdater. The Chromium OS updater is an extremely efficient design…for their use case. But it’s hard to generalize the design to the wider world; doubling the disk space usage in every cloud image is a rather large penalty. Furthermore, the Chromium OS model doesn’t have much of a story for locally generated systems. If you want to customize the OS, you are completely unable to reuse their updates server, as it is all about deltas between fixed disk images. Again, this all makes sense if your model is that the only apps are web apps, but that’s a very fixed use case.

In short, OSTree is more efficient than Nix in a number of ways, and most importantly only handles filesystem trees; it’s not a package system. I posted plans for approaching the efficiency of the Chromium OS updater on the wire.

But, if OSTree is so cool, why isn’t it powering your package system? The simple answer is because it’s really quite deeply invasive for existing package systems like dpkg/rpm and all the others that are basically just clones of the same idea with different names. This quote from a LWN commenter sums it up:

What OSTree is proposing is somewhat unclear, but appears to require rebooting on *every single package upgrade* so as to switch into a new chroot containing that package. That means several times a day for me. Not a bloody chance am I letting something like *that* near any of my systems outside a VM: it is transparently ridiculous and optimizing for a very few programs that might need to take extra measures to avoid being broken by updates happening underneath them…

Right. On the plus side, you get atomic upgrades, and this is a tradeoff that a substantial number of people would likely take. Ultimately of course, as I replied to the commenter, it’s certainly possible to imagine carefully engineering the OS so that a certain subset of changes can be “live applied”, while still preserving atomic upgrades. Furthermore, while OSTree does not come with or force any particular independent application installation mechanism, it is designed to provide a fundamental layer for existing and new ones.

Parallel installation, the OS development process, and system quality

The core of OSTree is so simple – it’s just booting into hardlinked chroots – that it was relatively easy to enable something else besides atomic upgrades, which is easy parallel installation of operating systems. Not only does it make it easy to dual boot say a stable OS and the bleeding edge, if you have the disk space, you can thousand-boot, or more.

Why is this so critical? It’s because while package systems have a lot of flexibility, there’s one extremely important gap: The ability to try new code, and go back if it doesn’t work. This was covered in my original GUADEC presentation. Typically package-based distributions manage this by creating several different layers. Debian has stable, testing, unstable, and experimental. But if you upgrade from stable to unstable to see if suspend works for example, the package system will fight you trying to downgrade; the concept of “newer is better” is baked deeply into dpkg/rpm and everything built on top.

Being a software engineer working on an extremely complex general-purpose system like GNOME without massive development resources, let me tell you – it’s easy to break things unintentionally. Having a subset of users (but not everyone) that run the bleeding edge, like Firefox has in Nightly would be a real benefit, while also giving them a mechanism to fall back to the previous working build. And in fact, I have a separate project gnome-ostree that’s intended to be exactly that. Although it has an uninspired name, it’s better than “nightly” – it’s fully continuous, updated easily 70 times a day as git commits are made. But while it serves as an important testing base for validating the core OSTree designs in a relatively constrained scenario, it’s a separate project, and not the topic of this blog post.

OSTree underneath package systems

There are a large number of systems which fit into the model of efficiently replicating pre-constructed OS trees from a build server; many basic “client” workloads as well as cloud deployments are best delivered this way. That said, the “package” model where filesystem trees are computed dynamically on individual machines is very flexible, and some of that flexibility is entirely valid. Particularly for organizations which have invested heavily in it, it doesn’t make sense to toss out that investment; I want to support it.

While I’ve been relatively quiet about OSTree so far, I think it’s finally reached a point in implementation quality and design where I’d like to see more package system maintainers and distributions attempt to experiment with it; that’s the goal of this blog post. A quick weekend hack a while ago resulted in fedora-ostree. Since then, I worked on it a bit more this weekend, and updated it.

This is a long term effort; as the LWN commenter above said, OSTree has wildly different tradeoffs from existing package system semantics. There is a new section of the OSTree manual describing changes that many existing general-purpose distributions will have to make to adapt.

And clearly, hashing out a design where some changes can be applied live (after they are atomically set up for the next boot) would be really nice. If you’re logged into a system and want to zypper/yum/apt-get install strace, there’s no reason since that’s just a new file in /usr/bin that we can’t just make it appear right away. But as you go up from there in complexity, it gets more difficult to do without race conditions. But luckily, we have the complete source code to the operating system; and starting from a fundamental basis of reliability and safety, it is much easier to add features like speed and flexibility.

If you too share my passion for atomic upgrades, operating system upgrade engineering, continuous integration and such, then check out the git repository and join the mailing list; it’s a great time to join the project, as there are several new contributors, and it’s just fun to work on!

GNOME DevX, FOSDEM

Back now from the GNOME Developer Experience hackfest and FOSDEM. I was a bit late arriving, and entered during the perennial Python/JavaScript/vala/etc. discussion.

Developer Experience

Languages

Let me write down my current thoughts on this – they’re a bit nuanced. First, this previous entry still stands. The big picture of the architecture is correct, where a bindable subset of C is the primary interface definition. Another way to say this is that there are no plans to, for example, change GTK+ to require JavaScript.

C/GObject is just the interface definition – it’s possible to write GObject libraries in C++. For example: pango is now partially a frontend around harfbuzz. What’s not possible and will never be sanely is some sort of mechanism where one language calls into a component in another.

Let me highlight the excellent work of Jasper St. Pierre on gobject-introspection‘s documentation generation over the hackfest. We’re getting quite close to an initially usable version; this is something people have wanted for a long time.

Applications

This was an interesting discussion – lots of ideas going around. Some of it is quite very far from realization (like a complete sandbox), so the key I think is going to break it down into individidually useful steps that can be iterated on. There’s also a lot of prior art here as well, but the glick2 page explains why it’s not about just bundles – they need to integrate with the system. For example, we still want applications to use GSettings, and system administrators should be able to have a global view of application configuration, with mandatory controls, etc.

Thanks to:

BetaGroup Coworking for providing space, and:

GNOME foundation and Red Hat for sponsoring travel!

FOSDEM

Awesome, overwhelming, informative, and fun. Lots of great discussions with a wide variety of people across the FOSS world. Really looking forward to next year.

Colin Walters

I put my pants on just like the rest of you — one leg at a time. Except, once my pants are on, I make Free Software.

Author Archives: Colin Walters