Quite a while ago I came across the SQLite testing page and was impressed (and since then it’s gotten even better). They’ve clearly invested a lot in it, and I think SQLite’s ubiquity is well deserved.
When I started the ostree project I had this in mind but…testing is hard. We have decent "unit test style" coverage since the start but that’s not very "real world". We’ve gone through a few test frameworks over the years. But to the point of this blog post: I finally had a chance to write some new testing code and I’m happy with how it turned out!
TL;DR: There’s a new "transactionality" test run on every PR that uses a mix of e.g. kill -9 ostree
and reboot -ff
while updates are running, and verifies that you either have the old or new system safely. (PRs: ostree#2048 and ostree#2127).
But along the way there were some interesting twists.
Test frameworks and rebooting
I mentioned we’d been through a few test frameworks. An important thing to me is that ostree is a distribution-independent project; it’s used by a variety of systems today. Ideally, our tests can be run in multiple frameworks used by different distributions. That works easily for our "unit tests" of course, same as it does for many other projects (make check
style tests that are nondestructive and run as non-root).
But our OSTree tests want a "real" system (usually a VM), and further the most interesting tests need to be destructive. More than that, we need to support rebooting the system under test.
I’d known about the Debian autopkgtest specification for a while, and when I was looking at testing I re-evaluated it. There are some things that are very Debian-specific (how tests are defined in the metadata), but in particular I really liked how it supports reboots.
There’s a big tension in test systems like this – is the test logic primarily run on the "system under test", or is it on some external system which manages the target via e.g. ssh
? We had lots of problems in our prior test frameworks was dealing with reboots with the latter style. Plus the latter style tends to strongly tie the test code to the test harness.
In the Fedora CoreOS group we use a system called "kola" which came from the original CoreOS project. It knows how to boot systems using Ignition in various clouds along with qemu. I added partial support for the Debian Autopkgtest specification to it (cosa#1528).
Avoiding shell script
A lot of the original ostree tests are in shell script. I keep finding myself writing shell even though I also keep being badly burned by it from time to time.
So another tangent along the way here: For writing new tests I’d resolved to use "not shell script". Python would be an obvious choice but…another large wrinkle here is that in CoreOS we don’t want interpreters in the base OS – they should run as containers (yes, a shell is obviously an interpreter too but…). So going the interpreted test route would drive us towards having our test framework run as a privileged container. I decided not to do this for a few reasons; the biggest is that makes it much harder to test the system as other processes see it.
My preferred language nowadays is Rust, and it generates static-except-libc binaries that we can just copy to the host. Further, fortuitously someone else created Rust bindings to ostree and I’d been wanting an excuse to use that for a while too! However…some things are just too verbose via API, and plus we want to test the CLI too. Invoking subprocesses via Rust std::process::Command is also very verbose. So I ended up creating a sh-inline crate for Rust that makes it ergonomic to include snippets of strict mode bash in the code. This snippet is a good example. I’d like to make this even more ergonomic too, but my proc-macro-fu isn’t there yet.
Actually writing the test
OK so all those prerequisites out of the way, the first thing I did was write the code to do the "try upgrading and while that’s running, kill -9 it". That went reasonably quickly and worked well, so I moved on to the more interesting case of adding reboot -ff
(simulating immediate power loss) as another "interrupt strategy". This excercises the whole stack through the kernel, particularly interactions with the filesystem.
However, this required completely rewriting the control flow because here the "test harness" is also being forcibly killed. We don’t want to rely on persisting our state to the disk on the system. I ended up serializing the process state into AUTOPKGTEST_REBOOT_MARK
, which gets stored in the harness and passed back when the process starts again. Effectively then the test code becomes a sort of coroutine with the harness.
Found problems
Depending on how you look at it, fortunately or unfortunately: none so far. One motivation for writing this test was to try to reproduce a bug a user filed that showed an error message from the boot loader configuration handling code. I haven’t managed to reproduce that yet. I did manually inject some faults in the code and verify that the test failed of course. And in the past I’ve of course done some manual testing to verify that ostree does what it says on the box for implementing transactional upgrades. But there’s clearly more to explore here.
Next steps
One thing I plan to explore next here is fault injection, probably with strace fault injection. This may also combine well with adding support for the harness to request explicit sleep()
calls to widen the window on possible races. Plus so far while I’ve mentioned support for other distributions, this is only testing Fedora CoreOS in its default mode; e.g. we’re only validating xfs
and not other Linux filesystems, etc.
Are we testing like SQLite yet?
Definitely not, but I’m happy that I made some progress closer to that goal! It was an interesting project and I’m looking forward to building more of it per above. Outside of OSTree, the goal of this blog was write down some of the "lessons learned" for others working in this space. For example, I hope some people working in the Linux-based OS testing space look at the Debian autopkgtest; it can be hard to come to consensus on test frameworks and standards, but there are at least some good ideas there. Also I think the mix of "Rust with some inline shell script" worked pretty well for these types of tests; particularly if the CLI outputs JSON, deserializing with Serde is great. Though taking the Rust compile time hit for tests is a downside.
But in the end, I can at least now say that every pull request to OSTree runs through a test suite that ensures it survives being forcibly terminated while an update is running. The integrity of your root filesystem is very important to me – it should be robust and image-like, but still a Linux system in the end. If this sounds good to you, I hope you check out one of the distributions that use it!
Dang, SQLite’s testing and release checklist is insane. 640 times more test code and scripts than program code! That’s an incredible level of quality assurance to aspire to reach. I wonder if any other project is more tested.