tar::Builder isn’t Send

I recently started a new Rust project that generates multiple types of bootable operating system disk images from a single "pristine" image, with the goal of deduplicating storage.

At one point I decided to speed it up using rayon. Each thread takes the pristine base (read-only), does some nontrivial computation, and writes a new version derived from it. The code uses .par_iter().try_for_each(); the rayon crate handles spinning up worker threads, etc.

That all worked fine.

Later, due to some other constraints, I realized it would be better to also support writing to stdout. (This code needs to run in a container, and it’s easier to podman run --rm -i myimage --output stdout > newfile.iso than to deal with bind mounts.)

I came up with this:

enum OutputTarget<W: std::io::Write> {
    Stdout(W),
    Tar(tar::Builder<W>),
}

Basically if you’re only asking for one file, we output it directly. If you ask for multiple, we wrap them in a tarball.
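To make that rule concrete, here is a minimal sketch using only the standard library. The `Archive` variant is a hypothetical stand-in for `Tar(tar::Builder<W>)`, with a made-up "name length" header instead of the real tar format, and `for_file_count`, `write_file` and `into_inner` are illustrative names, not taken from the actual project.

```rust
use std::io::{self, Write};

/// Same shape as the enum above. `Archive` is a hypothetical stand-in for
/// `Tar(tar::Builder<W>)` so this sketch needs no external crate.
enum OutputTarget<W: Write> {
    Stdout(W),
    Archive(W),
}

impl<W: Write> OutputTarget<W> {
    /// Exactly one requested file is written raw; anything else is wrapped.
    fn for_file_count(count: usize, writer: W) -> Self {
        if count == 1 {
            OutputTarget::Stdout(writer)
        } else {
            OutputTarget::Archive(writer)
        }
    }

    /// Write one generated file to whichever target was selected.
    fn write_file(&mut self, name: &str, data: &[u8]) -> io::Result<()> {
        match self {
            // Single file: emit the bytes directly.
            OutputTarget::Stdout(w) => w.write_all(data),
            // Multiple files: prefix each with a toy header (not real tar).
            OutputTarget::Archive(w) => {
                writeln!(w, "{} {}", name, data.len())?;
                w.write_all(data)
            }
        }
    }

    /// Recover the underlying writer once all files are written.
    fn into_inner(self) -> W {
        match self {
            OutputTarget::Stdout(w) | OutputTarget::Archive(w) => w,
        }
    }
}
```

Passing `std::io::stdout().lock()` as the writer would give the container-friendly behavior described earlier; a `Vec<u8>` works the same way and is easier to inspect.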

But it didn’t compile – there was an error message about tar::Builder not implementing the Send trait, pointing at the closure being passed to the rayon try_for_each(). I’ve been using Rust long enough that I understand Send and immediately realized the problem: multiple worker threads trying to concurrently write to the same tar stream just can’t work. (And the same is true for stdout, but the compiler can’t know for sure there’s only one thread in that case.)

But I still wanted parallelism for the actual file generation. Refactoring to more cleanly split "generate files" from "output files" would have been the cleanest fix, and probably not hard.

But this project was still in the fast iteration/prototyping phase, so I decided to just wrap the OutputTarget enum in an Arc<Mutex<OutputTarget>> – and that compiled and worked fine. The worker threads still parallelize generation, then serialize output.
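The "parallel generation, serialized output" pattern can be sketched roughly as follows. This is not the project's code: it uses std::thread::scope instead of rayon so it needs no external crates, `OutputTarget` here is a simplified stand-in with a toy entry format rather than a real tar::Builder, and `generate`, `write_entry` and `build_all` are hypothetical names. Note that Mutex<T> can only be shared across threads when T itself is Send, so the sketch relies on the underlying writer (a Vec<u8>) being Send.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};
use std::thread;

/// Simplified stand-in for the post's OutputTarget; a real version would
/// hold a tar::Builder<W> in the multi-file case.
struct OutputTarget<W: Write> {
    inner: W,
}

impl<W: Write> OutputTarget<W> {
    /// Write one complete entry (toy header, data, newline).
    fn write_entry(&mut self, name: &str, data: &[u8]) -> io::Result<()> {
        writeln!(self.inner, "{} {}", name, data.len())?;
        self.inner.write_all(data)?;
        self.inner.write_all(b"\n")
    }
}

/// Placeholder for the nontrivial per-image computation.
fn generate(base: &[u8], kind: &str) -> Vec<u8> {
    let mut out = base.to_vec();
    out.extend_from_slice(kind.as_bytes());
    out
}

/// Generate every image kind on its own thread, writing each result
/// through one shared, mutex-protected output target.
fn build_all(base: &[u8], kinds: &[&str]) -> io::Result<Vec<u8>> {
    let target = Arc::new(Mutex::new(OutputTarget { inner: Vec::new() }));
    thread::scope(|s| {
        let handles: Vec<_> = kinds
            .iter()
            .map(|kind| {
                let target = Arc::clone(&target);
                s.spawn(move || -> io::Result<()> {
                    // Runs in parallel: no lock is held while generating.
                    let data = generate(base, kind);
                    // Serialized: the lock is held for the whole entry,
                    // so entries from different threads cannot interleave.
                    let mut guard = target.lock().unwrap();
                    guard.write_entry(kind, &data)
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap()?;
        }
        Ok::<(), io::Error>(())
    })?;
    // All worker clones of the Arc are gone once the scope has ended.
    let target = Arc::try_unwrap(target)
        .ok()
        .expect("no other references remain")
        .into_inner()
        .unwrap();
    Ok(target.inner)
}
```

Calling `build_all(b"base-", &["iso", "qcow2", "raw"])` returns all three entries in whatever order the threads finish, but each entry stays contiguous. The important detail is that the lock is taken only after generation completes, so the expensive part still runs concurrently.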

Other languages don’t do this

This project is one of those that honestly could have easily started in bash or Python too. Or Go. But those languages don’t have built-in concurrency protections.

Out of curiosity I just wrote a quick Python program to write to a tarfile from multiple threads. As expected, it silently generated a corrupted tarball with intermixed content. (At this point hopefully everyone knows to mostly avoid threads in Python for CPU-bound work, since the GIL keeps them from running Python code in parallel.)

And also as expected, a lightly modified example of the code from the Go archive/tar example compiles fine, and generates corrupted output. Now this is a well known problem in Go given its heavy focus on concurrent goroutines, and to be fair go run -race does correctly find errors here. But there’s a bunch of tradeoffs involved there; the race detector is only probabilistic, you have to remember to use it in your CI tests, etc.

I’m really not saying anything here that hasn’t been said before, of course. But this was my experience this week. And it’s experiences like this that remind me why I sank so much time into learning Rust and using it for new projects.

