Efficiency of git versus tarballs for source code transmission and storage over time

In GNOME, for various reasons (mostly historical), as part of the release process we still take our git repositories and run autoconf/automake on developer machines, and upload the result to the FTP server. One question I had today is – how many times do I as a developer need to download separate versions as a tarball before it would have been more efficient to just download the entire history as a git repository?

The answer to this obviously varies per repository. It’ll be a function of variables such as the length of the history of the module, whether or not it has large static assets (e.g. png images), etc. Let’s take a module I maintain, gobject-introspection. It has a nontrivial history, dating back to 2005, and has seen periods of peak activity, then has been fairly constant after that.

What we want to compare here is the size of tarballs to the size of the packfile that git will serve us. Here’s how we do it:

$ ls -al gobject-introspection-1.33.2.tar.xz
-rw-rw-r--. 1 walters walters 1.1M Jun  5 11:58 gobject-introspection-1.33.2.tar.xz
$ git repack -a -d
Counting objects: 18501, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3499/3499), done.
Writing objects: 100% (18501/18501), done.
Total 18501 (delta 14971), reused 18501 (delta 14971)
$ du -sh .git
7.8M    .git

This means that for gobject-introspection, if I end up downloading the source code more than 7 times in tarball form (xz compressed), it would have been more efficient to download git instead. How about gtk+, which has a significantly longer history? The answer there is 16 times (current tarball is 13M, git repository is 213M). How about gnome-documents, which has a much shorter revision control history? Just 3 times!

A naive source code storage system based on tarballs would keep all of them as separate files, so what we’ve looked at above for network transmission also would apply in that case to at-rest storage. Anyways, just some data…

On software engineering and optimization

From Bryan Cantrill’s blog:

Adding a hook of this nature requires an understanding of the degree to which the underlying code path is performance-critical. That is, to contemplate adding this hook, we needed to ask: how hot is closef(), anyway? Historically in my career as a software engineer, this kind of question would be answered with a combination of ass scratching and hand waving.

Sadly too true in my experience as well. Today when I’m reviewing a patch and a performance increase is claimed, I always ask for numbers and methodology. You’d think this would be the norm – most of the advancement in our society over the last few hundred years lies with the scientific method, but the problem is it’s just too damn easy to modify software. Why bother actually measuring when we can just make a change, find out it’s broken later, then change it again immediately?

This gets to something that’s been on my mind lately, which is that we should only try to optimize for two things: latency, and power usage. The nice thing about this is that “traditional” tradeoffs like Space-time are neatly encapsulated by power usage, because RAM, CPUs/GPUs, and hard disks consume power. Is it a good idea to cache that file in memory (parse the file once, but forces the system to retain it in RAM, at a constant power draw), or re-parse it when we need it periodically, then discard the data (more CPU draw periodically, less constant RAM draw)? If you’re optimizing for power draw, looking at representative workloads would give you the answer. Even better, power usage is specific to particular machines, which is how real-world optimization works.