Efficiency of git versus tarballs for source code transmission and storage over time

In GNOME, for various reasons (mostly historical), part of the release process is still to take our git repositories, run autoconf/automake on developer machines, and upload the result to the FTP server. One question I had today is: how many times do I, as a developer, need to download separate versions as tarballs before it would have been more efficient to just download the entire history as a git repository?

The answer to this obviously varies per repository. It’ll be a function of variables such as the length of the module’s history, whether or not it has large static assets (e.g. PNG images), etc. Let’s take a module I maintain, gobject-introspection. It has a nontrivial history dating back to 2005; it saw a period of peak activity, and development has been fairly constant since then.

What we want to compare here is the size of tarballs to the size of the packfile that git will serve us. Here’s how we do it:


$ ls -al gobject-introspection-1.33.2.tar.xz
-rw-rw-r--. 1 walters walters 1.1M Jun  5 11:58 gobject-introspection-1.33.2.tar.xz
$ git repack -a -d
Counting objects: 18501, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3499/3499), done.
Writing objects: 100% (18501/18501), done.
Total 18501 (delta 14971), reused 18501 (delta 14971)
$ du -sh .git
7.8M    .git

This means that for gobject-introspection, if I end up downloading the source code more than 7 times in tarball form (xz compressed), it would have been more efficient to just clone the git repository instead. How about gtk+, which has a significantly longer history? The answer there is about 16 times (the current tarball is 13M, the git repository is 213M). How about gnome-documents, which has a much shorter revision control history? Just 3 times!
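If you want to run the same comparison for another module, the recipe is just the commands above. Here’s a rough sketch for gtk+ (the clone URL follows the usual git.gnome.org layout, and the tarball name is a placeholder for whatever the current release is):

$ ls -lh gtk+-*.tar.xz                    # size of one xz-compressed release tarball
$ git clone git://git.gnome.org/gtk+      # full-history clone (URL is an assumption)
$ cd gtk+
$ git repack -a -d                        # repack everything into a single packfile
$ du -sh .git                             # on-disk size of the complete history

Dividing the .git size by the tarball size gives the rough break-even number of downloads.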

A naive source code storage system based on tarballs would keep all of them as separate files, so what we’ve looked at above for network transmission would also apply to at-rest storage in that case. Anyways, just some data…


10 comments

  1. I have a related question.
    How many distinct tarballs have been made available in the time span covered by the git history?

    For example, gnome-documents: in the time span covered by its git history, how many distinct release tarballs were made available, and what is the total size of that collection of tarballs?

    At the other extreme, how many gtk+ release tarballs were made available in the timespan covered by git, and what is the cumulative size of all those tarballs?

    -jef

    • I can confirm that. A shallow clone of gobject-introspection produces a 1.3M .git directory, barely larger than the tarball. The size difference there almost certainly comes from the use of xz on the tarball, versus git’s zlib compression. (See the shallow-clone sketch after the comments.)

  2. I think git would be enough **if** it had identical content to the tarball – I assume it doesn’t though. I think you are missing that the strength of tarballs lies in the fact that they are self-contained or self-hosting if you like. People like containers. But I hear you and understand where you are coming from. 🙂

    • Right, identical content is important. But we can store the exact contents (not the tarballs themselves) of our distributed tarballs in a git branch. (One possible way to do this is sketched after the comments.)

  3. If you want to compare a git pack to a series of tarballs, you should use “git repack -a -d -f” rather than just “git repack -a -d”. Adding -f ensures that git will do full delta compression using all the available objects; without -f, git will assume that the deltas used in the existing packs suffice.

  4. I just checked your numbers with gobject-introspection, and got quite different results. I cloned git://git.gnome.org/gobject-introspection, and got a 5.7M .git directory; repack produced a 5.4M .git directory.
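For reference, the shallow clone mentioned in the reply to the first comment can be reproduced with something like the following minimal sketch; --depth 1 asks the server for only the most recent commit rather than the full history:

$ git clone --depth 1 git://git.gnome.org/gobject-introspection
$ du -sh gobject-introspection/.git       # roughly tarball-sized, per the comment above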
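Similarly, one way to store the exact contents of a released tarball in a git branch, as suggested in the reply to the second comment, is an orphan branch. This is just a sketch; the branch name and the relative tarball path are placeholders:

$ git checkout --orphan dist-1.33.2       # new branch with no parent history (name is arbitrary)
$ git rm -rf .                            # clear the inherited index and working tree
$ tar -xf ../gobject-introspection-1.33.2.tar.xz --strip-components=1
$ git add -A
$ git commit -m "Exact contents of the 1.33.2 release tarball"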

