Drawing (nearly) unlimited power from the sky

There was an interesting article in the New York Times recently about drawing fuel from the air. Pretty cool. However, there’s another kind of power from the sky I want to talk about – Amazon’s Web Services, which give you unlimited computing in the “cloud”, with just a credit card.

Scaling the Mugshot polling

Recently I’ve been working on improving the way Mugshot does its polling for changes in various things like RSS feeds, Last.fm and Flickr APIs, etc. Previously, we parsed and stored RSS feeds as individual entries in our relational database. This worked fine for a while, but as we grew we started spending a lot of time reading the feeds out of the database, just to compare their content to the new version to decide when something changed. A much simpler way to do this is just to store a hash value (say SHA1), and compare it to the hash of the new version – no database involved. If the hash has changed, then we do a full parse and check for changes.

The other thing I wanted to do was easily allow for the feed polling load to be spread out across a number of machines. We don’t yet have enough feeds where we really need multiple machines, but it seems like something that would be good to be ready to parallelize. I’m calling the new system Firehose; it’s not really a very complex codebase, mostly I spent a lot of time dealing with AWS binding issues; updating the Python bindings for AWS needed to be updated for the new SQS API, and I had to toss together a lightweight Java binding for SQS too because of an obscure conflict between dependencies of typica and our version of JBoss. Anyways, that is the boring stuff.

Entering the cloud

In a very brief summary, the key Amazon services are:

    Key/value pair persistent storage (S3)
    Compute nodes (EC2)
    A messaging system (SQS)

I’m not going to drown you in buzzwords here – but it is very important to understand just how different this system is than a LAMP stack or J2EE app server cluster. First, all of these services are web scale; the key architectural point here being that by adding nodes, the service just gets faster and more reliable. However, these APIs also require a very different way of writing your application.

For example, the architecture behind Firehose is that there is a master process which simply maintains in-memory a set of pending tasks. It then dynamically allocates a set of slave servers to process the active tasks, using the EC2 API. In other words, the application is automatically provisioning new servers. It scales itself. If there is a sudden influx of new feeds to process, no intervention by me is required to allocate new servers or storage. If we delete tasks (for example, if we start an agreement with a service provider such as Last.fm to have them efficiently push changes to us rather than requiring polling), then the service scales itself down.

EC2 is pretty neat – it gives a feeling of unlimited raw power at your fingertips. I know that at any time, on demand, I could spin up any number of servers I want, have them process a lot of data, and then shut them down when I’m finished, all for a price of cents per hour. If I stop running any servers, it doesn’t cost me any money.

There are many things that fit into this dynamic batch-processing model. For example, this article by the creator of the Python bindings for EC2 shows how to batch-convert a large video collection. A use case topical for Fedora would be infinitely scalable software rebuilds; no more getting held behind kernel and openoffice rebuilds. As the large North American engineer group comes online in the morning and afternoon, the system could scale itself up, allocating more servers. At night when load is less, the system deallocates servers, saving money.

Coding for the cloud

What I find most interesting though is new APIs for writing server applications. These APIs are not all of the same complexity – EC2 is mostly just a fancy fork() that actually allocates a new OS image. Of the APIs, S3 is the most interesting. In a future blog post, I’ll talk about storage, because how you store your data has a profound effect on application architecture. Developers before now have generally coded to either the POSIX filesystem API, or relational storage. But S3 is not POSIX and it’s not relational. Would it make sense for the operating system (Linux) itself to provide a default API for storage that isn’t POSIX or relational?

Internet Everywhere with CDMA and NetworkManager

I subscribed to Verizon Mobile Broadband recently, after having paid for Wi-Fi one too many times at coffee shops. Googling for my card model (Kyocera KPC680) before I ordered it seemed to turn up a few hits, but the instructions I’d found were all about setting up PPP manually in config files in /etc and the like. Seeing as this is 2008 and not 1998, I started checking out NetworkManager‘s support.

NetworkManager connected to Verizon EV-DO

It turned out my card was new enough that the kernel driver PCI lists didn’t include it – but adding it to the driver PCI ID list, and writing a .fdi file so it was flagged as a CDMA modem got me going. Both of these changes should be showing up in Linux kernel and HAL releases hopefully. So far it’s been very reliable, and all I have to do is click on the NetworkManager applet and pick “Auto CDMA Dialup connection”. Pretty cool, nice work NM team!

Free Software and Linux should not be a choose-your-own-adventure game

Fairly recently, I ran out of disk space. If you’re a developer, especially one who uses things like JHBuild, virtualization, etc. – this has almost certainly happened to you. The consequences are pretty unpleasant (search for disk space in the GNOME bugzilla).

This got me to wondering – why the heck didn’t GNOME have a low disk space notification? This morning I decided to look into that. Certainly a patch would be not more than 100 lines to do a libnotify notification, along with an action to say empty your trash.

It turns out that I (and at least Fedora) had been victim of two things. 1) autoconf 2) GNOME release policy.

What happened of course is that the code exists – it was just conditional on libnotify-devel being installed. From the Fedora gnome-volume-manager build log:

checking for NOTIFY… no

So the the support wasn’t being compiled in, for who knows how long. I just fixed this for Rawhide, but we need to stop thinking of Linux as a source of random bits that you can put together optionally, and instead think of it as one unified stack. That means if you’re a software maintainer – your check should not be if (have_libnotify_h), but instead if (linux) { require(libnotify_h) }. GNOME release policy totally ignores the fact that the vast majority of people using GNOME have these components. I understand the desire for things to be optional – we want GNOME to work on Maemo and new devices where not all of the stack may exist. But the right approach is for these new distributions to use flags like --disable-notify. Or we can use a semi-standardized flag like --enable-maemo, since often these non-desktop devices require other source changes too.