There was an interesting article in the New York Times recently about drawing fuel from the air. Pretty cool. However, there’s another kind of power from the sky I want to talk about – Amazon’s Web Services, which give you unlimited computing in the “cloud”, with just a credit card.
Scaling the Mugshot polling
Recently I’ve been working on improving the way Mugshot does its polling for changes in various things like RSS feeds, Last.fm and Flickr APIs, etc. Previously, we parsed and stored RSS feeds as individual entries in our relational database. This worked fine for a while, but as we grew we started spending a lot of time reading the feeds out of the database, just to compare their content to the new version to decide when something changed. A much simpler way to do this is just to store a hash value (say SHA1), and compare it to the hash of the new version – no database involved. If the hash has changed, then we do a full parse and check for changes.
The other thing I wanted to do was easily allow for the feed polling load to be spread out across a number of machines. We don’t yet have enough feeds where we really need multiple machines, but it seems like something that would be good to be ready to parallelize. I’m calling the new system Firehose; it’s not really a very complex codebase, mostly I spent a lot of time dealing with AWS binding issues; updating the Python bindings for AWS needed to be updated for the new SQS API, and I had to toss together a lightweight Java binding for SQS too because of an obscure conflict between dependencies of typica and our version of JBoss. Anyways, that is the boring stuff.
Entering the cloud
In a very brief summary, the key Amazon services are:
- Key/value pair persistent storage (S3)
- Compute nodes (EC2)
- A messaging system (SQS)
I’m not going to drown you in buzzwords here – but it is very important to understand just how different this system is than a LAMP stack or J2EE app server cluster. First, all of these services are web scale; the key architectural point here being that by adding nodes, the service just gets faster and more reliable. However, these APIs also require a very different way of writing your application.
For example, the architecture behind Firehose is that there is a master process which simply maintains in-memory a set of pending tasks. It then dynamically allocates a set of slave servers to process the active tasks, using the EC2 API. In other words, the application is automatically provisioning new servers. It scales itself. If there is a sudden influx of new feeds to process, no intervention by me is required to allocate new servers or storage. If we delete tasks (for example, if we start an agreement with a service provider such as Last.fm to have them efficiently push changes to us rather than requiring polling), then the service scales itself down.
EC2 is pretty neat – it gives a feeling of unlimited raw power at your fingertips. I know that at any time, on demand, I could spin up any number of servers I want, have them process a lot of data, and then shut them down when I’m finished, all for a price of cents per hour. If I stop running any servers, it doesn’t cost me any money.
There are many things that fit into this dynamic batch-processing model. For example, this article by the creator of the Python bindings for EC2 shows how to batch-convert a large video collection. A use case topical for Fedora would be infinitely scalable software rebuilds; no more getting held behind kernel and openoffice rebuilds. As the large North American engineer group comes online in the morning and afternoon, the system could scale itself up, allocating more servers. At night when load is less, the system deallocates servers, saving money.
Coding for the cloud
What I find most interesting though is new APIs for writing server applications. These APIs are not all of the same complexity – EC2 is mostly just a fancy fork() that actually allocates a new OS image. Of the APIs, S3 is the most interesting. In a future blog post, I’ll talk about storage, because how you store your data has a profound effect on application architecture. Developers before now have generally coded to either the POSIX filesystem API, or relational storage. But S3 is not POSIX and it’s not relational. Would it make sense for the operating system (Linux) itself to provide a default API for storage that isn’t POSIX or relational?