In a previous entry, we briefly looked at Amazon’s Web Services from a high level. Now that I have my application running reasonably well and debugged, I wanted to write a bit more about my experience. I’m not claiming true expertise here, but I’ve done my best to learn what I can and hope I can pass on some of this in a comprehensible way.
So it has certainly been a learning experience trying to understand the overall picture of how one writes an application on top of these APIs. As background material in this domain, I found the Google FS, Bigtable, and Map/Reduce papers useful, as well as the Building Scalable Websites O’Reilly book by a Flickr developer. If you don’t read anything else, read the Map/Reduce paper at least – every programmer should understand it. Coming from the perspective of an OS/desktop developer, I personally found the GFS and Bigtable papers the most interesting, as they contrast with the POSIX file APIs and SQL, respectively.
“Everything you know is wrong.” – Marble Madness
One of the most important things I’ve learned is that there are two kinds of “scalable”. One is the kind of “scalable” that MySQL clustering, JBoss clustering, etc. offer. These systems take you from one machine to small values of N. They’re typically based on UDP broadcasts or the like.
The other kind of scale is called “web scale” – this is where your application is a completely distributed system, running in multiple data centers. No one machine is truly critical. Your application just gets faster as you add machines.
What’s the tradeoff between “scalable” and “web scale”? The answer is pretty simple. Your application has a number of facets such as consistency, availability, and tolerance of network partitions. Researchers have essentially come to the conclusion (this is Brewer’s CAP theorem) that you can’t have all three at the same time as load increases, and of the three you almost certainly want to sacrifice (immediate) consistency. If I write a new review of a book on Amazon, someone hitting Reload on the same page a few seconds later might not see it. If I receive an email in GMail, it might not yet be in the search index. If I delete a picture from Flickr, it might still be in my “photostream” display. But one property the system can have is eventual consistency: the window might be on the order of seconds or perhaps minutes, but it is not zero. My review should eventually appear to the person pressing refresh, the email will appear in the search index, and my picture will truly be gone (from the UI, anyway).
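To make the eventual-consistency idea concrete, here’s a toy sketch in Python: a write is acknowledged after landing on a single replica, and reaches the others only when a simulated asynchronous replication step runs. Every name here is invented for the illustration; this is not any real system’s API.

```python
class EventuallyConsistentStore:
    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # replication log not yet applied everywhere

    def write(self, key, value):
        # The write is acknowledged after reaching one replica only.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica):
        # A reader may hit any replica and see stale (missing) data.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Background process: apply the log to the remaining replicas.
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("review:book-123", "Five stars")
print(store.read("review:book-123", replica=2))  # None, not there yet
store.replicate()
print(store.read("review:book-123", replica=2))  # Five stars
```

The point is that the reader hitting replica 2 in between the write and the replication step simply sees the old state, and the system converges once the background work runs.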
On Relational Databases
If you look at almost any modern web development framework (JBoss Seam, TurboGears, Rails, and so on), at the heart is a relational database for storage. Using a relational database lets you effectively push most of the hard problems like persistent storage and concurrent access onto an external system (though that could be in-process using SQLite). There are some very smart people who developed SQLite, MySQL, Postgres, etc. A relational database gives you a lot for free, and unless you know what you’re doing, you should probably not attempt to store your data directly using, say, the POSIX file APIs (and this is true on the desktop side too, but that’s another blog entry).
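A minimal sketch of the “let the database do the hard work” point, using Python’s built-in sqlite3 module: persistence, atomicity, and querying come for free instead of being hand-rolled over POSIX files. The table and data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would make it persistent
conn.execute("CREATE TABLE comments (bug_id INTEGER, body TEXT)")

# The with-block is a transaction: commit on success, rollback on error.
with conn:
    conn.execute("INSERT INTO comments VALUES (?, ?)", (42, "Confirmed on trunk"))

rows = conn.execute(
    "SELECT body FROM comments WHERE bug_id = ?", (42,)
).fetchall()
print(rows)  # [('Confirmed on trunk',)]
```

Getting transactional semantics in two lines is exactly the kind of thing you give up when you walk away from a relational database, which is why it should be a deliberate decision.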
In other words, for a lot of applications, relational databases are exactly the right solution. You can get very far with a relational/small cluster system; Wikimedia is an example here, though their job is obviously made much easier by the fact that reads truly dominate access.
Also worth mentioning is the approach of partitioning (sharding) a relational database, though at this point you’re starting to move away from the normalized relational model, since you can no longer perform a single operation over your entire data set.
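A hedged sketch of what partitioning looks like: rows are routed to one of N databases by hashing the partition key. The cost mentioned above is visible directly in the code, since a whole-dataset operation now has to fan out to every shard. The shard count and functions are invented for the illustration; plain lists stand in for real database instances.

```python
import hashlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]  # stand-ins for real databases

def shard_for(user_id):
    # Stable hash so a given user always lands on the same shard.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def insert_comment(user_id, text):
    shards[shard_for(user_id)].append((user_id, text))

def comments_for_user(user_id):
    # Single-user queries touch exactly one shard...
    return [t for (u, t) in shards[shard_for(user_id)] if u == user_id]

def all_comments():
    # ...but any query over the full data set must visit all of them.
    return [row for shard in shards for row in shard]
```

In a real deployment `shards` would be connection handles to separate database servers, and rebalancing when NUM_SHARDS changes becomes its own hard problem.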
So you’ve decided to write a distributed application
It should come as no surprise that there’s no magic API to code to in order to create a web-scale system. What you need to do depends on your application. You need to have an understanding of how data flows into your system, what kinds of operations you need to perform on it, where and how you can sacrifice consistency, and so on. There are also a lot of implementation/deployment-level concerns, like load balancing and tuning, that matter a great deal too.
However, there is a sort of general approach: asynchronous, cached data generation. In the relational model, every web page view creates a transaction which gets the data it needs, then returns; this way your users always see up-to-date information. But in the web scale model, you can massively denormalize by precomputing data. The advantage is that a web page request goes nowhere near a database; you simply pull the cached data from storage close to the edge web server. When you get a mutation in your system, you trigger an asynchronous (e.g. map/reduce) process to regenerate all the cached data. This may not be instant, but with some intelligence in your system you should still be able to have eventual consistency.
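The read/write split described above can be sketched in a few lines, with a plain deque standing in for a reliable work queue (something SQS-like) and a dict standing in for edge storage. All the names are invented for the sketch.

```python
from collections import deque

comments = {}        # authoritative data: bug_id -> list of comments
cache = {}           # precomputed pages served to readers
work_queue = deque() # stand-in for a reliable, persistent queue

def handle_post(bug_id, comment):
    # The write path only records the mutation and enqueues work;
    # it never renders pages inline.
    comments.setdefault(bug_id, []).append(comment)
    work_queue.append(bug_id)

def handle_get(bug_id):
    # The read path touches only the cache, nowhere near the database.
    return cache.get(bug_id, "<p>No data yet</p>")

def worker_step():
    # In a real deployment this runs on separate machines, asynchronously.
    while work_queue:
        bug_id = work_queue.popleft()
        cache[bug_id] = "".join(f"<p>{c}</p>" for c in comments[bug_id])
```

A GET that arrives between `handle_post` and `worker_step` sees the stale cache entry, which is exactly the eventual-consistency window discussed earlier.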
Writing a bug tracker
Let’s say you’re writing a bug tracker. In a traditional web app, you’d have a form which would do a POST to send an updated comment. In the handler for that POST you open a transaction and append a new row to your Comments table. But in a distributed system, you could generate cached data for each bug. A GET request for that bug is just a simple transformation of the cached data (perhaps XSLT, or maybe you save the final HTML). But the POST request to add a comment puts it into a reliable work queue (like SQS). Another machine reads from this queue and triggers a regeneration of the cached bug data, which could cover more than just the cached data for that particular bug. Let’s say you want to be able to display a list of all open bugs. One way to handle this would be to store in S3 a key under the prefix /open/ for each open bug. Then a query of open bugs maps to a listing of all keys with that prefix. But probably what you really want for these kinds of metadata operations is a system like Hypertable or SimpleDB.
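The “marker key per open bug” idea can be sketched with a dict standing in for S3’s flat key space; S3 really does support listing keys by prefix, but the key naming and helper functions here are made up for the illustration.

```python
bucket = {}  # key -> value, pretending to be an S3 bucket

def put(key, value=b""):
    bucket[key] = value

def delete(key):
    bucket.pop(key, None)

def list_prefix(prefix):
    # Approximates S3's list-keys-with-prefix operation.
    return sorted(k for k in bucket if k.startswith(prefix))

def open_bug(bug_id):
    put(f"/bugs/{bug_id}")   # the bug data itself
    put(f"/open/{bug_id}")   # marker key so open bugs are listable

def close_bug(bug_id):
    delete(f"/open/{bug_id}")

open_bug(101)
open_bug(102)
close_bug(101)
print(list_prefix("/open/"))  # ['/open/102']
```

The weakness is also visible here: any query you didn’t plan a key prefix for requires a full scan, which is why a metadata system like SimpleDB or Hypertable is the better fit for richer queries.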
A new kind of operating system?
If you look at Amazon’s APIs, they provide the root services you’d expect of an operating system: storage (key/value as opposed to hierarchical), IPC (reliable/persistent messaging), and task creation (a Xen image as opposed to a process/thread). The challenge, though, is rewriting applications to use new APIs.
I mentioned in the previous entry that I thought storage was the most important of these three. There are a lot of messaging systems out there (in particular AMQP looks pretty good), and virtualization APIs abound. But there are many fewer high-profile storage APIs. Storage is hard, and rewriting applications for a new form of storage is not at all trivial. That’s why so much is invested in POSIX-compatible storage systems – if you have that, then you instantly have Apache, rsync, and probably on the order of hundreds of thousands of other useful applications. But there are limits to the POSIX API, just as there are limits to the relational model; the Google FS paper discusses the former. What is going to be interesting over the next few years is to see which of these new APIs start to win over application developers, and how the traditional database-based free software development stacks adapt.