Drawing power from the sky, part 2

In a previous entry, we briefly looked at Amazon’s Web Services from a high level. Now that I have my application running reasonably well and debugged, I wanted to write a bit more about my experience. I’m not claiming true expertise here, but I’ve done my best to learn what I can and hope I can pass on some of this in a comprehensible way.

It has certainly been a learning experience trying to understand the overall picture of how one writes an application on top of these APIs. As background material in this domain, I found the Google FS, Bigtable, and MapReduce papers useful, as well as the Building Scalable Websites O’Reilly book by a Flickr developer. If you don’t read anything else, at least read the MapReduce paper – every programmer should understand it. Coming from the perspective of an OS/desktop developer, I personally found the GFS and Bigtable papers the most interesting, as contrasts to the POSIX file APIs and SQL respectively.

“Everything you know is wrong.” – Marble Madness

“scalable”

One of the most important things I’ve learned is that there are two kinds of “scalable”. One is the kind of “scalable” that MySQL clustering, JBoss clustering, etc. offer. These systems take you from one machine to smaller values of N. They’re typically based on UDP broadcasts or the like.

The other kind of scale is called “web scale” – this is where your application is a completely distributed system, running in multiple data centers. No one machine is truly critical. Your application just gets faster as you add machines.

What’s the tradeoff between “scalable” and “web scale”? The answer is pretty simple. Your application has a number of facets such as consistency, availability, and tolerance of network partitions. Researchers have essentially come to the conclusion that you can’t have all 3 at the same time as load increases, and of the 3 you almost certainly want to sacrifice (immediate) consistency. If I write a new review of a book on Amazon, someone hitting Reload on the same page a few seconds later might not see it. If I receive an email in GMail, it might not yet appear in the search index. If I delete a picture from Flickr, it might still be in my “photostream” display. But one property the system can have is eventual consistency – on the order of seconds or perhaps minutes, but not immediate. My review should eventually appear to the person pressing refresh, the email will appear in the search index, and my picture will truly be gone (from the UI, anyways).

On Relational Databases

If you look at almost any modern web development framework like JBoss Seam, TurboGears, Rails, etc etc., at the heart is a relational database for storage. Using a relational database lets you effectively push most of the hard problems like persistent storage and concurrent access onto an external system (though that could be in-process using SQLite). There are some very smart people who developed SQLite, MySQL, Postgres, etc. A relational database gives you a lot for free, and unless you know what you’re doing, you should probably not attempt to store your data directly using say the POSIX file APIs (and this is true on the desktop side too, but that’s another blog entry).

In other words, for a lot of applications, relational databases are exactly the right solution. You can get very far with a relational/small cluster system; Wikimedia is an example here, though their job is obviously made much easier by the fact that reads truly dominate access.

Also worth mentioning is the approach of partitioning a relational database; though at this point you’re starting to move away from the normalized relational model, since you can no longer perform an operation over your entire data set.
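To make that concrete, here’s a toy sketch (in Java, with a hypothetical getConnectionForShard() helper standing in for real connection pooling) of hash-based partitioning; note how a query across all users now means visiting every shard:

// A toy sketch of hash-based partitioning; getConnectionForShard() is a
// hypothetical helper that would know each shard's JDBC URL.
import java.sql.Connection;

public class ShardRouter {
    private static final int NUM_SHARDS = 4;

    // All of a user's rows live on the shard their id hashes to.
    public static int shardFor(long userId) {
        return (int) (userId % NUM_SHARDS);
    }

    public Connection connectionFor(long userId) {
        // A per-user query hits one shard; a query over *all* users now
        // means visiting every shard - the whole-dataset operation is gone.
        return getConnectionForShard(shardFor(userId));
    }

    private Connection getConnectionForShard(int shard) {
        throw new UnsupportedOperationException("wire up a JDBC pool per shard");
    }
}

Real partitioning schemes are more elaborate (directory services, consistent hashing to ease resharding), but the loss of whole-dataset queries shows up even in this toy version.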

So you’ve decided to write a distributed application

It should come as no surprise that there’s no magic API to code to in order to create a web-scale system. What you need to do depends on your application. You need to have an understanding of how data flows into your system, what kinds of operations you need to perform on it, where and how you can sacrifice consistency, etc. There are also a lot of implementation/deployment-level things like load balancing and tuning that matter a lot.

However, there is a sort of general approach: asynchronous, cached data generation. In the relational model, every web page view creates a transaction which gets the data it needs, then returns; this way your users always see up-to-date information. But in the web-scale model, you can massively denormalize by precomputing data. The advantage is that a web page request goes nowhere near a database; you simply pull the cached data from storage close to the edge web server. When you get a mutation in your system, you trigger an asynchronous (e.g. map/reduce) process to regenerate all the cached data. This may not be instant, but with some intelligence in your system you should still be able to achieve eventual consistency.

Writing a bug tracker

Let’s say you’re writing a bug tracker. In a traditional web app, you’d have a form which would do a POST to send an updated comment. In the handler for that POST you open a transaction and append a new row to your Comments table. But in a distributed system, you could generate cached data for each bug. A GET request for that bug is just a simple transformation of the cached data (perhaps XSLT, or maybe you save the final HTML). But the POST request to add a comment is put into a reliable work queue (like SQS). Another machine reads from this queue and triggers a regeneration of the cached bug data – which could be more than just the cached data for that particular bug. Let’s say you want to be able to display a list of all open bugs. One way to handle this would be to store in S3 a key /open/ for each bug. Then a query of open bugs maps to a listing of all keys with that prefix. But probably what you really want for these kinds of metadata operations is a system like Hypertable or SimpleDB.
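To make the comment flow concrete, here’s a minimal sketch against hypothetical WorkQueue and CacheStore interfaces – stand-ins for SQS and S3 bindings, not any real API:

// A sketch of the flow described above, using hypothetical WorkQueue and
// CacheStore interfaces as stand-ins for SQS and S3 clients.
public class BugTracker {
    private final WorkQueue queue;   // e.g. backed by SQS
    private final CacheStore cache;  // e.g. backed by S3

    public BugTracker(WorkQueue queue, CacheStore cache) {
        this.queue = queue;
        this.cache = cache;
    }

    // GET: never touches a database; just serve the precomputed page.
    public String handleGet(String bugId) {
        return cache.get("/bugs/" + bugId + ".html");
    }

    // POST: record the mutation in a reliable queue and return immediately.
    public void handleAddComment(String bugId, String comment) {
        queue.send(bugId + "|" + comment);
    }

    // Runs on some other machine: drain the queue, regenerate cached data.
    public void workerLoop() {
        while (true) {
            String msg = queue.receive();
            String bugId = msg.substring(0, msg.indexOf('|'));
            cache.put("/bugs/" + bugId + ".html", renderBug(bugId));
            // ...plus any aggregate views, e.g. the list of open bugs.
            queue.delete(msg);
        }
    }

    private String renderBug(String bugId) {
        return "<html>...</html>";  // XSLT or template rendering goes here
    }

    // Hypothetical interfaces, not a real AWS API.
    interface WorkQueue { void send(String msg); String receive(); void delete(String msg); }
    interface CacheStore { String get(String key); void put(String key, String value); }
}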

A new kind of operating system?

If you look at Amazon’s APIs, they provide the root services you’d expect of an operating system: storage (key/value as opposed to hierarchical), IPC (reliable/persistent), and task creation (Xen image as opposed to process/thread). The challenge though is rewriting applications to use new APIs.

Storage

I mentioned in the previous entry that I thought storage was the most important of these 3. There are a lot of messaging systems out there (in particular AMQP looks pretty good), and virtualization APIs abound. But there are many fewer high-profile storage APIs. Storage is hard, and rewriting applications for a new form of storage is not at all trivial. That’s why so much is invested in POSIX-compatible storage systems – if you have that, then you instantly have Apache, rsync, and probably on the order of hundreds of thousands of other useful applications. But there are limits to the POSIX API (the Google FS paper discusses those), just as there are limits to the relational model. What is going to be interesting over the next few years is to see which of these new APIs start to win over application developers, and how the traditional database-backed free software development stacks adapt.

So you have some XML…

Recently I battled the Typica library for accessing Amazon Web Services. There was an obscure conflict between its JAXB dependency and our app server. JAXB is…well, at its root, it compiles XML schemas into mapped Java classes. This is in theory cool – you can get completion in your IDE for web services. In practice though, it just feels to me like it’s not worth the complexity.

So, I needed to quickly roll my own Java bindings for AWS. Basically we need to do an HTTP POST request, and then parse a small returned XML document. For the first part, Apache Commons HTTP is quite good. But then I was in a situation in which I’m sure many developers find themselves – you need to get the data out of a small XML document. No need for schema validation, etc.
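For reference, the HTTP side with Commons HttpClient (the 3.x API with PostMethod) is only a few lines. This is a rough sketch: request signing and most of the real SQS parameters are omitted here.

// A rough sketch of the HTTP side using Apache Commons HttpClient 3.x;
// request signing and the full SQS parameter set are omitted.
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class SqsPost {
    public static PostMethod receiveMessages(String queueUrl) throws Exception {
        HttpClient client = new HttpClient();
        PostMethod post = new PostMethod(queueUrl);
        post.addParameter("Action", "ReceiveMessage");
        post.addParameter("MaxNumberOfMessages", "10");
        // ...plus AWSAccessKeyId, Timestamp, Signature, etc.
        int status = client.executeMethod(post);
        if (status != 200)
            throw new RuntimeException("SQS request failed: " + status);
        return post;  // caller parses post.getResponseBodyAsStream()
    }
}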

There are a lot of XML libraries out there. We could use the DOM API, which is familiar because it’s fairly standardized across languages like Python, Java, and JavaScript, despite how nonnative it feels in all of them. If you’re really a masochist, you’ll try to write a state machine with SAX. Then there are the libraries which target a specific language, like JDOM for Java or ElementTree for Python. These can be nice, but they’re also not portable if you happen to move between different languages (and many developers today probably move between JavaScript and a sane language on the server side). So, there is another option which gives you a decent API that’s also largely portable: XPath with DOM.

Here’s our sample document:

<ReceiveMessageResponse>
  <ReceiveMessageResult>
    <Message>
      <MessageId>11YEJMCHE2DM483NGN40|3H4AA8J7EJKM0DQZR7E1|PT6DRTB278S4MNY77NJ0</MessageId>
      <ReceiptHandle>Z2hlcm1hbi5kZXNrdG9wLmFtYXpvbi5jb20=:AAABFoNJa/AAAAAAAAAANwAAAAAAAAAAAAAAAAAAAAQAAAEXAMPLE</ReceiptHandle>
      <MD5OfBody>acbd18db4cc2f85cedef654fccc4a4d8</MD5OfBody>
      <Body>foo</Body>
    </Message>
    <Message>
      <MessageId>0MKX1FF3JB8VWS8JAV79|3H4AA8J7EJKM0DQZR7E1|PT6DRTB278S4MNY77NJ0</MessageId>
      <ReceiptHandle>X5djmi3uoi2zZ8Vdr5TkmAQtDTwrcd9lx87=:AAABFoNJa/AAAAAAAAAANwAAAAAAAAAAAAAAAAAAAAQAAAEXAMPLE</ReceiptHandle>
      <MD5OfBody>37b51d194a7513e45b56f6524f2d51f2</MD5OfBody>
      <Body>bar</Body>
    </Message>
  </ReceiveMessageResult>
  <ResponseMetadata>
    <RequestId>b5bf2332-e983-4d3e-941a-f64c0d21f00f</RequestId>
  </ResponseMetadata>
</ReceiveMessageResponse>

And here’s the code:

// Parse the response, binding the "q" prefix to the SQS namespace.
XmlParseData parsed = XmlUtils.parseXml(post.getResponseBodyAsStream(),
        new String[] { "q", "http://queue.amazonaws.com/doc/2008-01-01/" });
// Grab all of the Message nodes in one XPath query.
NodeList msgNodes = (NodeList) parsed.xpath.evaluate(
        "/q:ReceiveMessageResponse/q:ReceiveMessageResult/q:Message",
        parsed.doc, XPathConstants.NODESET);
Message[] msgResult = new Message[msgNodes.getLength()];
for (int i = 0; i < msgNodes.getLength(); i++) {
    Node node = msgNodes.item(i);
    // Relative XPath expressions pull out the individual fields.
    msgResult[i] = new Message(parsed.xpath.evaluate("q:Body", node),
            parsed.xpath.evaluate("q:ReceiptHandle", node));
}

I think this is pretty clear; the straight DOM or SAX alternatives would have been notably more lines of code. The key part is the XPath expression /q:ReceiveMessageResponse/q:ReceiveMessageResult/q:Message, which gives us all the Message nodes.

For more about parsing XML in Java (though largely applicable to other systems that have XPath bindings), this DeveloperWorks article may be useful. Also, this is the wrapper class I used in the code sample. And yes, I am using new String[] to work around Java’s lack of hash literals.

Drawing (nearly) unlimited power from the sky

There was an interesting article in the New York Times recently about drawing fuel from the air. Pretty cool. However, there’s another kind of power from the sky I want to talk about – Amazon’s Web Services, which give you unlimited computing in the “cloud”, with just a credit card.

Scaling the Mugshot polling

Recently I’ve been working on improving the way Mugshot does its polling for changes in various things like RSS feeds, Last.fm and Flickr APIs, etc. Previously, we parsed and stored RSS feeds as individual entries in our relational database. This worked fine for a while, but as we grew we started spending a lot of time reading the feeds out of the database, just to compare their content to the new version to decide when something changed. A much simpler way to do this is just to store a hash value (say SHA1), and compare it to the hash of the new version – no database involved. If the hash has changed, then we do a full parse and check for changes.
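The check itself is only a few lines with java.security.MessageDigest; a minimal sketch:

// A minimal sketch of the hash-based change check using java.security.MessageDigest.
import java.security.MessageDigest;
import java.util.Arrays;

public class FeedChangeDetector {
    // Returns the SHA-1 digest of the raw feed bytes.
    public static byte[] hashFeed(byte[] feedBody) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(feedBody);
    }

    // Compare against the previously stored hash; only do a full parse on a mismatch.
    public static boolean feedChanged(byte[] feedBody, byte[] storedHash) throws Exception {
        return !Arrays.equals(hashFeed(feedBody), storedHash);
    }
}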

The other thing I wanted to do was easily allow for the feed polling load to be spread out across a number of machines. We don’t yet have enough feeds that we really need multiple machines, but it seems like something that would be good to be ready to parallelize. I’m calling the new system Firehose; it’s not really a very complex codebase – mostly I spent a lot of time dealing with AWS binding issues: the Python bindings for AWS needed to be updated for the new SQS API, and I had to toss together a lightweight Java binding for SQS too because of an obscure conflict between the dependencies of typica and our version of JBoss. Anyways, that is the boring stuff.

Entering the cloud

In a very brief summary, the key Amazon services are:

    Key/value pair persistent storage (S3)
    Compute nodes (EC2)
    A messaging system (SQS)

I’m not going to drown you in buzzwords here – but it is very important to understand just how different this system is from a LAMP stack or J2EE app server cluster. First, all of these services are web scale; the key architectural point is that by adding nodes, the service just gets faster and more reliable. However, these APIs also require a very different way of writing your application.

For example, the architecture behind Firehose is that there is a master process which simply maintains in-memory a set of pending tasks. It then dynamically allocates a set of slave servers to process the active tasks, using the EC2 API. In other words, the application is automatically provisioning new servers. It scales itself. If there is a sudden influx of new feeds to process, no intervention by me is required to allocate new servers or storage. If we delete tasks (for example, if we start an agreement with a service provider such as Last.fm to have them efficiently push changes to us rather than requiring polling), then the service scales itself down.
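The scaling loop at the heart of that is conceptually tiny. Here’s a sketch with a hypothetical Ec2Client wrapper and a made-up AMI name – the real work happens in the EC2 RunInstances/TerminateInstances calls underneath:

// A sketch of the master's scaling loop, against a hypothetical Ec2Client
// wrapper; "ami-firehose-slave" is an illustrative, made-up image name.
public class FirehoseMaster {
    private static final int TASKS_PER_SLAVE = 500;

    private final Ec2Client ec2 = new Ec2Client();  // hypothetical wrapper

    public void rebalance(int pendingTasks, int runningSlaves) {
        // How many slaves do we want for the current backlog (rounding up)?
        int wanted = (pendingTasks + TASKS_PER_SLAVE - 1) / TASKS_PER_SLAVE;
        if (wanted > runningSlaves) {
            ec2.runInstances("ami-firehose-slave", wanted - runningSlaves);
        } else if (wanted < runningSlaves) {
            // Scale down when tasks disappear; idle servers cost real money.
            ec2.terminateInstances(runningSlaves - wanted);
        }
    }

    static class Ec2Client {
        void runInstances(String imageId, int count) { /* EC2 RunInstances call */ }
        void terminateInstances(int count) { /* EC2 TerminateInstances call */ }
    }
}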

EC2 is pretty neat – it gives a feeling of unlimited raw power at your fingertips. I know that at any time, on demand, I could spin up any number of servers I want, have them process a lot of data, and then shut them down when I’m finished, all for a price of cents per hour. If I stop running any servers, it doesn’t cost me any money.

There are many things that fit into this dynamic batch-processing model. For example, this article by the creator of the Python bindings for EC2 shows how to batch-convert a large video collection. A use case topical for Fedora would be infinitely scalable software rebuilds; no more getting held behind kernel and openoffice rebuilds. As the large North American group of engineers comes online in the morning and afternoon, the system could scale itself up, allocating more servers. At night when load is lower, the system deallocates servers, saving money.

Coding for the cloud

What I find most interesting though is new APIs for writing server applications. These APIs are not all of the same complexity – EC2 is mostly just a fancy fork() that actually allocates a new OS image. Of the APIs, S3 is the most interesting. In a future blog post, I’ll talk about storage, because how you store your data has a profound effect on application architecture. Developers before now have generally coded to either the POSIX filesystem API, or relational storage. But S3 is not POSIX and it’s not relational. Would it make sense for the operating system (Linux) itself to provide a default API for storage that isn’t POSIX or relational?
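To underline just how un-POSIX S3 is: essentially the whole data model fits in a hypothetical interface this small (real bindings add authentication, metadata, and error handling):

// Hypothetical minimal client interface; the point is what's missing:
// no directories, no seek, no append, no rename - just whole values by key.
import java.util.List;

interface KeyValueStore {
    void put(String bucket, String key, byte[] value);
    byte[] get(String bucket, String key);
    void delete(String bucket, String key);
    List<String> listKeys(String bucket, String prefix);  // e.g. all keys under /open/
}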

Internet Everywhere with CDMA and NetworkManager

I subscribed to Verizon Mobile Broadband recently, after having paid for Wi-Fi one too many times at coffee shops. Googling for my card model (Kyocera KPC680) before I ordered it seemed to turn up a few hits, but the instructions I’d found were all about setting up PPP manually in config files in /etc and the like. Seeing as this is 2008 and not 1998, I started checking out NetworkManager’s support.


[Screenshot: NetworkManager connected to Verizon EV-DO]

It turned out my card was new enough that the kernel driver’s PCI ID list didn’t include it – but adding it to the driver’s PCI ID list and writing a .fdi file so the card was flagged as a CDMA modem got me going. Both of these changes should hopefully show up in upcoming Linux kernel and HAL releases. So far it’s been very reliable, and all I have to do is click on the NetworkManager applet and pick “Auto CDMA Dialup connection”. Pretty cool, nice work NM team!

Free Software and Linux should not be a choose-your-own-adventure game

Fairly recently, I ran out of disk space. If you’re a developer, especially one who uses things like JHBuild, virtualization, etc. – this has almost certainly happened to you. The consequences are pretty unpleasant (search for disk space in the GNOME bugzilla).

This got me to wondering – why the heck didn’t GNOME have a low disk space notification? This morning I decided to look into that. Certainly a patch would be no more than 100 lines to show a libnotify notification, along with an action to, say, empty your trash.

It turns out that I (and at least Fedora) had been the victim of two things: 1) autoconf, and 2) GNOME release policy.

What happened, of course, is that the code already existed – it was just conditional on libnotify-devel being installed at build time. From the Fedora gnome-volume-manager build log:

checking for NOTIFY… no

So the support wasn’t being compiled in, for who knows how long. I just fixed this for Rawhide, but we need to stop thinking of Linux as a source of random bits that you can put together optionally, and instead think of it as one unified stack. That means if you’re a software maintainer, your check should not be if (have_libnotify_h), but instead if (linux) { require(libnotify_h) }. GNOME release policy totally ignores the fact that the vast majority of people using GNOME have these components. I understand the desire for things to be optional – we want GNOME to work on Maemo and new devices where not all of the stack may exist. But the right approach is for these new distributions to use flags like --disable-notify. Or we can use a semi-standardized flag like --enable-maemo, since often these non-desktop devices require other source changes too.

The future of programming language runtimes

There is a saying in linguistics: “A language is a dialect with an army and a navy.” I think this applies in a way to computer programming languages too. The most widespread programming languages aren’t necessarily the ones that are “perfect” in an Esperanto sort of way, but the ones which have an army and a navy – i.e. large corporations backing them.

Fragmentation

The current fragmentation in the language runtime space is quite disheartening. It’s still best described by the 7-year-old hub paper by Havoc and Owen; I hope they won’t mind my linking to it. Examples of this fragmentation problem are still prevalent today: the difficulty of using gtkmozembed from Python, with two garbage collectors and two type systems in one process. To use the Linux inotify system from Ruby, you need a random third-party binding that may or may not be up to date. If you want to use the Eclipse rich text editor in your next OCaml program, well, you’re just screwed. The examples are almost endless – most pairs of (API X, runtime Y). Not all of those are sensible combinations, but a lot of them are.

Now, the problem with Hub I think was that it assumed you could still sensibly mesh unmanaged languages like C/C++ as they are today with managed VM-based runtimes like Python and Java. If you’ve been following the Mozilla project, you know that they are embarking on a huge effort to automatically rewrite their 2+ million lines of code to go from C++ with reference counting to C++ with garbage collection, because the problems with leaks between the two systems are too hard. (By the way, this stuff is really cool and worth watching as I’ve mentioned before).

In the GNOME community, we largely get by with systems such as PyGTK and Gtk# that do mix managed runtimes with reference-counted C, because we’re not running untrusted code from the Internet like Mozilla has to. If you forget to destroy that widget you removed but that still had signal handlers calling managed code connected to it, it’s a bug in your application that is your fault – not the fault of some college student intern at a website you visited, writing his first JavaScript application.

The technical future

This prediction has been made plenty of times for years, but I think it’s worth repeating. The future of language runtimes is pretty clearly going towards VM-based runtimes. C/C++ are dead ends for pretty much anything except a kernel. More modern AOT-compiled languages such as OCaml and Haskell have a lot to gain by switching to a VM. Concurrent, generational garbage collection isn’t trivial, and a modern VM gives you that. Being able to load code dynamically, or programmatically construct new optimized functions at runtime for specific data is very appealing. Debugging is significantly easier.

The political future

So, here we are in 2008. Is the situation much better? In some ways, yes. The Free Software community now includes Java. The popularity of truly open, community-driven languages and runtimes such as Python and Ruby seems to be rising. And finally, there is a project which is attempting to create a .NET-compatible language runtime: Mono.

Mono

The Mono project is the marriage of two major ideas. A runtime designed for multiple languages is a very unifying concept. It would solve almost all of the issues Hub was attempting to address for managed, VM-style languages. It leaves C/C++ out unless they’re manually bound, but there’s no helping that.

The second idea of Mono, though, is that it is defined as a .NET clone. This idea has proven very divisive. One company, Microsoft, is the upstream. This is not a good position for the Free Software community; we don’t have control of our technical destiny anymore. If we want to add a bytecode that makes Python run better, we have to wait for decisions to be made in some Microsoft boardroom.

If I were in charge of Mono, what I would do immediately is reach out to 3 communities that are much more clearly part of the Free Software world: Python, Ruby, and finally Mozilla Tamarin. Mono could say to these communities, “Hey, we’ll be your JIT, garbage collector, and common library runtime” – but with the freedom to adapt and change to the needs of these communities. Having Mozilla on board especially would be a coup, because it’s shipped to so many people. I certainly wouldn’t have a problem making my application depend on Firefox being installed on Windows.

The problem with separate projects like Jython, IronPython, JRuby, etc. is that they’re essentially forks. They have connections with the upstream, but that’s not the same as being the upstream. If Mono could unify the truly open Free Software languages around one common runtime, it would go a long way towards circling the wagons around our shared technical destiny.

Java

Sun has finally realized that not every single application will be written in Java, and that a least-common-denominator API across every platform doesn’t work. Even more importantly, the JDK is now Free Software, and though the JCP isn’t ideal, it at least exists.

It certainly seems plausible to me to imagine that Sun would be willing to buy into a shared runtime that wasn’t .NET, but a true cross-corporation, community effort. They’re already trying to add new bytecodes to better support other languages, but instead we could translate IKVM-style from Java 5.0 bytecodes into a new shared runtime.

5 years from now…

I really hope I’ll be writing my Python app that

  • Gets efficiently JIT-compiled
  • Can access various OS components like inotify automatically as they’re added to a common shared library
  • Can seamlessly blend in a Web-like widget that executes other languages such as JavaScript without pain
  • Is able to call into frameworks like Hibernate

Oh, and I want my jetpack.

Hotwire hypershell 0.700 alpha released

Hotwire 0.700 (mirror) is now available.

So, about 8 months ago you may remember I got really frustrated with bash for carelessly losing my history, and in my spare time, I set out to create a new developer shell. Not just a plain old Unix shell; I wanted a shell that took advantage of my computer (using a modern language runtime, and integrating with the desktop environment), rather than being limited by the “terminal emulator” concept. I’d recently tried Windows PowerShell around that time as well, and thought they had some good ideas but also made some bad design decisions. I thought it was possible to do better.


[Screenshot: looking for a firefox process]

In short, I wanted to create a “hypershell” that improved on the Unix concept of a shell, and also on the Windows PowerShell model. Hotwire 0.700 is the first release where I think that goal is, if not fully achieved, at least much closer at hand. Previous versions were more like demo toys, but this is the first one I’d call alpha – in other words, usable in many ways, though far from complete. Many developers and power users will hopefully find it a suitable replacement for a terminal+Unix shell running as the same uid on a desktop. For system administrators, Hotwire doesn’t yet have a really strong story for improving the remoting experience (well, you will probably like the included ssh client), but I’m going to be taking a close look at things like func.

The new webpage (mirror) has a longer explanation of things, screenshots, etc. The ChangeLog is also a useful read, especially if you’re upgrading from a previous version. And don’t forget to try the updated tutorial to get a hands-on introduction to the system.

NYT on SimpleDB

The New York Times Open blog takes a look at Amazon’s SimpleDB. I have to agree with the author: web-scale computing architecture is a fascinating topic. I’ve been meaning to set aside time to look at SimpleDB; it feels like the big gap in web-scale computing is standard/open APIs for structured data storage. EC2 gives you compute (and the API is fairly obvious), and S3 is your big key/value bucket, but until Amazon announced SimpleDB there were really only hacks for clustering MySQL. And that only gets you so far.

Another interesting project in this area is CouchDB.

How fast technology moves – BAN

Saw this short article about a new IEEE working group on a networking standard for devices in or on the human body. One cited use case is controlling a pacemaker from a wristwatch, which makes a lot of sense – it’d be pretty nice to be able to make small tweaks to devices like that without requiring open heart surgery.

What is kind of scary though is the security implications. The security disaster that was WEP shows us how important this stuff is to get right. Presumably a pacemaker or other critical device wouldn’t have a “turn off” function, but still…the implications sound like science fiction. Your wristwatch is pulling data from the web, gets cracked, and the next time you shake someone’s hand it tries to pass a digital virus to the devices in or on them…

Realistically though, for the foreseeable future pretty much everyone is far more likely to die in a car accident than from a digital virus. So buckle up! But when it comes around, also be sure to apply security updates for your body =)

Stuff that has just gotten a lot better

Fedora

It was fairly recently, when I installed OCaml on my Fedora 8 system with a quick yum install, that it really struck me just how far the Fedora project has come in the last few years. Long ago when I was active in Debian, one of the things I really loved about Debian was how comprehensive the package set was – if you were a fan of semi-obscure languages like OCaml, Debian was the place to be. I even gave a talk about that in 2002.

Back then in Fedora (maybe it was even in the RHL timeframe), the default OS was very well put together, but if you wanted to stray a bit outside of what was shipped on the single CD, you had to do stuff like manually download the RPMs from the OCaml website. No provision for automatic updates, etc. It kinda sucked.

Fast forward to today – Fedora is not just a well put together default desktop, but also a growing, comprehensive set of over 8000 binary packages, with active subteams concentrating on areas like awesome Java support, virtualization, and plenty more.

It really does kick ass. For sure there are still a few things we need to do, like sudo by default, but the balance of things has certainly shifted.

Firefox

It’s hard to describe how much better the Firefox 3 CVS is compared to Firefox 2. The beta 1 release notes just don’t do it justice; you have to check out reviews from Ars Technica or elsewhere. The awesomebar is something that really is hard to live without after you’ve tried it.

But what I wanted to talk about in particular in this blog entry is some of the really nice Linux-specific integration work that’s landed recently. A screenshot gives you an idea:


[Screenshot: Firefox 3 CVS as of 2007-11-30]

Note in particular the native tab look and use of GTK+ stock icons; this is part of the gnomestripe theme. Not pictured is the native look for form field widgets, along with a lot of other under-the-hood work like Cairo.

The increased focus on good platform integration is driven by talented people like Alex Faaborg and others who realize that being a cross-platform application doesn’t mean you have to look and feel exactly the same everywhere – what you have to do is blend the good parts of your application with the good parts of the platform. So Firefox 3 on Linux has the same kickass addons and other aspects that make it Firefox, while at the same time feeling much more like a part of your desktop.