OS APIs: Windows 8/WinRT and GNOME/GTK+

November 30, 2011

If one is making an operating system, clearly the API that application authors use is extremely important. The whole point of an operating system is to run them. What I want to talk about is fundamental APIs, or the lowest stable level.

If you have even a passing awareness of the evolution of Microsoft Windows over the last 25 years, you know there have been a lot of APIs that have appeared, been promoted and marketed to Windows developers, and then either deprecated or relegated to just an “option”. For example, MFC. However, these APIs are all wrappers. The only time the fundamental Windows API broke incompatibly (I believe) is that 16 bit applications don’t run on 64 bit; see X86-64. But the point here is that since the introduction of 32 bit Windows NT in 1993, if you coded to that API, your application will still run. If your application still runs, that means it’s the same operating system.

More recently, Microsoft for a while was promoting .NET heavily, arguably more than any of their frameworks before. It’s important to understand that at its introduction, .NET was still fundamentally a wrapper. For a while, there were rumors that it might become a fundamental API, but with the introduction of Windows 8, that won’t happen. For a really great read on this, see this ArsTechnica article. Especially fun to read are the bits about the politics, which the article only mentions in passing – you can find more scattered about the internet, like here.

Now Microsoft is saying something huge: all new APIs will be based on this new WinRT thing. Your old Win32 apps won’t break, but this time we have reason to believe they’re pretty serious – this really is the new fundamental API, and if you want your app to use new features, you will have to use WinRT. You can access WinRT from plain bog standard C, even if it’s not beautiful.

How does all of this relate to GNOME and GTK+? I think WinRT validates what we’ve been doing in GNOME with GObject Introspection. GObject may not be the most concise thing in the world (though it definitely beats C/COM), but the combination of a pure C base with added metadata and runtime support means that all of our fundamental APIs (basically the GTK+ stack, and notably GIO for non-GUI programs) remain accessible to C and also available in other runtimes and languages.

There’s a lot more to do on introspection – we desperately need a complete documentation generator for example. It’s also pretty clear to me that in order to truly succeed, we need to “downgrade” C to be a consumer of the API rather than a source, i.e. we need to do what Microsoft has done and define interfaces in an IDL. That will be interesting to do while still keeping around the old C APIs that don’t match the projected C binding.

TL;DR – I believe GNOME’s approach of using C with metadata and minimal runtime hooks as a fundamental operating system API is the right course, and we should keep doing what we’re doing.


Thoughts on being an upstream

October 9, 2011

I’ve been reading things people report in Bugzilla for years. How I feel about this now is that there are really several, entirely different things that we presently lump under “bug”. For example, I think it’s pretty clear that someone’s random ideas for a change to the design are totally different from say identified code regressions, which are in turn different from proposed patches.

Given that I am only one human and can’t respond to every bug, much less actually fix them, I’ve had to invent a prioritization mechanism. Note this is in my upstream GNOME role – I do work on Red Hat Enterprise Linux too, and there my priorities are obviously influenced by customers, management etc.

My goal in upstream is simply to make good Free Software. Thus, the prioritization I’ve settled on looks something like this:

  • Is the bug reporter an important Free Software contributor to anywhere in the stack? Then the bug is important. If they’re also an engineer or knowledgeable tester, then things are ideal because I can expect them to try things out and help me debug.
  • Is the bug a valid regression with enough data to debug it? For example, “I upgraded from GNOME 3.0 to 3.2, and application foo crashes with this stack trace…”. Then it’s important. Ideally it’s bisected – something I’d like to make easier for both developers and testers.
  • Is the bug something embarrassing? For example, badly leaking memory. Then it’s important.

Obviously, a bug could be all three of these things – e.g. a Mozilla hacker could report a regression in GNOME that causes us to leak memory. Then it’d demand a reply =) Things beyond this though get fuzzy. One thing that’s important to keep in mind is that GNOME does not have a business model that scales directly per user. Someone new using GNOME doesn’t necessarily mean there are more people working on it able to respond to, diagnose, and fix bugs. This is why as upstream, I focus so much on the things above – Free Software contributors (in some form) and unintentional regressions.

Why regressions? Well, I obviously have no obligation to someone who just happens to use and deploy my software for free – it’s not like we can e.g. work perfectly on all hardware in all situations. But it is important for anyone contributing a change to the FOSS pool to make sure that while their change may take a step forward, it’s not taking two steps backward somewhere else.

I apply that rule strongly even to myself – while I don’t personally care about OpenBSD, I took the time to diagnose a regression I introduced, following the principles above. Hopefully others feel the same!


Montreal 2011

October 8, 2011

So I’m at the GNOME Summit in Montreal, and so far it’s looking good. We have a lot of sessions lined up, and a good collection of core GNOME hackers here, along with some interested outsiders.

One thing that’s been on my mind a bit is that I’m now approaching about 10 years of contributing to GNOME. Now, GNOME is two things – the people, and the code. We have a lot of good code, and some bad code. But when I originally got involved, it was more the people who drew me in, probably most of all Havoc Pennington. He’s one of those rare people who have a grasp of a lot of issues, and can both write well-reasoned English and write a lot of good code.

What I hope to do by both action and word is pass down to the next generation of GNOME hackers that are coming in now some of those principles and ideas that have guided the project, and heavily influenced me. Havoc’s blog has a lot of good stuff – I’d start with his Free Software UI post.

If you know me, then you also know that I really care about quality in Free Software. One thing Havoc doesn’t explicitly mention is that many preferences and settings (and system state) interact with other preferences and settings. In the GNOME 2 days for example, we just never figured out how upgrades would work if you customized the panel at all. For example, if you moved the clock, what would happen if a UI redesign integrated the clock closely with another applet? There’s a huge intersection of concerns here, and in practice it was entirely possible after upgrades you’d get a half-state between the two. It was gross.

In particular, what I as a software engineer live in mortal fear of is combinatorial explosion. The number of possible states exponentially grows as options and system states change, and that kind of thing makes QA and testing near-impossible. Besides just being bad UI, it makes the entire system buggy. And to me, that’s not what GNOME is about, because I’m following in the footsteps of those who came before.
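To make that growth concrete, here is a minimal sketch in Python. The option names are hypothetical (they are not real GNOME panel settings); the point is only that the number of configurations is the product of the number of choices per option, so each extra independent on/off preference doubles what QA would have to cover.

```python
from itertools import product


def count_states(options):
    """Number of distinct configurations: the product of choices per option."""
    total = 1
    for values in options.values():
        total *= len(values)
    return total


def all_states(options):
    """Enumerate every configuration as a dict (only feasible for tiny inputs)."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*options.values())]


# Hypothetical panel preferences, purely for illustration.
panel = {
    "clock_position": ["left", "center", "right"],
    "show_date": [False, True],
    "show_seconds": [False, True],
}

# Twenty independent on/off preferences, also hypothetical.
twenty_toggles = {f"option_{i}": [False, True] for i in range(20)}
```

Three small preferences already give `count_states(panel)` = 3 × 2 × 2 configurations, and `count_states(twenty_toggles)` is 2 to the 20th power, which is far past what anyone can test by hand.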


Paying for email

September 19, 2011

I’ve recently been growing increasingly uncomfortable with my dependence on some of Google’s vast array of free (as in beer) products like Gmail. Now, I’m not really very concerned with privacy per se; what bothers me a lot more is the very unclear business model behind it. And particularly for email, which is (still for most people) the root of identity on the Internet.

From Google’s Q2 2011 earnings, we see that advertising on Google sites is 96% of its revenue. We don’t know how much of the advertising revenue comes from clicks on the search engine versus say GMail, but my guess is the GMail portion is very very small. Now I know GMail gets packaged up and sold to business as part of Apps For Your Domain, but how much income they get from that is unknown.

Under this premise that GMail is being subsidized by the core search business (which it may or may not be, I don’t know), what makes me particularly uncomfortable is that I’ve almost never clicked on a sponsored result in Google Search, and particularly not done so and paid for something. This means my email is effectively being paid for by other people. Who are these people and what are they doing? Seniors searching for online pharmacies I guess? I have no idea.

While at first I admit it was hard to resist “free”, I decided to just start paying actual money – directly – for email. By the recommendation of a few other people I chose Fastmail.fm, which is working out fairly well so far and I’m about ready to stop using GMail.

This means I’m also back to using Evolution. The only hiccup so far is that I find myself really, really missing the GMail conversation view. It’s simply a far faster and more effective way to read lots of email. Oh well, at least I have the source code to Evolution! I wonder what happened to that Summer of Code project for the conversation view…


Why I joined GNOME – fixing the clocks

August 22, 2011

Desktop Summit

Briefly – let me say it was great to talk to everyone I did at the Desktop Summit! I had a lot of fun. And thanks to Openismus for loaning us their offices for a successful introspection hackfest.

Clocks – maybe not so easy

So a little over 10 years ago, I wasn’t really a GNOME contributor. Actually at that time, I was becoming a Debian developer. I was making Debian run better on my (then PowerPC) MacBook, helping Debian/GNOME integration, and then after that moved into upstream GNOME where basically more interesting things happen than packaging (except for installers which sadly we haven’t pulled into GNOME and fixed yet, but that’s another story).

One thing I remember that really resonated with me a lot in 2001 was when Sun Microsystems contributed a Usability study to GNOME (the link there is obsolete, the study now lives here for the curious). Specifically, the section on clocks.

The usability study in general created a big collective realization for the GNOME project that we were basically just doing a lot of stupid and pointless things like having 4+ different clocks. That realization led to a big push for usability and simplicity, and you can find the results in things like Havoc’s preferences page and of course in the GNOME 2.0 release which followed.

A lot of Internet dwellers seem to interpret the cultural aversion in GNOME to excessive settings/preferences as us just being sadists, but what it really means to me is that there are real bugs to fix. And GNOME to me is really about that – people who believe in Free Software, but also making something usable and not buggy. Having a willingness to solve hard problems.

What does this have to do with clocks? Well, you see, it turns out clocks are a bit hard. During the Summit, Dave Airlie was complaining the GNOME 3 clock didn’t show the right time immediately after unsuspending, which honestly is pretty embarrassing. Another example is leap seconds. If we can’t get the clock right, we should probably just give up =)

So I fixed it with a quick hack, but in reality there are other situations which the clock display needs to be aware of – for example, if the user changes the time via the Date & Time panel. We could bodge around this in userspace, but it turned out that relatively recently the Linux kernel had gained exactly the API we need. I spent some downtime after the summit working on it, the resulting bug for the curious lives here. One of the nice things with this approach is that when we’re showing an HOUR:MINUTE clock, we no longer need to wake up the display process once a second just to see if the system clock got set from underneath us, which is useful for power saving.
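The power-saving point can be illustrated with a little arithmetic. This is a minimal sketch in Python, not the actual GNOME code: the function names are made up for illustration, and it only shows how long an HOUR:MINUTE display could sleep if something else tells it when the system clock is set. The post does not name the kernel API; timerfd's TFD_TIMER_CANCEL_ON_SET flag is one facility of that kind, but that it is the one used here is an assumption.

```python
import time


def seconds_until_next_minute(now):
    """Seconds from Unix time `now` until the displayed minute changes."""
    return 60.0 - (now % 60.0)


def clock_text(now):
    """Render `now` as an HOUR:MINUTE string (UTC, to keep the sketch simple)."""
    return time.strftime("%H:%M", time.gmtime(now))
```

With this, a display showing `clock_text(time.time())` needs one wakeup per minute, sleeping for `seconds_until_next_minute(time.time())` each time, instead of sixty. That is only safe if a clock change interrupts the sleep, which is the part the kernel has to provide.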

It’s just exactly the kind of thing that GNOME means to me – the tradeoff for us not working on supporting a combinatorial explosion of options is that we spend our time making the core good. Are there more important issues to fix than the clock? Yes, but it’s a visible one and I was offline while traveling so it was something I could easily do then. Oh and also to continue to prove Linus wrong that no one uses Linux-specific interfaces by using them in GNOME =)


The transition from “them” to “we”

August 6, 2011

While a lot of large Free Software projects do have some sort of formal “membership” structure (e.g. for GNOME there is the foundation), in reality being part of a project is more about your mindset. It’s easy for anyone to complain from the outside about something – and then the project is “they”.

I think many people, even ones who have been in the FOSS community for a long time, forget that it’s often not that hard to transition to “we”. As long as you can point to some sort of contribution (that could be any of code, marketing, documentation, art), even just periodically. By contributing even in a small way, you make a fundamental shift from “they” to “we”. From consuming to producing.

One common problem, though, among those who do propose changes is starting out with a controversial change. This is not the right way to do it – while you may succeed, it’s going to be an uphill battle. Start by contributing non-controversial things – in the mindset of many people working on a FOSS project, this builds up karma or goodwill. And it makes it far more likely the people involved in the project will listen to your concerns.

It’s pretty basic stuff really – but easy to forget apparently.


You want to be up for this

August 2, 2011

Since everyone else is doing it, let me pimp my talk! Yeah, it’s at 9am. But it’ll be worth it!

Desktop Summit 2011


Tweaking the Linux scheduler for “background” jobs

July 18, 2011

TL;DR version: If you’re building software (or some other “background” job) on your laptop, on Linux, prefix the build command with chrt --idle 0 instead of nice, e.g.: $ chrt --idle 0 make -j 12

So a while ago, a “Linux wonder patch” was making the rounds in the news. This patch happens to basically only be useful if you’re running things like make -j 12 on your laptop, which is hardly newsworthy really, and on top of that, many people (me included) think it’s a gross hack. This was discussed several times, but probably one of the best articles on it is a LWN article.

Now, in that thread, I argued that the Linux scheduler should do something better with the traditional Unix nice scheduler value, and Linus basically said that while nice wasn’t very useful, it did have some semantics that we couldn’t break.

Now as it happens, this is one of those half-baked Unix things for which Linux has a better replacement. Just like how fork is obsoleted by clone and mmap, it turns out that nice is obsoleted by sched_setscheduler.

In particular, SCHED_IDLE is quite nice, pun intended. If you’re building software (or running other “background” jobs) on your laptop (or a server even), you don’t need a gross patch in the kernel or to set up cgroups manually; just patch your build process to use chrt --idle 0 (or type it manually), and you will notice the difference. If you’re a GNOME developer and use jhbuild, this is actually done by default now.
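For scripts that launch background work themselves, the same policy switch can be made directly. The following is a minimal sketch, not what chrt or jhbuild do internally: a hypothetical helper, run_idle, that uses Python's os.sched_setscheduler (Linux only) to put a child process into SCHED_IDLE just before it executes the command.

```python
import os
import subprocess


def _enter_sched_idle():
    """Switch the calling process to SCHED_IDLE; pid 0 means "this process"."""
    # SCHED_IDLE takes no real priority, so the scheduling parameter must be 0.
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))


def run_idle(argv):
    """Run argv under SCHED_IDLE, roughly like `chrt --idle 0 argv...`.

    Returns the command's exit status. The policy is set in the child
    between fork and exec, so the calling process keeps its own policy.
    """
    return subprocess.run(argv, preexec_fn=_enter_sched_idle).returncode
```

Something like `run_idle(["make", "-j", "12"])` would then behave like the chrt invocation above. Child processes inherit the policy, so every compiler that make spawns is also scheduled as idle.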


But what are the *disadvantages*?

June 14, 2011

I’ve been an engineer for quite some time now, and one of the things that’s grown to be a powerful part of my mindset is that making things involves tradeoffs. As Wikipedia says:

If multiple options exist, engineers weigh different design choices on their merits and choose the solution that best matches the requirements. The crucial and unique task of the engineer is to identify, understand, and interpret the constraints on a design in order to produce a successful result. It is usually not enough to build a technically successful product; it must also meet further requirements.

For example, we could build indestructible bridges out of carbon fiber, but then we’d be making one bridge every 10 years[1]. Now, I don’t necessarily expect someone pitching something to me to spell out the tradeoffs; however, if they do, my trust in what they’re saying increases greatly. It tells me they aren’t just reciting a list of bullet points that were prepared by someone else, but actually know the product. While I haven’t actually tried it, I doubt many of the people at electronics stores could intelligently answer e.g. “Why wouldn’t I want this phone?”.

What inspired me to write this post is two things: in my free time I was recently looking lightly into both Unhosted and Google Go. Neither for any specific purpose; just to learn.

In particular in the case of Unhosted it was hard to get my head around how the whole thing would work, and while I think I see some potential tradeoffs, it would have saved me a lot of time to see those written down from the perspective of someone who had some experience with the system. I guess I’ll have to read the discussion archives.

[1] I know nothing about civil engineering and made this up.


Analyzing memory use with SystemTap

March 19, 2011

So last year, GLib gained support for SystemTap. I used this for a bit to analyze memory usage in GNOME Shell at the time, but for Fedora 14, we forgot to --enable-systemtap (oops!), and so shipped without the support. This is now fixed in Fedora 15, so we can “out of the box” instrument any GLib program in a variety of ways.

Now, tracing and performance analysis is an extremely complex subject, and there are a ton of different tools out there for Linux. Tracing user space in particular is still an area under active development. But what I want to talk about today is using SystemTap specifically on GLib.

SystemTap is very different from a tool like “strace” that you might use to watch a particular process. It’s a full programming language (and a fairly neat one at that), and it’s global to the system. Now, the static probes that we added to GLib give you easy access to important data from the library. Let’s look at an example.


// gmalloc_watch.stp: Print calls to g_malloc
// Usage: stap ./gmalloc_watch.stp

probe glib.mem_alloc {
  printf ("g_malloc: pid=%d n_bytes=%d\n", pid(), n_bytes);
}

Compile and run this with: $ stap -v ./gmalloc_watch.stp. What do you see?


g_malloc: pid=3598 n_bytes=104
g_malloc: pid=3598 n_bytes=68
g_malloc: pid=3598 n_bytes=16
g_malloc: pid=3598 n_bytes=40
g_malloc: pid=3598 n_bytes=40
g_malloc: pid=3598 n_bytes=1
g_malloc: pid=3598 n_bytes=104
g_malloc: pid=3598 n_bytes=104
g_malloc: pid=3598 n_bytes=68
...

All calls to g_malloc from all processes on the system, with very little overhead. This is pretty cool, and it’s just scratching the surface of what we can do. (Note: You will need to add your user to the stapusr group etc. to make the above work; for more documentation see the SystemTap web page linked above).

Okay, so what I wanted was a good way to answer the question “What’s using memory in my GLib program?”. The latest version of my SystemTap script to help answer that is glib-memtrace2.stp. Let’s dive in:

Download, and try: stap -v -c gtk-demo ./glib-memtrace2.stp. Here are some selections from the output:


$ stap -v -c gtk-demo ~/tmp/glib-memtrace2.stp
Pass 1: parsed user script and 82 library script(s) using 25328virt/16196res/2340shr kb, in 650usr/30sys/696real ms.
Pass 2: analyzed script: 21 probe(s), 5 function(s), 3 embed(s), 9 global(s) using 27328virt/18164res/3360shr kb, in 120usr/500sys/1801real ms.
Pass 3: using cached /home/walters/.systemtap/cache/49/stap_496ad3bd34b95e731521ff2d33066010_13757.c
Pass 4: using cached /home/walters/.systemtap/cache/49/stap_496ad3bd34b95e731521ff2d33066010_13757.ko
Pass 5: starting run.
// glib-memtrace2.stp; target=3703
g_slice: 483652
g_malloc: 578938
GObject GParamObject: 39
GObject GdkDisplayManager: 1
GObject GdkDisplayX11: 1
GObject GParamPointer: 5
GObject GParamDouble: 15
GObject GdkScreenX11: 1
GObject GdkVisual: 32
GObject GtkWindow: 2
# <snip lots of other GObjects>

This is after 5 seconds. What’s it telling me? The gtk-demo process allocated 578938 bytes using g_malloc() in the 5 seconds since it started up. There is also an almost equal number of bytes taken from the slice allocator. Even more interesting, I also have a dump of how many GObjects of which class it allocated. Now, 5 seconds later:


g_slice: 52
g_malloc: -84
GObject GdkPixmapImplX11: 0
GObject GdkPixmap: 0
GObject PangoLayout: 5
# <snip other GObjects>

What it’s printing now is the delta since the earlier statistics. We can see that the g_malloc heap shrank by 84 bytes. The 0 for e.g. GdkPixmap is telling me that one got allocated and freed. Basically, I can interact with apps at nearly full speed and watch in real time how that affects memory usage. Very cool!

I’ve been using this on GNOME 3, and will be checking for memory leaks for the final release. Let’s analyze some parts of the script, so you can understand not only how this script works, but how you can write SystemTap programs.

First of all, I mentioned earlier that SystemTap is global to the system (your programs become kernel modules). Because we only want to trace one process, we need to do this:


if (target() == pid())

The value of target() is set to whatever the process ID of the program we started with -c was (in the case above, remember we used -c gtk-demo).

Second, keeping track of the g_malloc heap is a little tricky; when the function is called, we are told how many bytes it’s allocating, but when the corresponding g_free is called, we don’t know how much is freed! So how did I do it? Basically we model the heap:


global g_heap[65536]
...
probe glib.mem_alloc {
  g_malloc_delta += n_bytes
  g_heap[mem] = n_bytes
}
probe glib.mem_free {
  g_malloc_delta -= g_heap[mem]
  delete g_heap[mem]
}

The g_heap variable is an associative array, mapping memory addresses of malloc “chunks” to how big they are. Here you can see a sort of limitation of SystemTap in that things will fail if the process mallocs more than 65536 chunks. These fixed limits are because SystemTap keeps memory in kernel space.
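The bookkeeping is easier to see outside of SystemTap. Here is a minimal sketch of the same idea in Python: the class and method names are invented for illustration, a dict stands in for g_heap, and a free of an address that was never recorded (for example, one allocated before tracing started) is counted as 0 bytes, which is an assumption of this sketch.

```python
class HeapModel:
    """Track net bytes allocated by remembering the size of each live chunk."""

    def __init__(self):
        self.sizes = {}  # address -> size in bytes, like g_heap[mem]
        self.delta = 0   # net bytes since the last report, like g_malloc_delta

    def on_alloc(self, addr, n_bytes):
        """Called for each allocation: the size is known at this point."""
        self.delta += n_bytes
        self.sizes[addr] = n_bytes

    def on_free(self, addr):
        """Called for each free: look the size up, since free does not say."""
        self.delta -= self.sizes.pop(addr, 0)

    def report(self):
        """Return the net change since the previous report, then reset it."""
        delta, self.delta = self.delta, 0
        return delta
```

Because the sizes outlive each report, a chunk allocated in one interval and freed in a later one shows up as a negative delta, which is how a line like "g_malloc: -84" can appear in the output above.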

Finally, we set up a timer to print out information every 5 seconds:


probe timer.sec(5) {
  printf ("g_slice: %d\n", g_slice_delta);
  g_slice_delta = 0;
  ...

Pretty easy. That’s it for now! Again for more information on SystemTap, check out the web page. For more on the GLib tapset points, see /usr/share/systemtap/tapset/glib.stp and also the gobject.stp. Thanks for reading, and happy memory leak hunting!

