Carl's boring blog

Here are Carl's most recent blog entries. More information about the blog is available.

A new job, but old performance fixes

Many readers have heard already, but it will be news to some that I recently changed jobs. After just short of 4 years with Red Hat, I've now taken a job working for Intel, (in its Open-source Technology Center). It was hard to leave Red Hat---I have only fond memories of working there, and I will always be grateful to Red Hat for first helping me launch a career out of working on Free Software.

Fortunately, as far as my free-software work is concerned, much of it will be unaffected by the job change. In fact, since I've been looking at X/2D/Intel driver graphics performance for the last year already, this job change should only help me do much more of that. And as far as cairo goes, I'll continue to maintain it, but I haven't been doing much feature development there lately anyway. Instead, the most important thing I feel I could do for cairo now is to continue to improve X 2D performance. And that's an explicit job requirement in my new position. So I think the job change will be neutral to positive for anyone interested in my free-software efforts.

As my first task at Intel, I took the nice HP 2510p laptop I was given on the first day, (which has i965 graphics of course), installed Linux on it, then compiled everything I needed for doing X development. I would have saved myself some pain if I had used these build instructions. I've since repeated that exercise with the instructions, and they work quite well, (though one can save some work by using distribution-provided development packages for many of the dependencies).

Also, since I want to do development with GEM, I built the drm-gem branches of the mesa, drm, and xf86-video-intel modules. That's as simple as doing "git checkout -b drm-gem origin/drm-gem" after the "git clone" of those three modules, (building the master branch of the xserver module is just fine). That seemed to build and run, so I quickly installed it as the X server I'm running regularly. I figured this would be great motivation for myself to fix any bugs I encountered---since they'd impact everything I tried to do.

Well, it didn't take long to find some performance bugs. Just switching workspaces was a rather slow experience---I could literally watch xchat repaint its window with a slow swipe. (Oddly enough, gnome-terminal and iceweasel could redraw similarly-sized windows much more quickly.) And it didn't take much investigation to find the problem since it was something I had found before, a big, blocking call to i830WaitSync in every composite operation. My old favorite, "x11perf -aa10text" was showing only 13,000 glyphs per second.

I had done some work to alleviate that before, and Dave Airlie had continued that until the call was entirely eliminated at one point. That happened on the old "intel-batchbuffer" branch of the driver. Recall that in January Eric and I had been disappointed to report that even after a recent 2x improvement, the intel-batchbuffer branch was only at 109,000 glyphs per second compared to 186,000 for XAA.

Well, that branch had about a dozen, large, unrelated changes in it, and poor Eric Anholt had been stuck with the job of cleaning them up and landing them independently to the master branch, (while also writing a new memory manager and porting the driver to it).

So here was one piece that just hadn't been finished yet. The driver was still just using a single vertex buffer that it allocates upfront---and a tiny buffer---just big enough for a single rectangle for a single composite operation. And so the driver was waiting for each composite operation to finish before reusing the buffer. And the change to GEM had made this problem even more noticeable. And Eric even had a partially-working patch to fix this---simply allocating a much larger vertex buffer and only doing the sync when wrapping around after filling it up. He had just been too busy with other things to get back to this patch. So this was one of those times when it's great to have a fresh new co-worker appear in the next cubicle asking how he could help. I took Eric's patch, broke it up into tiny pieces to test them independently, and Eric quickly found what was needed to fix it, (an explicit flush to avoid the hardware caching vertex-buffer entries that would be filled in on future composite calls).
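To make the wrap-around idea concrete, here is a toy model in Python (not the driver's actual C code). It only counts how often the CPU would have to wait for the hardware. The class and function names are made up for illustration, and the real patch also needs the explicit flush mentioned above.

```python
class VertexRing:
    """Toy model of a vertex buffer with room for `capacity` rectangles."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_slot = 0
        self.syncs = 0

    def composite(self):
        # When the buffer is full we are about to overwrite old entries,
        # so wait for the hardware to finish with them, then start over.
        if self.next_slot == self.capacity:
            self.syncs += 1
            self.next_slot = 0
        slot = self.next_slot
        self.next_slot += 1
        return slot


def count_syncs(capacity, operations):
    """Run `operations` composite calls and return the number of waits."""
    ring = VertexRing(capacity)
    for _ in range(operations):
        ring.composite()
    return ring.syncs
```

In this model, a buffer that holds one rectangle waits 999 times over 1000 composite operations, while a buffer with room for 256 rectangles waits only 3 times for the same workload.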

So, with that in place the only thing left to decide was how large of a vertex buffer to allocate upfront. And that gives me an excuse to put in a performance plot:

So the more the better, (obviously), until we get to 256 composite operations fitting into a single buffer. Then we start losing performance. So on the drm-gem branch, this takes performance from 13,000 glyphs/second to 100,000 glyphs/second for a 7.7x speedup. That's a nice improvement for a simple patch, even if the overall performance isn't astounding yet. It is at least fast enough that I can now switch workspaces without getting bored.

So I went ahead and applied these patches to the master branch as well. Interestingly, without any of the drm-gem branches, and even with the i830WaitSync call on every composite operation, things were already much better than in the GEM world. I measured 142,000 glyphs/second before my patch, and 208,000 glyphs/second after the patch. So only a 1.5x speedup there, but for the first time ever I'm actually measuring EXA text rendering that's faster than XAA text rendering. Hurrah!
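The speedup figures quoted here are plain ratios of the measured glyph rates. A quick Python check, using only the numbers from this post (the function name is just for illustration):

```python
def speedup(before, after):
    """Return how many times faster `after` is than `before`."""
    return after / before


gem_speedup = speedup(13_000, 100_000)      # drm-gem branch
master_speedup = speedup(142_000, 208_000)  # master branch

print(f"drm-gem: {gem_speedup:.1f}x, master: {master_speedup:.1f}x")
```

The ratios come out at roughly 7.7 and 1.46, which is where the "7.7x" and the rounded "1.5x" come from.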

And really, this is still just getting started. The patch I've described here is still just a bandaid. The real fix is to eliminate the upfront allocation and reuse of buffers. Instead, now that we have a real memory manager, (that's the whole point of GEM), we can allocate buffer objects as needed for vertex buffers, (and for surface state objects, etc.). That's the work I'll do next and it should let us finally see some of the benefits of GEM. Or if not, it will point out some of the remaining issues in GEM and we'll fix those right up. Either way, performance should just keep getting better and better.

Stay tuned for more from me, and look forward to faster performance from every Intel graphics driver release.

Posted Tue Jul 15 15:21:51 2008

A chain of bugs

With cairo's recent 1.6.4 release, we've hoped to reach the nirvana of applications that display and print documents with perfect fidelity. Unfortunately, reality isn't always as pleasant as we would like. I recently received a bug report that Firefox 3 (using cairo 1.6.4) resulted in a blurry mess when printing a very simple web page, (some text, a table, and an image). Exploring the details of this case reveals at least three independent problems that conspire to give the bad results.

Bug 1: Firefox+cairo uses image fallbacks for table borders

First, here's the simplest web page I was able to construct to show the problem, (nothing more than a single-cell table with a border): bug.html (122 bytes).

Using Firefox3 with cairo 1.6.4 on a Fedora9 system, I did a "print to file" and obtained the following PDF output: bug.pdf (14,465 bytes).

This output is still quite accurate and fairly usable. But we've already seen problem #1. Note that the file size has increased by a factor of 100 compared to the original HTML. The PDF does have more content, (firefox adds a header and footer for example), but nothing that explains such a large file. Instead, something about the way that firefox is expressing the table border is resulting in cairo putting fallback images into the resulting PDF file. So that's the first bug. I'll look closer at this, (probably with libcairowrap), and make a bug report to the mozilla folks if necessary.

Also, note that when cairo puts the fallback images into the PDF file it uses a "knockout group" to do so. This is a particular PDF construct that I'll discuss later.

Bug 2: Poppler+cairo expands knockout groups to full-page fallbacks

Next, we can use the poppler library, (with evince or a pdf2ps utility), to read the PDF file and use cairo to generate a PostScript file: bug.ps (138,067 bytes).

Notice that there has been another factor of 10 increase in the file size. Here, poppler has convinced cairo to generate a full-page fallback image rather than just the minimal fallback images present in the PDF file. This is due to the way poppler is handling the knockout group and really comes down to the difficulty of getting a single, desired result to pass through two systems with very different rendering models.

To explain a bit, (but ignoring many gory details), a PDF knockout group can be a very complicated thing, so poppler has some fairly sophisticated code to handle these. This support involves rendering everything in the group twice and then using cairo's DEST_OUT and ADD compositing operators to properly combine them. Well, PostScript can't do fancy compositing like DEST_OUT and ADD, so of course cairo falls back to image-based rendering for things. The irony here is that the only reason cairo is using a knockout group in the original PDF file is to prevent any compositing from happening, (the fallback image needs to replace any "native" content that might appear below it). And it turns out that painting an image without any compositing is the only kind of image painting that PostScript knows how to do.

So, cairo is using an advanced feature of PDF to describe precisely the semantic that PostScript supports natively. The change we need is to fix poppler to recognize this case and ask for the simple thing from cairo's PostScript backend so that we don't get this full-page fallback explosion.
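A simplified per-pixel model shows why the two approaches agree for an opaque fallback image. This Python sketch assumes premultiplied RGBA values in the range 0 to 1 and uses the standard Porter-Duff definitions of DEST_OUT and ADD. It is not poppler's code (which handles far more general groups), and the helper names are invented for illustration.

```python
def dest_out(src, dst):
    """DEST_OUT: keep the destination only where the source is transparent."""
    src_alpha = src[3]
    return tuple(channel * (1.0 - src_alpha) for channel in dst)


def add(src, dst):
    """ADD: sum source and destination, clamped to 1.0."""
    return tuple(min(1.0, s + d) for s, d in zip(src, dst))


def knockout_via_compositing(image_px, backdrop_px):
    """Erase the backdrop under the image, then add the image on top."""
    cleared = dest_out(image_px, backdrop_px)
    return add(image_px, cleared)


def plain_replace(image_px, backdrop_px):
    """What painting an image without compositing does: overwrite."""
    return image_px


image = (0.2, 0.4, 0.6, 1.0)     # opaque fallback-image pixel
backdrop = (0.9, 0.1, 0.3, 1.0)  # "native" content underneath
```

For an opaque image pixel both functions return the image pixel unchanged, so in this case the expensive two-operator path produces nothing that a simple overwrite would not.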

Bug 3: Cairo uses the wrong resolution for fallback images (in groups)

If it were only for those first two bugs, the intermediate file sizes would have been larger than normal, but the final result would have looked great and printed just fine. And in that case, I probably would have never even received a bug report.

But there's a third problem that is the most pernicious, because it results in the final result looking just awful. When cairo inserts the full-page fallback into the final PostScript file, it is inserting it at 300dpi, but it does that only after rendering it to an intermediate 72dpi image, which is then scaled up. That's why the final PostScript file appears so blurry and hard to read.
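To get a feel for how much detail is lost, here is a small Python calculation. The page size (US Letter, 8.5 x 11 inches) is an assumption made for the example; the post does not say which page size was used.

```python
def pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


low_res = pixels(8.5, 11, 72)    # intermediate image
high_res = pixels(8.5, 11, 300)  # what the PostScript file claims to hold

scale = 300 / 72                 # linear upscaling factor
area_factor = scale ** 2         # output pixels per rendered pixel

print(low_res, high_res, round(scale, 2), round(area_factor, 1))
```

Under that assumption the intermediate image is 612 x 792 pixels stretched to 2550 x 3300, so every rendered pixel is smeared across about 17 output pixels.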

This third problem is the first I attempted to fix, (since it involves cairo alone), and I described my attempts in several posts to the cairo mailing list over the past couple of days, beginning here:

Bug with fallback resolution of groups

In that series of posts I coded a minimal test case in cairo's test suite for the resolution problem, and a patch that fixes that test case. But when I use a patched cairo for the PDF to PostScript conversion of the file described here, I end up with the following result: bug-patched-cairo.ps.

Here, there's still a giant, full-page fallback image, (this is expected since I haven't touched poppler yet). And the image is at least rendered at the correct resolution this time, (notice that the text that appears is much more sharp than in the previous PostScript file). However, the original HTML table is now entirely missing. So there's definitely something wrong with my patch.

I'll continue to chase these bugs down. The interesting thing about this chain is that it's only as strong as its weakest link. Fixing any of the individual problems here will make the end-to-end behavior quite acceptable.

And I'll continue my quest to get high-quality display and print output from cairo-using applications. It can be a challenging goal, but it's also a lot of fun and very rewarding. Please feel free to jump in and help if you're interested.

Posted Thu May 22 12:00:55 2008

GTK+ Hackfest day #2

I have the opportunity to be attending a GTK+ Hackfest in Berlin this week. Visiting Berlin, (and sleeping about a block away from Checkpoint Charlie), is quite interesting, and a welcome new experience for me.

I promised some people back home that I would post daily updates on things that get discussed/decided here. The hackfest could only support a couple dozen people, but certainly there's a larger audience interested in the things happening here. And it would be a shame to not share things with that larger audience.

So far I've done a very poor job of actually getting those updates out. One snag was that due to some interaction of bugs in ikiwiki and/or planet, planet.gnome.org decided to post ancient posts of mine instead of recent ones. I still haven't worked out those bugs completely. In the meantime, I'm glad that some people that missed it the first time enjoyed my writeup on learning git. And for anyone wondering about my son Scott, it's been over 6 months since his first steps, and he's now walking, running, and doing everything else expected of a 2-year-old. Anyone who met him for the first time now would never know that he ever had any problem. So that's wonderful.

But getting back to the hackfest: Due to my travel schedule I missed the first day completely. I know that some transitioning-to-GTK+-3.0 plans were discussed then, and hopefully people have seen those slides. There was also a lot of talk about adding introspection to GTK+, (to enable automatic language bindings and other fancy things). The introspection discussion has continued over the second and third days, and hopefully it's going just great, (but I haven't been a part of it at all---somehow it's just one of those topics that makes my brain switch off---but I'm glad other people are interested in solving the issues).

So what I did do on Tuesday included a fantastic presentation from Behdad Esfahbod about the proposed "user font" API for cairo. This is something that's had some preliminary patches since 2006 when Kristian Høgsberg first needed this for supporting fonts embedded in PDF files. The discussion here took great advantage of having all the right people together at the same time, (though it would have been great to have Kristian here as well). We had an implementor (Behdad), a maintainer (myself), and two consumers of the API, (Benjamin "Company" Otte of swfdec fame and Alp Toker of WebKit/GTK+). That was a great group to have together to make sure the API would make everybody happy. We found several changes that will improve the API quite a bit, but also shouldn't be too much work. Behdad plans to have the changes all committed to cairo, (on a new 1.7 development branch), before the end of the week.

Afterwards, Behdad returned my code-review favor by taking a look at the patch series I have for adding support for arbitrary true-color visuals to cairo. This series is to address a number of fatal bugs that occur when targeting X servers that don't have the Render extension, (Xvnc is a common case). I knew my patch series had some ugly bits in it, but I wasn't sure what the cleanest fix would be. Behdad set me straight right away. So I'll be landing this in cairo very shortly, and this is one of the very last issues left unfinished on the cairo 1.6 roadmap.

The only other unfinished feature on the roadmap is support for pseudo-color visuals, but Behdad is trying to convince me to let that slip, (it's a lot of work for fairly small gain). Who would be really hurt to see this feature slip again? Let me know if you are, (and I'll let you know how you can help make it happen too).

Posted Thu Mar 13 16:53:20 2008

LCA 2008 Update on EXA/i965

I'm definitely overdue as far as posting an update on the progress of the work we've been doing to improve EXA performance for the i965 driver. And just yesterday, Benjamin Otte pointed out to me that it's really hard for many people to get any understanding at all about some of the work that's going on within the X.org development community.

Part of my reply to Benjamin was that there were a lot of excellent talks given at LCA this year, (Keith Packard, Dave Airlie, Adam Jackson, Jesse Barnes, Peter Hutterer, Eric Anholt, and myself were all there talking about X in one way or another). And that is true, but it's also true that many people were not able to attend LCA to hear those talks. And while the LCA conference kindly posts video of the talks, that's not always the most desirable way of getting information when not at the conference in person.

So I think it would be fair to say that we've been doing a poor job of providing easy-to-find information about what's going on with X. I definitely want to help improve that, and I even just got an official designation to do exactly that. I was recently elected to the X.org Board of Directors and also assigned to chair a Communications committee whose job it is to help X.org communicate more effectively. What can we do better? Please email me with your ideas.

In the meantime, for my own part, I've just done a fairly thorough writeup of my LCA talk. That's something I've been wanting to get in the habit of doing for a while. One thing I can't stand is reading presentation slides that are almost content free---where clearly they weren't meant to stand alone but were meant to be accompanied by someone speaking for up to an hour. And I know I've been guilty of posting slides like that before. So this time, I've written some text that should stand alone quite well, (though, since I just wrote it today it might not correlate extremely well with what I said that day at LCA---but I've tried to address the same themes at least).

Posted Wed Mar 12 09:33:26 2008

A first look at Glucose on the i965

As readers of my blog know, I've been working on improving the Intel 965 driver within the EXA acceleration architecture of the X server. Meanwhile, there's an alternate acceleration architecture originally announced by Zack Rusin in August 2006 called Glucose. The idea with Glucose is to accelerate X rendering operations by using OpenGL.

Recently there's been a fair amount of activity in the various Glucose branches, so I thought I'd take a look to see how well it's working. This was quite convenient for me as the current Glucose work is focused only on Intel cards. Since Glucose isn't quite ready for prime-time yet, it does require fetching various non-master branches of several git repositories. It's not always obvious which branches to take, so José Fonseca was kind enough to write up some Glucose build instructions.

I've followed those instructions and run a benchmark comparing Glucose and EXA. The benchmark I chose is the expedite application that comes with evas, (thanks to the people that kindly pointed out this newer benchmark to me after my recent explorations with the older evas benchmarks). To get expedite you'll need the e17/libs/evas and e17/apps/expedite modules from enlightenment CVS.

Expedite is a nice benchmark in that it separates things like image blending and text rendering into separate tests, (unlike the older evas benchmark). It's also nice that evas includes many backends which can be interesting for comparison. But I won't be looking at anything but its XRender-based backends here---and it looks like evas' cairo and OpenGL backends are not currently functional. They are disabled by default, and when I enabled them I ran into compilation problems, (I suspect neglect and bit rot).

So here are the results I got for three acceleration architectures:

  1. XAA with the XAANoOffscreenPixmaps option, (this is an all-software implementation for reference).

  2. Glucose---the new thing we're looking at here.

  3. EXA, (as available in the various glucose branches---so without things like my recent glyph-pixmaps work).

The results are all normalized to the performance of our baseline, XAA. And larger numbers are better.

The raw data is also available for anyone with particular interest, (it has non-normalized values as well as results from evas' software backends using both SDL and X11 for image transport).

The quick conclusion is that, so far, I'm not getting any benefit from running Glucose as compared to just running an all-software implementation, (see how all Glucose and XAA bars are basically identical). I might still not have everything compiled or running correctly yet, but I'm quite sure that at least some Glucose code was active during my Glucose run. That's because a glucose module failed to find a required symbol and crashed during the "Polygon Blend" test, (which is why it doesn't have a bar in the chart, nor is there a number for the overall "EVAS SPEED" result for Glucose).

Meanwhile, it's also clear that EXA is going far too slow for text operations. This isn't any surprise since I've documented problems with slow text rendering on the i965 several times before. However, I've never before measured text rendering that's quite this slow. I'm seeing speeds of only about 30k glyphs/sec. with EXA on these branches, while my previous testing always showed about 100k glyphs/sec. I suspect that there's been some regression somewhere in the X server or the Intel driver, (and likely unrelated to anything about Glucose---Keith has reported similar slowness with builds from the master branches).

Another interesting thing to look at is the difference caused by the "few" variants of the "Rect Blend" and "Rect Solid" tests. When going from the non-few to the "few" variants, both Glucose and EXA slow down significantly. I'm quite happy to see these tests in this benchmark since it is often the setup overhead that kills you when trying to accelerate a small number of operations, (and applications have a tendency to want to do that very often). Many synthetic benchmarks are far less useful because they hide this overhead by doing huge batches of operations.
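A simple cost model illustrates why small batches suffer. In this Python sketch every batch pays a fixed setup cost plus a per-operation cost. The two cost values are made-up placeholders chosen only to show the shape of the effect; they are not measurements from expedite or from any driver.

```python
def ops_per_second(batch_size, setup_cost, per_op_cost):
    """Throughput when each batch pays `setup_cost` once (costs in seconds)."""
    return batch_size / (setup_cost + batch_size * per_op_cost)


SETUP_COST = 1e-3   # hypothetical: 1 ms of setup per batch
PER_OP_COST = 1e-5  # hypothetical: 10 microseconds per operation

few = ops_per_second(1, SETUP_COST, PER_OP_COST)
many = ops_per_second(1000, SETUP_COST, PER_OP_COST)

print(f"batch of 1: {few:.0f} ops/s, batch of 1000: {many:.0f} ops/s")
```

With these placeholder costs a batch of one manages roughly 1,000 operations per second while a batch of 1000 reaches about 90,000, which is why a benchmark that only ever issues huge batches says little about the "few" case.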

Meanwhile, I'm still working on fixing the horribly slow compositing problems in the i965 driver that are keeping text so slow. Keith Packard and Eric Anholt are helping me debug my code, and hopefully we'll soon have something to show which runs at a reasonable speed.

Posted Fri Oct 19 14:47:53 2007

Git is easy to learn

For a while now I've had to listen to people tell me that git is somehow more complicated than other distributed source code management tools. A very recent example was what Robert O'Callahan said about a month ago talking about mercurial:

Best of all the underlying model seems quite simple and I think I understand it having read a few chapters of hgbook. I used git with cairo without ever really understanding what git was doing.

This really confused me because I think I understand both git and mercurial fairly well, and I just don't see big user-visible differences in the "model" that they offer---particularly for new users. The model of each is basically identical. There are low-level implementation details. And while some argue that repository formats matter, that kind of low-level detail can have no bearing on the user.

So I went and took a look at the book he referred to, (Distributed revision control with Mercurial by Bryan O'Sullivan). I read Chapter 2: A tour of Mercurial: the basics and found it to be quite well-written and a very gentle introduction. But I also noticed that almost all of the text there could apply directly to git. When I mentioned that to roc, he said, "In that case, someone should write that text for git", and he said the Git User's Manual didn't fit with him like this chapter did in terms of the order of introducing concepts and the level of detail used.

So I thought it might be an interesting project to "port" this chapter from mercurial to git. The book is freely licensed, (it's distributed under the Open Publication License which I found several problems with, but I'll save those for another day), so it would technically be possible. But I was also aware that the original author might not like the idea much. I talked to Bryan a bit about it and he admitted that my idea would make him "somewhat sad" but that I was obviously free to do what I wanted with the text within the license.

Well, I put the project off for a month because I think the feelings and desires of authors are important, in spite of what the license makes possible. And I definitely would not want to give the impression that the git community would want to leech off of the tremendous amount of hard work that Bryan has put into authoring this text.

But I also keep hearing more of this "git is hard to learn" and I felt that something needed to be done. So hopefully what I did last night won't offend Bryan too much. I have ported chapter 2 to git and the result is A tour of git: the basics.

This is not in any way an attempt to port the entire book. It's just the one chapter, and it's lost all its LaTeX goodness, (navigation, cross-referencing, PDF generation, etc.), and is just a static blog-post. I don't think it would even be a good idea technically to try to port the entire book, (even if it were a good idea socially).

Instead, I really just wanted to attempt a demonstration that similar "easy to learn" text could describe git as well as it describes mercurial. And in fact, the exercise did point out several cosmetic deficiencies in git, (which can definitely impact the new-learner experience), and I've marked several of those with XXX comments in the text. It would be great to see many of these get fixed right away.

And as a route going forward for getting more complete, high-quality documentation for git, like I said, I don't think it would be wise to try to convert an entire book like Bryan's. But it might very well be a good idea to get some organizational ideas from a book like that to guide how to put together the Git User's Manual.

(And something I should do at some point is to read/edit the Git User's Manual. To be completely honest, I've never read it).

Anyway, I hope the text I have created is at least useful for somebody. And regardless of what your favorite distributed source-control tool is, just stay away from cvs and svn!

Posted Fri Sep 28 15:48:49 2007

Running render_bench against EXA/i965

Earlier this month I attended the X Developers' Summit in Cambridge, UK (not the Cambridge near Boston, USA). We stayed at Clare College which, like all of the University of Cambridge colleges that I saw, is immaculately well-kept and quite beautiful. Just look at the gardens I walked past every day to get from my room to the conference room in the library. Kudos to the X.Org foundation for arranging such a beautiful site, (I think Daniel Stone and Matthew Garrett deserve particular thanks), and for providing travel expenses so I could attend.

Adam "ajax" Jackson was kind enough to write up some notes on my talk and the other talks as well. I haven't posted slides from the talk, but it really wasn't much more than a condensed version of exa-related blog entries I've made, (and which are linked to in Adam's writeup).

One of the things I asked for in the talk is more benchmarks for 2D rendering---in particular real-world applications with benchmarking modes and micro-benchmarks distilled from real-world applications. Vincent Torri recently reminded me that Carsten "rasterman" Haitzler wrote render_bench a long time ago precisely to measure the performance of XRender, (and to compare it to his imlib2 software).

I hadn't run render_bench since I started playing with EXA and the i965 chip, so it was definitely a worthwhile thing to do. Here are the results I got (comparing XAA and EXA both against imlib2):

All of the numbers are from the same 2.13GHz dual-core Intel machine. But the absolute numbers aren't interesting anyway. The interesting part is the huge improvement in X Render performance going from XAA to EXA for the i965 device. It goes from 2-8 times slower than imlib2 to 1.3-12.9 times faster. Anyone interested in the raw times can view the EXA log and XAA log files.

One thing that would be useful is for someone to augment the framework to also test the same drawing operations through cairo. It would be good to verify that none of the cairo software layers get in the way of this performance, (I can imagine cairo doing something like setting up and tearing down XRender Picture objects rather than reusing them, but hopefully it will perform just as well).

And I should point out that this improvement is not due to anything I've done. This is basically just an upstream xserver tree, (it might have my glyph-pixmaps changes, but they are not relevant here). So kudos to the EXA hackers I mentioned in my talk, (Keith Packard, Zack Rusin, Eric Anholt, and Michel Dänzer). I definitely need to amend my "what EXA gets right" post to add image-scaling to window-copying and solid-fills.

This also isn't with any special hacks to the xf86-video-intel source, (I'm using upstream commit 286f5df0b from Sep. 6). This benchmark clearly isn't hitting the same compositing slowness I'm seeing with glyph rendering and that might be because it's using larger images than the generally tiny images that are used for glyphs, (but I'm just guessing---I haven't looked closely).

Meanwhile, I am rewriting the driver to eliminate all the syncs and flushes when compositing to fix the glyph performance. I hope to have something worth sharing soon.

Finally, I also compared the results of evas_xrender_x11_test with evas_software_x11_test. This is similar to the original render_bench, but with a more real-world framework in place, (the evas canvas), as opposed to just a micro benchmark. Here XRender/EXA did not fare as well, scoring an evas benchmark score of 4.994 compared to the 10.418 of the software version. (Meanwhile XAA scored 4.840 but with some noticeably incorrect results---the large scaled image came out just black). The weaker performance here might very well be because the evas tests do include text which render_bench does not, (but again I'm just guessing and haven't looked closely).

Oh, and the evas snapshot I used for this test is evas-0.9.9.023. I tried to also test a newer snapshot such as evas-0.9.9.041, but it seems to not build the evas_*_test programs anymore. Perhaps they're now available separately?

Posted Thu Sep 27 12:44:50 2007

Eliminating glyph fallbacks

Sometimes things get worse before they get better.

A few days ago, I presented a patch for storing glyphs as pixmaps which improved performance, but not as dramatically as one would have hoped.

I profiled the result and found that there were still a lot of software fallbacks going on. Tracking things down, (hints: enable DEBUG_TRACE_FALL in xserver/exa/exa_priv.h and I830DEBUG in xf86-video-intel/src/i830.h), I found a simple case statement that was falling back to software for any compositing operation targeting an A8 buffer. Fortunately, it looks like this fallback was due to a limitation in older graphics cards that doesn't exist on the i965. So a very simple patch eliminates the software fallback.

So let's take a look at before-and-after profiles:

aa10text-fallbacks/ (144000 chars./sec.) symbols profile
aa10text-no-fallbacks/ (95000 chars./sec.) symbols profile

Yikes! The patch takes us from 144k chars/sec. to only 95k chars/sec. I'm regressing performance! But look again, and see that the libexa time has been cut dramatically, and the libpixman time has been eliminated altogether. That's exactly what we would hope to see for eliminating software fallbacks. So I've finally gotten this text-rendering benchmark to involve no software fallbacks. Hurrah!

Meanwhile, the intel_drv and vmlinux time have increased dramatically. Take a look at how hot those hotspots are in their profiles:

intel_drv:

samples  %        symbol name
29614    41.2170  i965_prepare_composite
26641    37.0792  I830WaitLpRing
9143     12.7253  i965_composite
1618      2.2519  I830Sync

vmlinux:

samples  %        symbol name
28775    25.3748  delay_tsc
21956    19.3616  system_call
7535      6.6446  getnstimeofday
5109      4.5053  schedule

So this is just the same, old synchronous compositing bug I identified earlier. Performance has gotten worse since I'm stressing out the driver and this bug more.
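To put rough numbers on that, here is a small Python sketch that only re-uses the figures quoted above. Grouping I830WaitLpRing and I830Sync together as "waiting" is an assumption based on their names, not something checked against the driver source.

```python
# Figures copied from the post: throughput before and after the patch, and
# the top intel_drv symbols as (samples, percent of intel_drv samples).
CHARS_PER_SEC_BEFORE = 144_000
CHARS_PER_SEC_AFTER = 95_000

INTEL_DRV = {
    "i965_prepare_composite": (29614, 41.2170),
    "I830WaitLpRing": (26641, 37.0792),
    "i965_composite": (9143, 12.7253),
    "I830Sync": (1618, 2.2519),
}

# Assumption: these two symbols represent time spent waiting on the hardware.
WAITING = ("I830WaitLpRing", "I830Sync")


def throughput_drop(before, after):
    """Fractional loss in throughput going from `before` to `after`."""
    return (before - after) / before


def implied_total_samples(samples, percent):
    """Total sample count implied by one symbol's count and percentage."""
    return samples / (percent / 100.0)


drop = throughput_drop(CHARS_PER_SEC_BEFORE, CHARS_PER_SEC_AFTER)
waiting_share = sum(INTEL_DRV[name][1] for name in WAITING)
total = implied_total_samples(*INTEL_DRV["i965_prepare_composite"])

print(f"throughput drop:        {drop:.1%}")
print(f"waiting share of driver: {waiting_share:.1f}%")
print(f"implied intel_drv total: {total:.0f} samples")
```

Under that assumption, well over a third of the driver's own samples are spent waiting rather than doing work, which is at least consistent with the synchronous compositing diagnosis.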

Dave Airlie has been doing some recent work that should let us fix that bug once and for all. Hopefully it won't be too long before I can actually post some positive progress here.

PS. I've also gotten one report that my patch for storing glyphs as Pixmaps speeds glyph rendering up initially, but after the X server has been running for about an hour or so, things get really slow. Shame on me for not doing any testing more extensive than starting the X server and then running a single client for a few minutes, (either firefox or x11perf). The report is that most of the time is disappearing into ExaOffscreenMarkUsed. Well the good news is that Dave's work eliminates that function entirely, (along with lots of migration code in EXA), so hopefully there's not any big problem to fix there. I'll have to test more thoroughly after synching up with Dave.

Posted Tue Aug 7 18:13:22 2007 Tags:
Storing glyphs as Pixmaps

A few months ago I reached the conclusion that remaining cairo performance problems were largely not in the cairo library itself, but were in the X server, its acceleration architectures, or in the X drivers for specific devices. So I started measuring with the cairo-perf suite of micro-benchmarks and I identified what appeared to be some potential problems. For example, there are OVER operations that should degenerate into simple blits but that seem to be running 2x slower than blits, (more on this later).

Before pursuing those in detail, (or after chasing a non-problem for too long), I decided to step back from micro benchmarks and instead look at some real-world tests with the Mozilla Trender suite to ensure I wasn't doing micro-optimization that wouldn't have any significant impact. It was at that time that I also switched my focus from an ATI r100, (which was just the graphics chip that happened to be in my laptop), to looking at the Intel 965 chip instead, (since Intel had donated one for me to work with).

The i965 is interesting because it's new, (ooh, shiny!), and coming from a company that actually supports the free software community by providing free software drivers. That support continues to improve: last week, Intel made technical documentation on the i965 available to myself and other Red Hat employees. (The documentation was made available under an existing Intel-Red Hat NDA which means I cannot share the documentation, but I can use the documentation to write, improve, and release free-software drivers.) I'm optimistic that Intel will be willing to set up a similar NDA with anyone interested in improving the drivers, and even better, that Intel will eventually convince itself it can share the documentation as freely as it is currently sharing its driver source code.

And actually, the work I've done in the last week hasn't strictly required the documentation at all. What has been necessary is to roll up my sleeves and get more familiar with the X server source code. I'm really grateful to Keith Packard, Eric Anholt, Dave Airlie, Kevin Martin, Michel Dänzer, Adam Jackson, Daniel Stone and others who have helped me get started here. There's really a very welcoming community of very intelligent people around the X server who are glad to help guide new people who want to help. And there's no shortage of things that can be done.

It is a large code base to get familiar with, (using "git grep" to find things helps a lot). And, being as old as it is, it does have lots of "moldy" aspects to the way it's coded, but it's not as bad as one might fear. So please, come join us if you're interested!

Guided by the problems showcased by the Mozilla test suite and the i965 driver, I decided that the most obviously underperforming operation is glyph compositing. And I also identified two underlying problems: excessive migration and synchronous compositing.

With the problems identified that concretely, I'm actually working on fixing problems now instead of just reporting them. And for this focused work, it makes sense to get back to micro-benchmarks for tracking the specific things I'm working on. So I started out with "x11perf -aa10text" to test glyph compositing performance. A more general operation than glyph compositing is image compositing, but it seems that x11perf has never acquired any Render-based image compositing benchmarks, (maybe that explains why some compositing performance regressions went unnoticed?). I did convince Keith to sit down and write some x11perf-based compositing tests, which I expect he'll push out shortly. And those tests should do a great job of highlighting the problems I seemed to see with cairo-perf where compositing with Over wasn't properly degenerating to blit performance when there is no source alpha.
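For readers who haven't met the Over operator, here is a minimal Python sketch of why it should degenerate to a blit. It assumes premultiplied-alpha pixels with floating-point channels in [0, 1]; the function name and the sample pixel values are made up for illustration and say nothing about how the server or the driver implements compositing.

```python
def over(src, dst):
    """Porter-Duff OVER for premultiplied (r, g, b, a) tuples.

    Each output channel is src + (1 - src_alpha) * dst.
    """
    src_alpha = src[3]
    return tuple(s + (1.0 - src_alpha) * d for s, d in zip(src, dst))


# With an opaque source (alpha == 1) the destination term vanishes, so the
# result is just the source: exactly what a plain copy would produce.
opaque_src = (0.2, 0.4, 0.6, 1.0)
dst = (0.9, 0.1, 0.5, 1.0)
assert over(opaque_src, dst) == opaque_src

# With a translucent source the destination has to be read and blended.
translucent_src = (0.5, 0.0, 0.0, 0.5)
blue = (0.0, 0.0, 1.0, 1.0)
print(over(translucent_src, blue))
```

So when the source has no alpha channel at all, an implementation can in principle skip the read-modify-write and issue a copy, which is why Over running 2x slower than a blit in that case looks like a missed optimization.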

In exchange, Keith convinced me to do some work to change the way glyph images are stored in the X server. Previously, glyph images have been chunks of system memory, which means they were off-limits for being used as part of any accelerated rendering. So every time a glyph was to be rendered, EXA would first copy it into a video-memory Pixmap so that it could have some hope of accelerating it. The same glyph data would get copied from system memory to video memory over and over again, (and likely overwhelm any performance advantage from doing "accelerated" compositing with the glyphs).

A fairly obvious solution is to move the canonical location for glyph data to be video-memory Pixmaps in the first place. This has a few potential problems:

  1. Glyph images are sharable across the entire server, but Pixmaps are specific to each individual "screen" within the server.

  2. The X server uses the system-memory glyph data to compare when a glyph is uploaded by a client that is identical to a glyph uploaded previously by another client, (using a simple XOR-based hash to do fewer comparisons---but always falling back to a full compare after matching the hash).

  3. Recent work that Dave Airlie, Kristian Høgsberg, and Eric Anholt have been doing may result in there being a one-to-one relationship between Pixmaps and "buffer objects". And these buffer objects require page-alignment, so their minimal size will be 4k, (which could be quite excessive for a small, 10x10 glyph).
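The cost in the third item is easy to estimate. The Python sketch below assumes an A8 glyph (one byte per pixel), no row padding, and 4096-byte pages; the function name is hypothetical and the real allocator may add further overhead.

```python
PAGE_SIZE = 4096  # bytes; the "4k" minimum mentioned above


def buffer_object_size(width, height, bytes_per_pixel=1, page_size=PAGE_SIZE):
    """Return (payload_bytes, allocated_bytes) for one page-aligned glyph.

    Assumes tightly packed rows, which real hardware may not allow.
    """
    payload = width * height * bytes_per_pixel
    pages = max(1, -(-payload // page_size))  # ceiling division, at least 1
    return payload, pages * page_size


payload, allocated = buffer_object_size(10, 10)
print(f"10x10 A8 glyph: {payload} bytes of data in {allocated} bytes")
print(f"overhead factor: {allocated / payload:.1f}x")
print(f"glyphs that would fit in one page if pooled: {PAGE_SIZE // payload}")
```

Under those assumptions a 10x10 glyph uses about 2% of its allocation, which is why pooling several glyphs into one Pixmap or buffer object is attractive.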

Another concern before any of those is whether glyphs are even worth trying to accelerate in the first place. If they are small enough, might the overhead of involving the GPU be excessive and it would be better to simply let the CPU render them, (even if that requires some read-modify-write for the compositing)? For this concern, see the window-to-window copy results I just posted in what exa gets right. That shows that EXA (GPU based) copying can be 5x faster than NoAccel (CPU based) even with regions as small as 10x10. Add compositing to that, and the GPU should be just as fast, but the CPU should be slower. So we really should be able to win, even with fairly tiny glyphs.

So, how to tackle the other technical problems. Here's what I've come up with so far:

  1. Per-screen Pixmaps: Suck it up for now. One, actually having multiple "screens" in the X server isn't common. Things like Xinerama that use one "screen" for multiple displays are much more common. So, I've written code that allocates one Pixmap per screen for every glyph. If this turns out to be a problem in practice, it would be quite trivial to create the Pixmaps lazily for all but one screen. And it would also be worthwhile, (but a much larger change), to lift the per-screen restriction for objects like Pixmaps.

  2. System-memory data for avoiding hash collisions: The goal is to move the storage from system memory to a video-memory Pixmap. We lose, (by spending excess memory), if we still have to keep a system-memory image. To fix this, I've replaced the weak XOR-based hash with a cryptographically strong hash (SHA1) that will be (probabilistically) collision free. This does introduce a new dependency of the X server on the openssl library.

  3. 4k alignment constraints for buffer objects: This is likely a very real issue, but something I'd like to address later. Presumably we can alleviate the problem by pooling multiple glyphs within a single Pixmap, (or multiple Pixmaps within a single buffer object), or whatever is necessary.
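The idea behind the second item can be sketched in a few lines of Python. Everything here is illustrative: toy_xor_hash is a deliberately simple stand-in and not the server's actual hash function, GlyphStore is a hypothetical class, and the integer ids stand in for real Pixmaps.

```python
import hashlib


def toy_xor_hash(data: bytes) -> int:
    """XOR together the 32-bit words of `data` (a toy weak hash)."""
    h = 0
    for i in range(0, len(data), 4):
        h ^= int.from_bytes(data[i:i + 4], "little")
    return h


class GlyphStore:
    """Deduplicate glyph bitmaps by SHA1 digest alone.

    No copy of the bitmap is kept, so two uploads are treated as identical
    purely because their digests match.
    """

    def __init__(self):
        self._id_by_digest = {}
        self._next_id = 1

    def add(self, bitmap: bytes) -> int:
        digest = hashlib.sha1(bitmap).digest()
        if digest not in self._id_by_digest:
            self._id_by_digest[digest] = self._next_id  # "allocate a Pixmap"
            self._next_id += 1
        return self._id_by_digest[digest]


# Two different bitmaps whose 32-bit words are merely swapped.
glyph_a = bytes([1, 2, 3, 4, 5, 6, 7, 8])
glyph_b = bytes([5, 6, 7, 8, 1, 2, 3, 4])

print(toy_xor_hash(glyph_a) == toy_xor_hash(glyph_b))  # weak hash collides
store = GlyphStore()
print(store.add(glyph_a), store.add(glyph_b), store.add(glyph_a))
```

With a weak hash, a match has to be confirmed by comparing the full bitmaps, which is what forces a system-memory copy to stick around. With a strong digest the comparison is skipped, at the price of trusting that collisions will not happen in practice.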

So, given those approaches, I've written a series of 7 patches implementing glyph storage as pixmaps.

On my i965 the patch series doesn't impact NoAccel or XAA performance considerably, but does improve EXA performance a bit. Here are the results for x11perf -aa10text and -aa24text:

glyph-pixmaps-aa10.png

glyph-pixmaps-aa24.png

Now, that was quite a bit of work, (and way too long of a blog post already), but not yet any huge performance improvement. But I think this is a good, and necessary step toward getting to fast compositing of glyphs. Here are before-and-after performance charts for the aa10text test with links to profiles:

EXA-aa10text-before/ symbols profile
EXA-aa10text-glyph-pixmaps/ symbols profile

We can see that some copying was eliminated, (note the fbBlt contribution to libfb in the before profile has disappeared completely). But there's still some migration going on somewhere, (see the exaMemCpyBox stuff as well as a bunch of software rendering happening in pixman). The assumption I'm operating on is that we should be able to eliminate migration and software rendering entirely. The hardware is very capable, very flexible and programmable, and we have all the programming documentation. So there should just be a little work here to see what's still falling back to software and eliminating it.

Then, obviously, there's still the synchronous compositing problem. I'm guessing that's where the big time spent in the kernel is coming from. So imagine half of the pixman and kernel chunks going away, along with 25% of the libexa chunk and over half of libc, (those look like the obvious hotspots from excessive migration and synchronous compositing). And then EXA text would at least catch up to XAA and NoAccel.

But if we only match the performance, we're wasting our time and should just use the NoAccel code paths in the first place. But I'm optimistic that there's still quite a bit of optimization that could happen after that. We'll see of course.

Posted Fri Aug 3 17:45:38 2007 Tags:
What EXA gets right already

I've been writing various posts about EXA for a couple of months now. And for the most part, they've been fairly negative, (showing big slowdowns compared to running an X server without acceleration at all, for example).

As I've talked to people that have read the posts, it's clear that I've managed to spread some misconceptions. So let me clear things up now:

The reason my posts have focused on negative performance aspects is that I was looking for things that could be sped up and, as is only appropriate, I looked for, found, and have been focused on the biggest performance problem with EXA I could find, (which turns out to be glyph rendering).

So, briefly here, I want to mention a couple of things that EXA is doing a fine job with. The first is the big reason why you don't want to run an X server with NoAccel: scrolling will hurt very badly. Take a look at these rates for a window-to-window copy of a rectangle of various sizes. These results are from "x11perf -copywinwinX" and multiplied by the number of pixels in each operation.

[All tests here are with very recent checkouts of xserver, mesa, and xf86-video-intel. Tests are run on an Intel Core 2 CPU @ 2.13GHz with an Intel 965 graphics card. Thanks, Intel, for the donation of hardware for this testing!]

window-copy.png

Window-to-window copy performance (Millions of pixels/sec.)
Rectangle size    10x10    100x100    500x500
NoAccel            14.2       26.5     23.475
XAA                57.8      438       587.5
EXA                77.6      464       587.5

So here we can see that EXA is from 5 to 25 times faster for scrolling windows, depending on the size. And I can assure you that you definitely don't want windows to start scrolling 25 times slower (chug, chug, chug). Meanwhile, EXA is marginally faster than XAA on this test, but not significantly.
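The "5 to 25 times" range, and the conversion between operations per second and pixels per second, can be checked with a short Python sketch. It uses only the table values above; the operations-per-second figures are back-computed from them rather than taken from the raw x11perf output.

```python
# Millions of pixels/sec. from the window-to-window copy table.
PIXELS = {"10x10": 10 * 10, "100x100": 100 * 100, "500x500": 500 * 500}
MPIX_PER_SEC = {
    "NoAccel": {"10x10": 14.2, "100x100": 26.5, "500x500": 23.475},
    "XAA": {"10x10": 57.8, "100x100": 438.0, "500x500": 587.5},
    "EXA": {"10x10": 77.6, "100x100": 464.0, "500x500": 587.5},
}


def ops_per_sec(backend, size):
    """Undo the 'multiplied by pixels per operation' step."""
    return MPIX_PER_SEC[backend][size] * 1e6 / PIXELS[size]


def speedup(backend, baseline, size):
    """How many times faster `backend` is than `baseline` at `size`."""
    return MPIX_PER_SEC[backend][size] / MPIX_PER_SEC[baseline][size]


for size in PIXELS:
    print(
        f"{size:>8}: EXA is {speedup('EXA', 'NoAccel', size):5.1f}x NoAccel, "
        f"NoAccel does {ops_per_sec('NoAccel', size):.1f} copies/sec"
    )
```

The back-computed rate for NoAccel at 500x500 comes out to about 94 copies per second, which gives a feel for how sluggish scrolling a large window would be without acceleration.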

Second, let's look at another common operation, filling solid rectangles. This is an essential step in almost any rendering, (for clearing to the background color), as well as for actually rendering some content. These results are from "x11perf -rectX", again multiplied by the number of pixels in each operation.

rectangle-fill.png

Solid rectangle fill rate (Millions of pixels/sec.)
Rectangle size    1x1    10x10    100x100    500x500
NoAccel          10.5     99.6      392        662.5
XAA               1.5     90.9      698        842.5
EXA               2.5    250       1150        847.5

Again, EXA outperforms NoAccel here (from 1.3x to 2.9x faster), for all but the tiniest of rectangles. Interestingly, EXA also outperforms XAA by up to 2.7x for the 10x10 rectangle. Also, it's quite interesting to note, (and it's hard to see on the bar chart), that NoAccel outperforms EXA (4.2x) and XAA (7x) for the case of a 1x1 rectangle. Presumably the overhead of setting up hardware rendering for a single-pixel object just plain isn't worth it, (which really shouldn't be that surprising).
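Since the 1x1 case is hard to see on the chart, a short Python sketch can pick out the fastest backend at each size. It uses only the fill-rate table above; the helper names are made up for illustration.

```python
SIZES = ("1x1", "10x10", "100x100", "500x500")

# Millions of pixels/sec. from the solid rectangle fill table.
FILL_RATE = {
    "NoAccel": (10.5, 99.6, 392.0, 662.5),
    "XAA": (1.5, 90.9, 698.0, 842.5),
    "EXA": (2.5, 250.0, 1150.0, 847.5),
}


def rate(backend, size):
    return FILL_RATE[backend][SIZES.index(size)]


def fastest(size):
    """Name of the backend with the highest fill rate at `size`."""
    return max(FILL_RATE, key=lambda backend: rate(backend, size))


def ratio(a, b, size):
    """How many times faster backend `a` is than backend `b` at `size`."""
    return rate(a, size) / rate(b, size)


for size in SIZES:
    print(f"{size:>8}: fastest is {fastest(size)}, "
          f"EXA/NoAccel = {ratio('EXA', 'NoAccel', size):.2f}")
```

The crossover sits somewhere between 1x1 and 10x10: below it the fixed cost of setting up the hardware dominates, above it the accelerated paths win.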

So those are a couple of the operations where EXA and XAA are already performing quite well. Some of you will note that Keith Packard has often joked that an X server doesn't need more acceleration than these two operations to perform well. And if you look at the whole set of operations in the XAA interface, indeed you'll find many there that modern applications won't use at all.

But meanwhile, applications are now using the Render extension more and more extensively to draw things. And this is where EXA should afford some acceleration possibilities that XAA does not. And this is also where I've been identifying several problems. If "copy" and "solid fill" are the two most fundamental operations, maybe the next two are "compositing" and "compositing glyphs". I've been talking about problems in those operations for a while, and I plan to start talking about actual solutions soon.

Stay tuned.

Posted Fri Aug 3 15:59:55 2007 Tags: