Why GPUs are not (so) good at post processing images

Now I know the title of this post may sound controversial, but keep on reading; you might even end up agreeing with me.
Post-processing effects are often implemented via convolutions, reductions and more sophisticated filters such as bilateral filters. What is interesting here is that the vast majority of these techniques have something in common: the data needed to work on a pixel is often needed to work on its neighboring pixels as well. This is an important property, usually referred to as ‘data locality’, and GPUs exploit it through texture caches.

The pixel shader programming model has been designed to enable maximum parallelism. Each pixel is processed independently from any other pixel and no intercommunication is possible within a rendering pass. This model allows a high number of processors to be packed on a single chip without having to worry about their mutual interactions, as no data synchronization between them is ever necessary. This is also why the only atomic operations implicitly supported by current GPUs are related to alpha blending.

While this model has been incredibly successful, it’s not exactly optimal when we would like to share data and computations across many pixels. As a practical example, consider a simple NxN separable box filter, which can be implemented on a GPU in two passes; each pass requires fetching N samples and performing N-1 MADDs (or N-1 ADDs and one MUL) to compute a filtered pixel. This way of processing data is clearly sub-optimal: as we move over the image horizontally or vertically, each not-yet-filtered pixel shares N-1 neighboring pixels with the previously filtered one. Why wouldn’t we want to re-use that information? Unfortunately, on a GPU, we can’t.
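
To make the cost concrete, here is a rough sketch of the horizontal pass, written as a CUDA-style kernel for concreteness (the same structure applies to a pixel shader). The kernel name and parameters are mine, purely for illustration, and the image is assumed to be a single-channel float buffer. Every output pixel re-fetches all N taps, even though N-1 of them were just read for its neighbor:

// Horizontal pass of a naive NxN separable box filter (illustrative sketch).
// Each thread re-fetches all N taps for its pixel; nothing is shared or re-used.
__global__ void boxBlurHorizontal(const float* in, float* out,
                                  int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dx = -radius; dx <= radius; ++dx) {
        int sx = min(max(x + dx, 0), width - 1);   // clamp at the image borders
        sum += in[y * width + sx];                 // N = 2*radius+1 fetches per pixel
    }
    out[y * width + x] = sum / (2 * radius + 1);
}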

On the other hand, modern multi-core CPUs can easily outperform GPUs in this case. Each core can be assigned a non-overlapping area of the image; moreover, a pixel can accumulate its filtered value in a register and pass it on to the next pixel.
At each iteration we no longer need to fetch N samples: we can re-use the value generated by the previous iteration. As the filter kernel moves horizontally or vertically over the image, at each new step an old pixel has to be removed from the accumulator and a new one added to it, just like in a FIFO buffer where we remove the first entry and add a new one (the FIFO can be managed via register rotations or using a modulo N-1 pointer into an array of values). At each step only one new value has to be read from memory (two if we implement it with a rotating pointer) and only one value has to be written to memory (or two again). One addition, one subtraction and one multiplication are needed per iteration to filter a pixel, no matter how wide our box filter is! More complex separable filters such as tent or Gaussian filters can be easily implemented by passing over a column or a row of the image two or three times; for large filter kernels this is going to be incredibly fast since, again, the computation time no longer depends on the filter width.
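
As a rough sketch of the inner loop (plain C, scalar, one row, clamp-to-edge borders; names are mine for illustration, and here the outgoing pixel is simply re-read from the source row instead of being kept in a FIFO, which keeps the code short):

// Horizontal pass of a running-sum box filter over one row (illustrative sketch).
// Cost per output pixel: one add, one subtract, one multiply, regardless of the kernel width.
void box_filter_row(const float* in, float* out, int width, int radius)
{
    float scale = 1.0f / (2 * radius + 1);

    // Prime the accumulator with the first window (clamped at the left border).
    float acc = 0.0f;
    for (int dx = -radius; dx <= radius; ++dx)
        acc += in[dx < 0 ? 0 : dx];

    for (int x = 0; x < width; ++x) {
        out[x] = acc * scale;
        int incoming = x + radius + 1;                   // pixel entering the window
        int outgoing = x - radius;                       // pixel leaving the window
        if (incoming > width - 1) incoming = width - 1;  // clamp at the right border
        if (outgoing < 0) outgoing = 0;                  // clamp at the left border
        acc += in[incoming] - in[outgoing];
    }
}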

On a processor like CELL, a single SPU would probably be able to perform such a task on single-precision pixels at 4-5 cycles per pixel with fully pipelined and unrolled loops (data transfers from and to external memory are handled via asynchronous DMA operations). Six SPUs could apply a box filter with ANY kernel size (even the whole screen!) to a 720p image over 40 times per frame, at 60 frames per second, while consuming only a relatively small amount of memory bandwidth, as most read and write operations would happen on chip. This is admittedly a best-case scenario, and something slightly more complex would be necessary to also get very good accuracy on the filtered data, but it’s clear that an architecture like CELL is very well suited to this kind of data processing. I hope to see more and more games in the future implementing their full post-processing pipeline on CELL, offloading a ton of work from RSX, which could then spend more time doing what it does best (i.e. shading pixels which don’t share a great amount of data and computations!)

It’s interesting to note that the latest GPU architecture from NVIDIA (G8x) was also designed to be more efficient at tasks such as image post-processing. For the first time developers can leverage a small on-chip memory which is shared by a cluster of ALUs (each G8x GPU has a variable number of ALU clusters); each cluster has its own shared on-chip memory, which is split over multiple banks. As each memory bank can only serve one ALU per clock cycle, it’s certainly not straightforward to make a group of processors use this memory without incurring memory stalls or synchronization issues; nonetheless it looks like we are at least moving in the right direction. Unfortunately, at the time I’m writing this, I’m not aware of any way to use these capabilities via a graphics API such as D3D or OpenGL.
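
In CUDA terms (these capabilities are not exposed through D3D or OpenGL at the time of writing), the idea looks roughly like the sketch below: a thread block stages a row segment plus its apron in shared memory once, and every thread then reads its N taps from on-chip memory instead of re-fetching them from DRAM. Names and tile sizes are mine, chosen purely for illustration, and the kernel assumes one 128-thread block per 128-pixel output segment:

#define TILE   128   // output pixels per thread block (illustrative choice)
#define RADIUS 8     // filter radius, N = 2*RADIUS+1

// Horizontal box-filter pass staged through G8x shared memory (illustrative sketch).
__global__ void boxBlurSharedRow(const float* in, float* out, int width)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int y  = blockIdx.y;
    int x0 = blockIdx.x * TILE;                 // first output pixel of this block
    int tx = threadIdx.x;                       // 0 .. TILE-1

    // Cooperative load: each thread fetches its own pixel plus part of the apron.
    for (int i = tx; i < TILE + 2 * RADIUS; i += blockDim.x) {
        int sx = min(max(x0 + i - RADIUS, 0), width - 1);
        tile[i] = in[y * width + sx];
    }
    __syncthreads();                            // make the whole tile visible to every thread

    // Each tap now comes from shared memory, not from DRAM.
    float sum = 0.0f;
    for (int dx = -RADIUS; dx <= RADIUS; ++dx)
        sum += tile[tx + RADIUS + dx];

    int x = x0 + tx;
    if (x < width)
        out[y * width + x] = sum / (2 * RADIUS + 1);
}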

So... do you still think GPUs rock at image post-processing? ;)
Feel free to post your own take on the subject, rants and errata corrige.


13 Responses to “Why GPUs are not (so) good at post processing images”

  1. Andrew Lauritzen Says:

    Definitely an interesting topic, and one that comes up quite often at work (RapidMind) 🙂

    The fundamental idea that I think you’re getting at is one of different memory architectures, in particular comparing the “scratch pad” idea (fast, local, software-managed memories) to hardware-managed caches. It’s certainly easy to pick an algorithm that seems to benefit a lot from software knowledge of memory access patterns (like convolution).

    (Note that I’m intentionally leaving redundant computation out of this conversation because convolutions are the poster child of memory bound operations.)

    I’d generally agree that it’s nice to have control over how memory is managed (such as on Cell, and in CUDA/CTM) and, like most people, would think that this ability would make a huge difference to performance in many algorithms, convolution being a great example. However, I was speaking to someone (at I3D this year, I believe) who is studying exactly this topic (the efficiency of various memory architectures, particularly Cell-like vs GPU-like vs CPU-like) and he mentioned that, contrary to this popular belief, hardware-managed associative caches do almost as well in practice, and better in some cases!

    So after speaking at some length with him and thinking more on the problem, I’m no longer convinced that software-managed memory hierarchies are necessarily the holy grail that people have made them out to be. In particular they’re pretty ugly from a programming point of view (especially once you need to start coding for multiple cache levels), although they can actually be fairly nicely abstracted with nested program kernels (generalized program “recursion”).

    In any case it’s not as bad as you make it out to be… firstly the “low-level APIs” from both vendors (CUDA and CTM/CAL) allow a much more flexible use of memory. G80 actually has small local memories, but even on R600 you can read in blocks of pixel data, operate on all of them (sequentially in one thread) and spit out all of the output data quite efficiently. The latter method is pretty much exactly what you do on Cell anyways, although rather than a block DMA function you get hardware-managed caching and write coalescing. Arguably GPUs already lay out the data in a much more efficient way for this sort of operation (tiles) than the naive Cell implementation, so there’s a free benefit there as well.

    In any case, the various tradeoffs are discussed at length in NVIDIA’s separable convolution in CUDA whitepaper. It’s interesting to note that the complicated, optimized implementation that uses local memory only gains a 2x performance increase over the naive OpenGL implementation in the best case. Also note that that increase is a constant factor and does not scale asymptotically as the kernel radius increases, which demonstrates that the GPU cache is actually already pretty efficient for this sort of task.

    Anyways my reply is probably as long as your original post, but I’ll summarize it by saying that, from a fair amount of experience with this sort of thing, both GPUs and Cell can be pretty efficient at image post processing. In particular while software-managed “scratch pads” may appear to be great at first, they greatly increase code complexity (just read that whitepaper!) and it’s not yet clear whether they actually provide significant performance benefits over more typical hardware-managed cache hierarchies.

  2. Arun Demeure Says:

    Andrew: One worry I have with letting the GPU texture cache do its thing is that the problem is not just access patterns; it’s also datapaths/number of requests per cycle.

    With scratchpads like on G80 and CELL, you can retrieve a lot of data per cycle to your ALUs. It’s practically as fast as the register file on G80… On the other hand, for every point sampled RGBA8 request on the 8800 GTX, you have the time to do 9.4 scalar operations (excluding the MUL). For scalar FP32, this is doubled, for reasons I will not go into here.

    This is a topic that came up with Marco when I was discussing his shadowing algorithms with him. Even if he had 100% cache hits, my conclusion was that point sampling would be the bottleneck, not the ALUs.

    I just looked at the CUDA convolution example, and indeed, this confirms my suspicion:
    #if 0
    // try this to see the benefit of using shared memory
    int pixel = getPixel(g_data, x+dx, y+dy, imgw, imgh);
    #else
    int pixel = SMEM(r+tx+dx, r+ty+dy);
    #endif

    // only sum pixels within disc-shaped kernel
    float l = dx*dx + dy*dy;
    if (l <= r*r) {
    float r = float(pixel&0xff);
    float g = float((pixel>>8)&0xff);
    float b = float((pixel>>16)&0xff);
    #if 1
    // brighten highlights
    float lum = (r + g + b) / (255*3);
    if (lum > threshold) {
    r *= highlight;
    g *= highlight;
    b *= highlight;
    }
    #endif
    rsum += r;
    gsum += g;
    bsum += b;
    samples += 1.0;
    }

  3. lycium Says:

    this analysis only holds true for the very specialised O(1) box filtering algorithm.

  4. Marco Salvi Says:

    to Andrew: Any chance the researcher you spoke to co-authored this paper? We briefly discussed it on Beyond3D, and the common consensus was that it’s not entirely clear if the methodology they adopted for their tests makes complete sense (a cache should be less dense than a ‘standard’ memory, but it seems they didn’t account for that..)
    While I agree that (wisely) using a cache instead of a programmer-managed memory is way simpler and effective (e.g. you quickly reach some decent performance), I find it difficult to believe that it can be as fast as a local store in the general case.

    Moreover I have to agree with Uttar here: on G80 (and even more on R600..) your bottleneck is likely to be in the texture caches, while on CELL you can load a full vec4 per clock cycle and do some work on it knowing that no other event in the system can stop you from doing what you’re doing. With G80 and its local shared memory, things are much more complicated if you don’t want to find your ALUs stepping on each other’s feet all the time. (yeah, ALUs do have feet 🙂 )
    Anyway, given the company you work for, I guess you know this stuff much better than I do 🙂

    I understand your point of view, but as a console programmer I’m not scared of devoting a lot of time to fine-tuning my algorithms for a specific game and platform. At the same time I can appreciate that this model doesn’t work very well in many other industries, hence the need for solutions that might be less efficient but much more effective when real-world constraints are applied. (That’s why I think Intel is onto something with Larrabee and its programming model..)

    to lycium: you are right, what I wrote only applies to box filtering algorithms, but I was obviously not trying to imply it works with everything. Still, current GPUs can’t implement anything like that efficiently in one rendering pass, and we should also remember that box filters used recursively let us do a lot of nice things 🙂

  5. Andrew Lauritzen Says:

    Arun: Certainly the overhead of using a cache is something to be considered, but as I’m not a hardware guy I can’t really discuss the tradeoffs there. On G80 there is definitely a cost, as you note, to texture lookups even if they are cached (I’ve actually found that cost to be even higher on ATI cards, but I’ve not had a lot of time to play with the R600). It shouldn’t be overstated though, as we routinely see programs that access hundreds (and sometimes thousands!) of different elements, and the G80 still crunches through that fairly efficiently.

    Marco: The person I spoke to may have been one of the authors on that paper, but unfortunately I don’t remember his name 😦

    As I mentioned, it’s easy to construct/observe cases (even simple convolution) where it’s possible to use software-managed local memories to get speed increases (usually only on the order of 1-3 times in my experience, but that can be significant). Conversely, however, it’s also easy to construct cases that perform poorly on Cell in particular. For example, programs with random-access reads of global memory – particularly within tight control flow – perform poorly compared to GPUs, even with a software texture cache in local memory. Thus things like (fragment) shading on the SPUs are actually fairly inefficient, although certainly GPUs have been optimized for this case 🙂

    It’s true that box filters used recursively can do many cool things (approaching Gaussian kernels), but actually the “specialness” of that case has one other implication: there are better ways to evaluate medium-to-large box filters than “brute force”. I’d argue that the “equivalent” way (to your optimized local memories example) to evaluate a box filter on GPUs, for instance, is to build a summed-area table (separable parallel scan -> O(N) for NxN elements) and then use it to evaluate arbitrarily large filters (O(1) per element). Indeed this implementation will probably surpass the speed of even the optimized “local memories” version for sufficiently large filters, and boasts excellent memory coherence for the case of constant-sized filters.
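
    For illustration, once the table is built, the per-pixel work is just four lookups and three adds, independent of the filter size. A rough sketch in C (names are mine; it assumes an inclusive summed-area table already stored in sat[]):

    // Mean over the box [x0,x1] x [y0,y1] from an inclusive summed-area table
    // (sat[y*width + x] = sum of all pixels with xi <= x and yi <= y).
    float box_mean_sat(const float* sat, int width, int x0, int y0, int x1, int y1)
    {
        float A = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * width + (x0 - 1)] : 0.0f;
        float B = (y0 > 0)           ? sat[(y0 - 1) * width + x1]       : 0.0f;
        float C = (x0 > 0)           ? sat[y1 * width + (x0 - 1)]       : 0.0f;
        float D =                      sat[y1 * width + x1];
        float sum = D - B - C + A;                     // four corner lookups
        return sum / ((x1 - x0 + 1) * (y1 - y0 + 1));  // O(1), independent of box size
    }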

    That said, the most efficient way to implement “scan” on GPUs right now (at least in CUDA) involves using the local memories 🙂

    Anyways I’m still a fan of software-managed caches (particularly at RapidMind we often *know* what data is coming up next, and thus can use – for example – Cell hardware very efficiently, while we just have to trust GPUs to “do the right thing”), but it’s not clear to me that they are necessarily superior on a theoretical level. They may be, but I need more convincing 🙂

    The other orthogonal question is one of programming models. As I alluded to in the previous reply, there’s no reason why local memories have to be exposed as they are on Cell. Indeed there are nicer programming abstractions that can very effectively scale to N-level memory hierarchies without any application-programmer intervention. It may yet be another case where a higher-level abstraction will eventually lead to more efficient code, once the compilers are sufficiently mature.

    Anyways, interesting stuff 🙂 Your blog so far has a 100% hit ratio on extremely relevant and thought-provoking material… keep it up!

  6. Marco Salvi Says:

    Hello Andy, thanks again for your insights!
    Until the other day I didn’t know much about RapidMind, but I recently had the chance to attend a small presentation about the technology you work on and I must say I was really impressed. It seems a lot of scientists at RapidMind do like CELL 🙂
    I’m not sure if what I saw/heard is covered by some NDA so I won’t talk about it here, but AFAIK all the other companies working in the same field (e.g. Peakstream, Codeplay, etc.) are way behind you guys.
    I totally agree with you that there’s really no reason to expose CELL hw in the current ‘naive’ way; we need far better abstractions for it (and not just because a lot of programmers, at least in my industry, just don’t get it... no matter how hard you try to explain to them how it works…).

  7. Ignacio Castano Says:

    Andrew: I wouldn’t say that NVIDIA’s CUDA is a “low-level” API, quite the opposite. It lets you use a high level language derived from C. There are just some extensions to indicate whether functions are compiled for the device or the host, and to tell the compiler in what memory to allocate variables.

    It’s possible to use shared memory to optimize image processing operations the way Marco describes; in fact, we have a few SDK examples showing that:

    http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

    In particular, the separable convolution example comes with a white-paper that shows how to use shared memory to save bandwidth by loading elements shared by different pixels only once:

    http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf

  8. purpledog Says:

    I’d like to have some numbers with a non-separable kernel as well… All the explanation with the “apron” is really very interesting (this guy knows how to make clear figures) but I was kind of disappointed to realize that the code was actually 1D (convolutionSeparable.pdf, page 9).

  9. Manny Ko Says:

    I cannot agree with you more Marco.

  10. Cristina Says:

    Dear Mr.,

    I am looking for good material about texture caching on GPUs. I have some doubts about how it is actually done, whether by lines/columns or by regions. Can you point me to some?

    Thanks for your help,
    Cristina

  11. Marco Salvi Says:

    Hi Cristina,

    It’s not easy to find any public documentation or papers that show how a texture cache works in a GPU; most of the material I have read so far is covered by some NDA.
    As far as I understand, this work:
    http://portal.acm.org/citation.cfm?id=264152&dl=GUIDE&dl=ACM
    describes a cache architecture which is kind of similar to what you can find in modern GPUs.

    Marco

  12. graphsasy Says:

    I enjoy reading articles from your site. Keep on writing.

