Why GPUs are not (so) good at post processing images

Now I know the title of this post may sound controversial, but keep on reading, you might even end up agreeing with me.
Post processing effects are often implemented via convolutions, reductions and more sophisticated filters such as bilateral filters. What is interesting here is that the vast majority of these techniques have something in common: the data needed to work on a pixel is often needed to work on its neighboring pixels as well. This is an important property, usually referred to as ‘data locality’, which GPUs exploit through texture caches.

The pixel shader programming model has been designed to enable maximum parallelism. Each pixel is processed independently of every other pixel and no intercommunication is possible within a rendering pass. This model makes it possible to pack a high number of processors on a single chip without having to worry about their mutual interactions, as no data synchronization between them is necessary.. ever! This is also why the only atomic operations implicitly supported by current GPUs are related to alpha blending.

While this model has been incredibly successful, it’s not exactly optimal when we would like to share data and computations across many pixels. As a practical example, consider a simple NxN separable box filter, which can be implemented on a GPU in two passes; each pass requires fetching N samples and performing N-1 MADDs (or N-1 ADDs and one MUL) in order to compute a filtered pixel. This way of processing data is clearly sub-optimal: as we move over the image horizontally or vertically, each not-yet-filtered pixel shares N-1 samples with the previously filtered pixel. Why wouldn’t we want to re-use that information? Unfortunately we can’t, on a GPU..
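
To make that cost concrete, here is a minimal sketch of what a single horizontal pass looks like when every pixel is filtered independently, GPU-style. It’s plain C++ rather than shader code, the function and parameter names are mine, and it assumes an odd kernel width N and clamp-to-edge borders:

    #include <algorithm>

    // Naive horizontal box filter pass: every output pixel re-fetches all
    // N samples of its window, even though N-1 of them were just read for
    // its left neighbor. This mirrors what a pixel shader has to do.
    void box_h_naive(const float* src, float* dst, int width, int height, int N)
    {
        const float inv_n = 1.0f / N;
        const int   k     = N / 2; // half window (N is odd)
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
            {
                float sum = 0.0f;
                for (int i = -k; i <= k; ++i) // fetch all N samples of the window
                {
                    int sx = std::min(std::max(x + i, 0), width - 1);
                    sum += src[y * width + sx];
                }
                dst[y * width + x] = sum * inv_n; // one final MUL
            }
    }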

..On the other hand, modern multi-core CPUs can easily outperform GPUs in this case. Each core can be assigned non-overlapping areas of the image; moreover, the running filtered value can be accumulated in a register and passed on to the next pixel.
At each iteration we no longer need to fetch N samples: we can re-use the value generated by the previous iteration. As the filter kernel moves horizontally or vertically over the image, at each new step an old pixel has to be removed from the accumulator and a new one added to it. It works like a FIFO buffer: we remove the oldest entry and add a new one (the FIFO can be managed via register rotations or a modulo-N pointer into an array of values). At each step only one new value has to be read from memory (two if we implement it with a rotating pointer) and only one value has to be written to memory (or two again..). An addition, a subtraction and a multiplication are needed per iteration to filter a pixel, no matter how wide our box filter is! More complex separable filters such as tent or Gaussian filters can easily be implemented by repeating the box pass over a column or a row of the image two or three times; for large filter kernels this is going to be incredibly fast since, again, the computation time no longer depends on the filter width.
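
Here is a minimal sketch of this sliding-window pass, again in plain C++ with hypothetical names, assuming an odd kernel width N and clamp-to-edge borders:

    #include <algorithm>

    // Horizontal box filter pass with a running sum: one add, one subtract
    // and one multiply per pixel, regardless of the kernel width N.
    void box_h_sliding(const float* src, float* dst, int width, int height, int N)
    {
        const float inv_n = 1.0f / N;
        const int   k     = N / 2; // half window (N is odd)
        for (int y = 0; y < height; ++y)
        {
            const float* row = src + y * width;
            auto fetch = [&](int ix) { return row[std::min(std::max(ix, 0), width - 1)]; };

            float acc = 0.0f; // prime the accumulator with the first window
            for (int i = -k; i <= k; ++i)
                acc += fetch(i);

            for (int x = 0; x < width; ++x)
            {
                dst[y * width + x] = acc * inv_n;
                acc += fetch(x + k + 1) - fetch(x - k); // slide the window by one
            }
        }
    }

Running this same pass two or three times in a row (and then along the columns) yields the tent and Gaussian-like responses mentioned above, still at a cost independent of the kernel width.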

On a processor like CELL a single SPU would probably be able to perform such a task on single-precision pixels at 4-5 cycles per pixel with fully pipelined and unrolled loops (data transfers from and to external memory are handled via asynchronous DMA operations). 6 SPUs could apply a box filter of ANY kernel size (even the whole screen!) to a 720p image over 40 times per frame, at 60 frames per second, while consuming only a relatively small amount of memory bandwidth, as most read and write operations would happen on chip. This is admittedly a best case scenario, and something slightly more complex would be necessary to also get very good accuracy in the filtered data, but it’s clear that an architecture like CELL is very well suited to this kind of data processing. I hope to see more and more games in the future implementing their full post processing pipeline on CELL, offloading a ton of work from RSX, which could then spend more time doing what it does best (e.g. shading pixels which don’t share a great amount of data and computations!)
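
As a back-of-envelope sanity check on those figures (taking the 3.2 GHz CELL SPU clock, ~4.5 cycles per pixel from above, and two passes per separable filter; the breakdown is mine, not measured):

    pixels per 720p image    : 1280 * 720        ≈ 0.92M
    cycles per pass          : 0.92M * 4.5       ≈ 4.15M
    cycles per SPU per frame : 3.2 GHz / 60 fps  ≈ 53.3M
    passes per SPU per frame : 53.3M / 4.15M     ≈ 12.8
    filters per frame        : 12.8 * 6 SPUs / 2 ≈ 38

which lands right around the 40 full-image filters per frame quoted above.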

It’s interesting to note that the latest GPU architecture from NVIDIA (G8x) was also designed to be more efficient at tasks such as image post processing. For the first time developers can leverage a small on-chip memory shared by a cluster of ALUs (each G8x GPU has a variable number of ALU clusters..); each cluster has its own shared on-chip memory, split across multiple banks. As each memory bank can only serve one ALU per clock cycle, it’s certainly not straightforward to make a group of processors use this memory without incurring memory stalls or synchronization issues; nonetheless it looks like we are at least heading in the right direction. Unfortunately, at the time I’m writing this I’m not aware of any way to use these capabilities via a graphics API such as D3D or OpenGL.

So.. do you still think GPUs rock at image post processing? ;)
Feel free to post your own take on the subject, rants and corrections.