Another brief update: GDC08 and ShaderX6

How do you feel right before giving the first lecture of your life? The answer is: not well!

I felt so tense that I forgot half of what I wanted to say, and even though I was probably speaking a bit too fast I believe my lecture at GDC went quite well. The room was packed with people, despite the fact that was the last talk of the day. At the end of the presentation I was asked a fair number of pertinent questions, and I was kind of surprised to realize that some didn’t believe that such extremely simple approaches could possibly work!

For all those that couldn’t come be there Wolfgang Engel (organizer of the Core Techniques & Algorithms in Shader Programming day at GDC) has collected all the presentations in one handy page. You can also download my presentation about some new and exotic shadow mapping filtering schemes directly from this link. Comments were added to the most obscure slides and as usual feel free to ask questions, comments, point out errors, etc.. on this blog.

This talk was also represented the first occasion to publicly introduce Exponential Shadow Maps. The main idea has been going around for a while now (and it seems some highly anticipated game will soon ship with it..) but it wasn’t possible to fully explain the algorithm and why it works (or doesn’t work, in some case :) ) due the limited amount of time available.

On the other hand ShaderX6 is now widely available and you will be able to find in it a long and detailed article about ESM (and some sample code), among many other interesting contributions.

Time for a brief update

Four months ago I tried to stimulate your curiosity with this post.
Next month I will try to stimulate it even more with this talk at the upcoming Game Developers Conference in San Francisco.
A few simple approaches will be presented, though I will focus a bit more on a technique called ESM (short for Exponential Shadow Maps) which is also the subject of an article that will appear on ShaderX 6.
That’s all right now, hope to see you at GDC!

Fast Percentage Closer Filtering on Deferred Cascaded Shadow Maps

Here’s a nice trick for whoever has implemented (as a single pass that operates in screen space) deferred cascaded/parallel split shadow maps on hardware that does not allow to index texture samplers associated with different view space splits.

One way to address this issue is to store multiple shadow maps (usually one per split) into a single depth texture, then the correct shadow map can be sampled computing (per pixel!) all the possible texturing coordinates per each split and selecting the right one through predication. Another method quite similar to the first one involves removing the predication step and replacing it with dynamic branches. This can end up being faster the the predication method on some hardware, especially if we have to select the correct sampling coordinates among many splits.

But what if we want to take a variable amount of PCF samples per split without using dynamic branching? (I love DB but it’s not exactly fast on every hardware out there, is up to you decide when it’s a good idea to use it or not).
It’s indeed possible to take a dynamic number of samples per pixel using an hardware feature that was initially introduced by NVIDIA to accelerate…stencil shadows! (ironic, isn’t it?)

I am talking about the depth bounds test which is a ‘new’ kind of depth test that does not involve the depth value of the incoming fragment but the depth value written in the depth buffer at the screen space coordinates determined by the incoming fragment. This depth value is checked against some min and max depth values; when it’s not contained within a depth region the incoming fragment gets discarded. Setting the depth min and max values around a shadow map split is a easy way to (early!) reject all those pixels that don’t fall within a certain depth interval. At this point we don’t need to compute multiple texture coordinates per pixel anymore, we can directly evaluate the sampling coordinates that map a pixel onto the shadow map associated with the current split and take a certain number of PCFed samples from it.

Multiple rendering passes are obviously needed (one per split) to generate an occlusion term for the whole image but this is not generally a big problem cause the depth bounds test can happen before our pixel shader is evaluated and multiple pixels can be early rejected per clock via hierarchical z-culling (this approach won’t be fast if our image contains a lot of alpha tested geometry as this kind of geometry doesn’t not generate occluders in the on chip hi-z representation forcing the hardware to evaluate the pixel shader and eventually to reject a shaded pixel).

The multipass approach can be a win cause we can now use a different shader per shadow map split making possible to take a variable number of samples per split: typically we want to take more samples for pixels that are closer to the camera and less samples for distant objects. Another indirect advantage of this technique is the improved texture cache usage as the GPU won’t jump from a shadow map to another shadow map anymore. In fact with the original method each pixel can map anywhere in our big shadow map, while going multipass will force the GPU to sample locations within the same shadow map/parallel split.

I like this trick cause even though it doesn’t work on every GPU out there it puts to some use hardware that was designed to accelerate a completely different shadowing technique. Hope you enjoyed this post and as usual comments, ideas and constructive critics are welcome!

Why GPUs are not (so) good at post processing images

Now I know the title of this post may sound controversial, but keep on reading, you might even end up agreeing with me.
Post processing effects are often implemented via convolutions, reductions and more sophisticated filters such as bilateral filters. What is interesting here is that the vast majority of these techniques all have something in common: data needed to work on a pixel are often required to work on neighboring pixels as well. This is an important property which is usually referred as ‘data locality’ and GPUs exploit it through texture caches.

Pixel shaders programming model has been designed to enable maximum parallelism. Each pixel is processed independently from any other pixel and no intercommunication is possible within a rendering pass. This model allows to pack an high number of processors on a single chip without having to worry about their mutual interactions, as no data synchronization between them is necessary..ever! This is also why the only atomic operations implicitly supported by current GPUs are related to alpha blending.

While this model has been incredibly successful it’s not exactly optimal when we would like to share data and computations across many pixels. As a practical example we can consider a simple NxN separable box filter which can be implemented on a GPU in two passes; each pass requires to fetch N samples and perform N-1 MADDs (or N-1 ADDs and one MUL) in order to compute a filtered pixel. This way to process data is clearly sub-optimal as while we move over the image horizontally or vertically each not-yet-filtered pixel shares N-1 neighboring pixels with the previously filtered pixel. Why wouldn’t we want to re-use those informations? Unfortunately we can’t, on a GPU..

..On the other hand modern multi-core CPUs can easily outperform GPUs in this case. Each core could be assigned to not overlapping areas of the image, moreover a pixel can accumulate its filtered value in a register and pass it to the next pixel.
At each iteration we wouldn’t need to fetch N samples anymore, we can in fact re-use the value generated by the previous iteration. As the filter kernel moves horizontally or vertically over the image at each new step an old pixel as to be removed from the accumulator and a new one can be added to it. Like in a fifo buffer we remove the first entry in the fifo and we add a new one (the fifo can be managed via registers rotations or using a modulo N-1 pointer to an array of values). At each step only one new value has to be read from memory (two if we implement it with a rotating pointer) and only one value has to be written to memory (or two again..). An addition, a subtraction and a multiplication would be needed per iteration to filter a pixel, no matter how wide our box filter is! More complex separable filters as tent or gaussian filters can be easily implemented multi-passing the work done on a column or a row of the image two or three times; for large filter kernels this is going to be incredibly fast as again the computation time does not depend by the filter width anymore

On a processor like CELL a single SPU would probably be able to perform such a task on single precision pixels at 4-5 cycles per pixel with fully pipelined and unrolled loops (data transfer from and to external memory are handled via asynchronous DMA operations). 6 SPUs could apply a box filter with ANY kernel size (even the whole screen!) on a 720p image over 40 times per frame, at 60 frames per second, and consuming only a relatively small amount of memory bandwidth as most read and write operations would happen on chip. This is admittedly a best case scenario, something slightly more complex would be necessary to also have a very good accuracy on the filtered data, but it’s clear that an architecture like CELL is very well versed in this kind of data processing. I hope to see more and more games in the future implementing their full post processing pipeline on CELL and off loading this way a ton of work from RSX that could then spend more time doing what it does best (e.g. shading pixels which don’t share a great amount of data and computations!)

It’s interesting to note that the last GPU architecture from NVIDIA (G8x) was also designed to be more efficient at tasks as image post processing. For the first time developers can leverage on a small on chip memory which is shared by a cluster of ALUs (each G8x GPU has a variable number of ALUs clusters..), each cluster has its own shared on chip memory which is split over multiple banks. As each memory bank can only serve one ALU per clock cycle it’s not certainly straightforward to make a group of processors use this memory without incurring in memory stalls or synchronization issues, nonetheless it looks we are at least going toward the right direction. Unfortunately at the time I’m writing this I’m not aware of any way to use these capabilities via some graphics API as D3D or OpenGL.

So.. do you still think GPUs rock at image post processing? ;)
Feel free to post your own take on the subject, rants and errata corrige.