Another day, another HDR rendering trick and some hope for the future.

Today I’m going to talk about an idea I came up with in Boston, at Siggraph 2006, while attending a couple of very inspiring lectures given by Jason Mitchell, Gary McTaggart and Chris Green (Valve Software).

They have played over the years with a few different HDR rendering schemes and one of the key insights from their work is that we can happily de-couple exposure and tone mapping computations, deferring the latter to the next frame (actually this idea was first suggested to me by Simon Brown, that’s another story though..).

This simple concept allowed the Valve guys to remove the classic full screen tone mapping pass and embed it directly in their single pass shaders, using as exposure a value computed in previous frames, thus completely eliminating the need to output HDR pixels.

Since at that time there was basically no hardware around that could handle MSAA on floating point render targets (oh gosh, just a few years ago!) they also got their HDR rendering implementation running with MSAA on relatively old hardware! Moreover their method executes tone mapping and MSAA resolve in the correct order (tone mapping first, followed by AA resolve) with no extra performance cost, something that a lot of modern games still can’t get right today.

If you were not aware of Valve’s method you are now probably asking yourself how they managed to compute an exposure value to be used in a tone mapping operator if no HDR data is ever dumped to the frame buffer. Through image segmentation techniques they ‘simply’ try to determine whether the previous frame was under- or over-exposed, and a new exposure value is chosen to compensate for the problems detected in the previous frame(s).

While this method is very clever I have some problems with it. For instance, many tone mapping operators require determining exposure by computing the average logarithmic luminance of a relevant portion of the image, but it’s not possible to reliably determine this value with Valve’s approach. The HDR data is lost, and while in theory we might be able to compute a plausible exposure value by performing a search over multiple frames, in practice this is not easy at all. We might need to change the search direction over the exposure space to get closer to the exposure value we are looking for, and this would make the overall image brightness swing back and forth for a few frames, like a pendulum around its rest position. Monotonic searches are possible too, but they can only get us so close to the value we are looking for, especially if the image content is constantly changing!
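To be concrete, the quantity I’m referring to is the log-average luminance from Reinhard et al.’s photographic tone reproduction operator, where L_w(x,y) is the world luminance of a pixel, N the number of pixels considered and \delta a small constant that avoids the singularity on pure black pixels:

\bar{L}_w = \exp\left( \frac{1}{N} \sum_{x,y} \log\left( \delta + L_w(x,y) \right) \right)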

Having debated this issue with current and former colleagues I know this is a controversial point: some agree with me, some think it is not a big deal (and who knows, maybe they are even right!). On the other hand, while playing Valve’s masterpieces (this method was first introduced in HL2 Lost Coast) I can’t stop noticing how sometimes portions of the image look flat and seem to have lost their color details, giving me an overall flat and over- or under-saturated feeling (again, this is just a very personal and subjective opinion, feel free to disagree with me). This problem might be caused by a poor/overly simplified tone mapping operator (Valve games run great on not so powerful hardware and trade-offs have to be made) and/or by an incorrect exposure (gotcha!).

After this long introduction I wouldn’t be surprised if you had already arrived at the same idea I did: get rid of the exposure search driven by feedback from previous frames and compute it the proper way!

The feedback/image segmentation method has been adopted because no HDR data is available, but even without re-introducing a floating point buffer (or some funky color space technique, see Christer Ericson’s blog entry about some of the work I did on Heavenly Sword and his very clever take on it) we can still generate the data we need using destination alpha. The idea is simple: compute logarithmic luminance on a per pixel basis, encode it in some special format and output it to the alpha channel. 

If we decide to support a certain luminance range [2^-minLogLum, 2^maxLogLum] we can compress and encode logarithmic luminance in our single pass shaders using some fairly simple math:

float invLogLumRange = 1.0 / (maxLogLum + minLogLum); // reciprocal of the supported log2 luminance range
float logLumOffset = minLogLum * invLogLumRange;      // offset that maps -minLogLum to 0
float log_luminance = get_log_luminance( HDR_color ) * invLogLumRange + logLumOffset; // encoded in [0, 1]

invLogLumRange and logLumOffset are constants that can be precomputed, so we just need a 3-way dot product, a scalar MADD and a logarithm to evaluate this formula. Explicitly clamping this expression between 0.0 and 1.0 is not necessary as the ROPs will do it for free anyway.
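For reference, here is a minimal C-style sketch of the whole encoding with get_log_luminance filled in; the Rec. 709 luminance weights and the tiny epsilon are my own assumptions, and on the GPU the final clamp comes for free from the ROPs:

#include <cmath>

float encode_log_luminance(float r, float g, float b,
                           float invLogLumRange, float logLumOffset)
{
    const float luminance = 0.2126f * r + 0.7152f * g + 0.0722f * b; // 3-way dot product (Rec. 709 weights, assumed)
    const float logLum    = std::log2(std::fmax(luminance, 1e-6f));  // base-2 log, matching the 2^x luminance range
    return logLum * invLogLumRange + logLumOffset;                   // scalar MADD; the ROPs clamp the shader output to [0, 1]
}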

Since we only applied an affine transform to encode our log luminance, it is still correct to compute its average with multiple reduction passes, as we do when we generate a mip map chain, down to a 1×1 render target, as long as we remember to invert the encoding to retrieve a proper average log luminance value. Actually it’s a good idea to do this last step on the CPU (since this computation can be deferred by one or two frames we should be able to lock this specific resource and read it back with the CPU without stalling either processor) so that we can set our exposure for the next frame’s color pass as a pixel shader constant, removing any extra math and texture sampling from the 1×1 log luminance texture.
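This read-back step could look something like the sketch below; the assumption that exposure is derived from a Reinhard-style key over the average luminance is mine, your favourite tone mapping operator may want something different:

#include <cmath>

float compute_exposure(float encodedAvg,  // averaged encoded log luminance read back from the 1x1 target's alpha
                       float minLogLum, float maxLogLum,
                       float key)         // e.g. 0.18f, the scene 'middle grey'
{
    // Invert the affine encoding: encoded = logLum * invLogLumRange + logLumOffset
    const float avgLogLum = encodedAvg * (maxLogLum + minLogLum) - minLogLum;
    const float avgLum    = std::exp2(avgLogLum); // back to linear luminance (the shader used log2)
    return key / avgLum;                          // pixel shader constant for the next frame's color pass
}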

Unfortunately almost no trick comes for free: if we use destination alpha to encode logarithmic luminance we can’t use it for other useful operations such as alpha blending and alpha to coverage (alpha test is still doable as long as we implement it in our shaders invoking kill() or discard()). I’m not particularly worried about alpha blending: we can simply compute our average logarithmic luminance before we render transparent objects. They won’t contribute to the exposure computations, but I suspect this is not a big deal in many cases. The same trick can be applied to alpha to coverage objects, though I wouldn’t advocate it if we know we are going to render a lot of alpha to coverage stuff on screen (for example think about lots of trees; it’s probably not going to work well if we are working on a Robin Hood game..)

Now we are free to implement a lot of different tone mapping operators in our single pass shaders, even if we are working on a deferred renderer, as long as its architecture can shade an opaque pixel for an arbitrary number of lights in a single pass, like in the ingenious scheme proposed by Pål-Kristian Engstad at Naughty Dog.

One last note: while I love (and I always will..) finding new and unexpected ways to use graphics hardware, it’s clear to me things are going to change soon, very soon. Shaders allow us to do almost anything, but they are still encapsulated in a rendering pipeline that dates back to the late 80s and that has gone almost unchanged for the last twenty years. When I was a student I used to write my own rendering pipeline (my beloved Amiga didn’t have a GPU..) which wasn’t always based on z-buffer and rasterization (though I wrote so many rasterizers I lost count of them..), and I’m glad of the cyclical nature of hardware development, as we are now about to go back to the future and once again develop our own custom rendering architectures on top of recent years’ advancements. Only this time it’s going to be even more fun!

Fast Percentage Closer Filtering on Deferred Cascaded Shadow Maps

Here’s a nice trick for whoever has implemented (as a single pass that operates in screen space) deferred cascaded/parallel split shadow maps on hardware that does not allow indexing the texture samplers associated with different view space splits.

One way to address this issue is to store multiple shadow maps (usually one per split) into a single depth texture; the correct shadow map can then be sampled by computing (per pixel!) all the possible texturing coordinates for each split and selecting the right ones through predication. Another method, quite similar to the first one, involves removing the predication step and replacing it with dynamic branches. This can end up being faster than the predication method on some hardware, especially if we have to select the correct sampling coordinates among many splits.
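To make the first variant concrete, here is a C-style sketch of the per-pixel selection logic (in a real implementation this lives in the pixel shader; the names and the four-split setup are purely illustrative):

const int NUM_SPLITS = 4;

struct ShadowCoords { float u, v, depth; };

ShadowCoords selectSplitCoords(float viewSpaceDepth,
                               const ShadowCoords perSplitCoords[NUM_SPLITS], // coordinates computed for every split
                               const float splitFarPlane[NUM_SPLITS])         // far distance of each split
{
    // Start from the farthest split and let nearer splits override the result:
    // each step is a conditional select (predication), not a branch.
    ShadowCoords result = perSplitCoords[NUM_SPLITS - 1];
    for (int i = NUM_SPLITS - 2; i >= 0; --i)
        if (viewSpaceDepth <= splitFarPlane[i])
            result = perSplitCoords[i];
    return result;
}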

But what if we want to take a variable amount of PCF samples per split without using dynamic branching? (I love DB but it’s not exactly fast on every piece of hardware out there; it’s up to you to decide when it’s a good idea to use it or not.)
It’s indeed possible to take a dynamic number of samples per pixel using a hardware feature that was initially introduced by NVIDIA to accelerate…stencil shadows! (ironic, isn’t it?)

I am talking about the depth bounds test, a ‘new’ kind of depth test that does not involve the depth value of the incoming fragment but the depth value written in the depth buffer at the screen space coordinates determined by the incoming fragment. This depth value is checked against some min and max depth values; when it’s not contained within that depth region the incoming fragment gets discarded. Setting the depth min and max values around a shadow map split is an easy way to (early!) reject all those pixels that don’t fall within a certain depth interval. At this point we don’t need to compute multiple texture coordinates per pixel anymore: we can directly evaluate the sampling coordinates that map a pixel onto the shadow map associated with the current split and take a certain number of PCFed samples from it.

Multiple rendering passes are obviously needed (one per split) to generate an occlusion term for the whole image, but this is generally not a big problem because the depth bounds test can happen before our pixel shader is evaluated and multiple pixels can be early rejected per clock via hierarchical z-culling (this approach won’t be fast if our image contains a lot of alpha tested geometry, as this kind of geometry doesn’t generate occluders in the on-chip hi-z representation, forcing the hardware to evaluate the pixel shader and eventually to reject a shaded pixel).

The multipass approach can be a win because we can now use a different shader per shadow map split, making it possible to take a variable number of samples per split: typically we want to take more samples for pixels that are closer to the camera and fewer samples for distant objects. Another indirect advantage of this technique is improved texture cache usage, as the GPU won’t jump from one shadow map to another anymore. In fact with the original method each pixel can map anywhere in our big shadow map, while going multipass forces the GPU to sample locations within the same shadow map/parallel split.
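Putting it all together, the screen space pass could be driven by a loop like the following OpenGL-flavoured sketch (bindDeferredShadowShader and drawFullScreenQuad are hypothetical engine helpers, and I’m assuming something like GLEW exposes GL_EXT_depth_bounds_test):

#include <GL/glew.h>

void bindDeferredShadowShader(int split, int pcfTaps); // hypothetical: shader variant taking 'pcfTaps' PCF samples
void drawFullScreenQuad();                             // hypothetical

void renderDeferredShadowTerm(const float* splitZMin, const float* splitZMax, // window-space (0..1) depth bounds per split
                              const int* pcfTaps, int numSplits)              // per-split sample counts
{
    glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
    for (int split = 0; split < numSplits; ++split)
    {
        // Pixels whose stored depth falls outside these bounds are rejected
        // before the pixel shader runs, ideally whole tiles at a time via hi-z.
        glDepthBoundsEXT(splitZMin[split], splitZMax[split]);

        bindDeferredShadowShader(split, pcfTaps[split]); // more taps for the splits closer to the camera
        drawFullScreenQuad();
    }
    glDisable(GL_DEPTH_BOUNDS_TEST_EXT);
}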

I like this trick because even though it doesn’t work on every GPU out there, it puts to some use hardware that was designed to accelerate a completely different shadowing technique. Hope you enjoyed this post and as usual comments, ideas and constructive criticism are welcome!

A simple way to pre-compute sin/cos tables

This is kind of an old school trick, rarely useful nowadays, but it’s nonetheless interesting due to its simplicity and elegance.
The main idea is to fill a table with pre-computed values of \sin a and \cos a within a certain interval, using infinitesimal rotations. We start by working in two dimensions with a clockwise rotation matrix:

R = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}

For very small rotation angles \alpha \to 0, so we can expand our rotation matrix elements with a Taylor series around zero: \cos\alpha \approx 1 and \sin\alpha \approx \alpha.
Ignoring all the non-linear terms we get something like this:

R' = \begin{pmatrix} 1 & -\alpha \\ \alpha & 1 \end{pmatrix}

Now we can generate our values on an interval \left[ s, t \right) by recursively applying R' to a unit vector \hat{u} = ( \cos s, \sin s ).

\hat{u}' = \hat{u} R'

= ( u_x, u_y ) \begin{pmatrix} 1 & -\alpha \\ \alpha & 1 \end{pmatrix}

= \begin{cases} u'_x = u_x + u_y \alpha \\ u'_y = -u_x \alpha + u_y \end{cases}

Supposing that we don’t have at hand any code or library that can compute \hat{u} for an arbitrary value of s, we start with s = 0 \to \hat{u} = ( 1, 0 ).
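As a minimal sketch, this is how the table fill could look in C; the choice of the interval [0, 2\pi) and of the table layout is mine, and since R' is a clockwise rotation the y component tracks -\sin, so its sign is flipped when storing (keep in mind the linearized step accumulates a small error over long tables):

#include <cstddef>

void build_sincos_table(float* cosTable, float* sinTable, std::size_t count)
{
    const float alpha = 2.0f * 3.14159265358979f / static_cast<float>(count); // angle step between entries
    float ux = 1.0f; // cos 0
    float uy = 0.0f; // starts at sin 0; the clockwise recursion drives it towards -sin
    for (std::size_t i = 0; i < count; ++i)
    {
        cosTable[i] = ux;
        sinTable[i] = -uy;
        // u' = u R', exactly the recursion derived above
        const float nx =  ux + uy * alpha;
        const float ny = -ux * alpha + uy;
        ux = nx;
        uy = ny;
    }
}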