Fast Percentage Closer Filtering on Deferred Cascaded Shadow Maps

Here’s a nice trick for anyone who has implemented deferred cascaded/parallel-split shadow maps (as a single pass that operates in screen space) on hardware that does not allow indexing the texture samplers associated with different view-space splits.

One way to address this issue is to store multiple shadow maps (usually one per split) in a single depth texture; the correct shadow map can then be sampled by computing (per pixel!) the texture coordinates for every split and selecting the right ones through predication. A second, quite similar method removes the predication step and replaces it with dynamic branches. This can end up being faster than the predication method on some hardware, especially if we have to select the correct sampling coordinates among many splits.
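
To make the predication variant concrete, here is a minimal CPU-side sketch in C++ (not shader code, and not taken from any particular engine). The `Float2` and `Split` types and the `selectAtlasCoord` function are hypothetical names introduced only for illustration, and the sketch assumes the candidate atlas coordinate for every split has already been computed for the pixel.

```cpp
#include <array>
#include <cstddef>

struct Float2 { float x, y; };

constexpr std::size_t kSplits = 4;

// Hypothetical per-pixel candidate for one split.
struct Split {
    float nearZ, farZ;   // view-space depth range [nearZ, farZ) covered by the split
    Float2 atlasCoord;   // where this pixel would land in the split's atlas region
};

// Predicated selection: every candidate is evaluated and blended with a
// 0/1 weight, so no data-dependent branch is needed to pick the split.
Float2 selectAtlasCoord(float viewZ, const std::array<Split, kSplits>& splits) {
    Float2 result{0.0f, 0.0f};
    for (const Split& s : splits) {
        const float w = (viewZ >= s.nearZ ? 1.0f : 0.0f) *
                        (viewZ <  s.farZ  ? 1.0f : 0.0f);
        result.x += s.atlasCoord.x * w;
        result.y += s.atlasCoord.y * w;
    }
    return result;
}
```

The cost to notice is that all four candidates are paid for on every pixel, even though only one of them survives the blend.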

But what if we want to take a variable number of PCF samples per split without using dynamic branching? (I love DB but it’s not exactly fast on all the hardware out there; it’s up to you to decide when it’s a good idea to use it.)
It is indeed possible to take a dynamic number of samples per pixel using a hardware feature that was initially introduced by NVIDIA to accelerate… stencil shadows! (Ironic, isn’t it?)

I am talking about the depth bounds test, a ‘new’ kind of depth test that does not involve the depth value of the incoming fragment but the depth value already written in the depth buffer at the screen-space coordinates of the incoming fragment. This stored depth value is checked against min and max depth values; when it is not contained within that depth range the incoming fragment gets discarded. Setting the min and max depth values around a shadow map split is an easy way to (early!) reject all the pixels that don’t fall within a certain depth interval. At this point we don’t need to compute multiple texture coordinates per pixel anymore: we can directly evaluate the sampling coordinates that map a pixel onto the shadow map associated with the current split and take a certain number of PCF samples from it.
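
As a rough illustration of the test described above, the following C++ sketch emulates it on the CPU and shows one way to turn a split’s view-space range into depth-buffer bounds. It assumes a D3D-style perspective projection that maps [zNear, zFar] to [0, 1] without reversed-Z; the helper names (`viewZToDepth`, `boundsForSplit`, `passesDepthBounds`) are made up for this example and are not part of any API.

```cpp
// Depth-buffer value for a view-space distance z, ASSUMING a perspective
// projection that maps zNear to 0 and zFar to 1 (D3D-style, not reversed).
float viewZToDepth(float z, float zNear, float zFar) {
    return (zFar * (z - zNear)) / (z * (zFar - zNear));
}

struct DepthBounds { float zMin, zMax; };

// Bounds to program before drawing the full-screen pass for one split.
DepthBounds boundsForSplit(float splitNear, float splitFar,
                           float zNear, float zFar) {
    return { viewZToDepth(splitNear, zNear, zFar),
             viewZToDepth(splitFar,  zNear, zFar) };
}

// CPU emulation of the test: it inspects the depth already stored in the
// depth buffer at the fragment's position, not the fragment's own depth.
bool passesDepthBounds(float storedDepth, DepthBounds b) {
    return storedDepth >= b.zMin && storedDepth <= b.zMax;
}
```

With zNear = 1 and zFar = 100, a split covering view-space distances 10 to 50 accepts a pixel whose stored depth corresponds to distance 20 and rejects one at distance 60. On OpenGL the resulting bounds would be handed to `glDepthBoundsEXT` from `EXT_depth_bounds_test`. As far as I know the hardware range is inclusive at both ends, so a pixel sitting exactly on a split boundary can be touched by two passes.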

Multiple rendering passes are obviously needed (one per split) to generate an occlusion term for the whole image, but this is generally not a big problem because the depth bounds test can happen before our pixel shader is evaluated, and multiple pixels can be rejected early per clock via hierarchical z-culling. (This approach won’t be fast if our image contains a lot of alpha-tested geometry, as this kind of geometry does not generate occluders in the on-chip hi-z representation, forcing the hardware to evaluate the pixel shader and only then reject a shaded pixel.)

The multipass approach can be a win because we can now use a different shader per shadow map split, making it possible to take a variable number of samples per split: typically we want to take more samples for pixels that are closer to the camera and fewer samples for distant objects. Another indirect advantage of this technique is improved texture cache usage, as the GPU won’t jump from one shadow map to another anymore. In fact, with the original method each pixel can map anywhere in our big shadow map, while going multipass forces the GPU to sample locations within the same shadow map/parallel split.
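
The per-split sample count can be sketched on the CPU as follows. This is only a reference model in C++, not the shader used in practice: the `ShadowMap` struct, the `pcf` function and the `kRadiusPerSplit` table are hypothetical, and the radii are illustrative values rather than recommendations.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical single-channel shadow map for one split, stored row-major.
struct ShadowMap {
    int width, height;
    std::vector<float> depth;   // light-space depth of the closest occluder
};

// Percentage closer filtering: compare first, then average the 0/1 results
// over a (2*radius+1)^2 texel window, clamped to the map edges.
float pcf(const ShadowMap& map, int x, int y, float receiverDepth, int radius) {
    float lit = 0.0f;
    int taps = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const int sx = std::max(0, std::min(x + dx, map.width - 1));
            const int sy = std::max(0, std::min(y + dy, map.height - 1));
            lit += (receiverDepth <= map.depth[sy * map.width + sx]) ? 1.0f : 0.0f;
            ++taps;
        }
    }
    return lit / static_cast<float>(taps);
}

// One entry per pass: near splits get a wider kernel, far splits a cheaper
// one (25, 9, 9 and 1 taps). Because each pass has its own fixed radius,
// no per-pixel branch on the sample count is required.
constexpr int kRadiusPerSplit[] = {2, 1, 1, 0};
```

In the multipass scheme, pass `i` would run its full-screen quad with the depth bounds of split `i` and a shader compiled for `kRadiusPerSplit[i]`, so the loop bounds are constants the compiler can unroll.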

I like this trick because, even though it doesn’t work on every GPU out there, it puts to use hardware that was designed to accelerate a completely different shadowing technique. Hope you enjoyed this post and, as usual, comments, ideas and constructive criticism are welcome!


23 Responses to “Fast Percentage Closer Filtering on Deferred Cascaded Shadow Maps”

  1. Aras Pranckevicius Says:

    As a side note, it’s not required to use predication or dynamic branching to select the right shadow map (this is important on vanilla ps2.0). Here is roughly what we do:

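    // 1.0 in each component whose split contains z, 0.0 elsewhere
    float4 near = float4( z >= _LightSplitsNear );
    float4 far = float4( z < _LightSplitsFar );
    float4 weights = near * far;
    // blend the four candidate coordinates; at most one weight is non-zero
    // when the splits do not overlap
    float4 coord =
        i._ShadowCoord[0] * weights[0] +
        i._ShadowCoord[1] * weights[1] +
        i._ShadowCoord[2] * weights[2] +
        i._ShadowCoord[3] * weights[3];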

  2. Marco Salvi Says:

    Hi Aras! Thanks for your post.
    I’m not sure I follow you here, because what you wrote in your comment is exactly what I call predication (or a form of it).
    Moreover how do you compute your _ShadowCoord?

  3. Marco Salvi Says:

    Aras: any chance you’re not computing your occlusion term with a full-screen pass (e.g. your shadow coords are computed in a vertex shader and then interpolated)? Because this is not the case I’m talking about; I would never advocate re-sending part or all of the scene geometry multiple times per frame to the GPU just to ‘accelerate’ this stage of the rendering pipeline. I don’t think it would be any faster on any GPU out there 🙂 Going to add in the article that I’m explicitly talking about occlusion term computations via screen-space operations.

  4. ShootMyMonkey Says:

    Either way, it seems to me you could do an approach like that anyway since you’ve got all your shadow maps in a single map, so you can sample all of them simultaneously, and simply weight the one you actually need to use with a comparison like that. I’m not sure what he’s doing with his _ShadowCoord, but you could do the same with the actual occlusion term. That’s actually probably what I’d try first before I considered branching, as experience has made me extremely paranoid of any and all dynamic branching on GPUs. I basically only use it for conditions on very sizeable blocks of code where the test is significantly more likely to fail than to pass.

  5. Marco Salvi Says:

    SMM: why would you want to sample all the shadow maps? It’s a costly operation (PCF samples are ‘slow’…) that can be avoided. Or am I misunderstanding your idea here?

  6. Stephen Hill Says:

    Ah, it all clicks now, as I recall you previously alluding to this on the B3D forums.

    Sadly depth bounds support is sorely missing from certain [other] console hardware, though you’ve inspired me to revisit this area; I can think of a decent hardware-specific hack that should yield similar benefits. Weeee!

  7. ShootMyMonkey Says:

    Well, I was referring to another form of the multiple cascades/splits in a single map. One of the things we experimented with at my previous job before I moved to the Bay was doing split shadowmaps for directional lights, where you basically took 1 sample in an RGBA map (tried various formats here) and each channel was the depth map of a separate split. The cost of constructing the map this way was relatively minor since you had to do multiple splits anyway. So basically taking one distance sample in the map covered 4 cascades in one 4-vector sample. You could accumulate PCF over 4-vector samples and then simply select the one that was actually valid with a simple dot product.

    We played around with this (although the guy who did most of the implementation preferred to use branching), but didn’t find it to perform that much faster than texture arrays, and the main reason we went with texture arrays in the end is that it ultimately, accompanied by a series of additional nudges, opened up 2 more available texture slots for the artists to use.

  8. purpledog Says:

    I don’t understand 😦

    Can you confirm that:
    – each split is a shadow map for a given range of the Zeye
    – the shadow term is generated (in a separate RTT) by n fullscreen quad passes (n is the number of splits).
    – for each pass, the range of EXT_depth_bounds_test is set to the associated split

    Sorry if that’s all obvious or all very stupid, but I feel like I’m missing some hidden assumption here…

  9. Marco Salvi Says:

    Purpledog: got it! this is exactly what HS does.. 🙂

  10. Marco Salvi Says:

    Stephen: I’m trying to guess what [other] console hardware feature/hack you’re thinking about… but I can’t come up with anything that makes sense. Can you drop any hint?

  11. Stephen Hill Says:

    I haven’t tested my theory, but I was thinking along the lines of destructively overwriting the depth buffer with each compositing slice, to update early rejection in the next pass via HiZ.

    There are some arcane low-level optimisations that can be leveraged here, which I believe were covered at Gamefest this year.

  12. Marco Salvi Says:

    Cool idea! (you can apply tricks like that on the ‘other’ console as well imo), now I have to dig a bit in Gamefest ppts, didn’t have the time to read them yet.

  13. Stephen Hill Says:

    I don’t know if it was covered in the end as the topics changed a little and it’s not mentioned in the abstracts any more. I’m awaiting the slides just like most other people! *hint hint*

  14. oladotunr Says:

    Thanks for the blog comment Marco!

    Great blog you got here!

    I’ve added a link to mine if that’s ok..

    (Ola)Dotun R

  15. Abdul Bezrati Says:

    The first post by Aras demonstrates how, by adding a few math ops in the pixel shader, we can work around the multiple shadow map texture sampling that I have seen in almost all working demos of PSSM.
    It is often true that the overhead of using math ops is much lower than that produced by texture reads.
    I will be implementing PSSM in the two games that I am currently working on simultaneously.
    The implementation that I am aiming for is packing three shadow maps of size 512^2 along a 512*1536 render target, or maybe two 512^2 and a single 1024^2 packed into a 1536*1024 target.
    Choosing the right target to sample from should take no more than 6 instructions, which isn’t terrible 🙂
    I was also thinking about adding VSM to the mix and applying Gaussian filtering to only the 1024*1024 portion of the atlas.

  16. Marco Salvi Says:

    Hi Abdul,

    unfortunately Aras hasn’t replied to my questions, but my original post was already taking for granted that we don’t want to sample multiple shadow maps just to reject the unwanted samples. My idea is about not computing multiple texture coordinates at all; moreover, I’m not sure how Aras’ approach works, because it seems he is passing sampling coordinates from the VS to the PS stage, which means he’s not using a full-screen deferred approach anyway. Are you referring to the same idea?
    You say that choosing the right render target to sample from only takes 6 instructions… but how do you compute the (per shadow map) texture coordinates in the first place? In a VS?
    If you wanna use VSM with PSSM, read Andrew Lauritzen’s article in GPU Gems 3; it will save you a lot of hard work 🙂

  17. Abdul Bezrati Says:

    It is true that I wasn’t talking about the deferred rendering approach; I was describing a general case where you do compute N different shadow texture coordinates and move them to the pixel shaders. At this stage we are doing the same work as all PSSM demos that I have seen out there; however, what I assumed was Aras’ idea substitutes the N – 1 sample fetches with roughly (N + 3-4) math instructions.
    I have a copy of GG3 sitting on my desk right now, so I’ll take a look at the SAT VSM chapter and I’ll get back to you.
    Thanks for the pointer ^_^

  18. Andrew Lauritzen Says:

    Check out the demo+source on the GPU Gems 3 DVD accompanying my chapter as well (Chapter 8 IIRC)… it includes a fully working version of parallel-split variance shadow maps (PSVSM) even though I didn’t cover them specifically in the chapter. I’m sure the implementation could be improved significantly (particularly the PSSM part), but it’s probably a good starting point, or at least useful to glance at.

    I also explained a few little issues that come up when using PSSM with properly filtered shadows (i.e. tex coord derivatives). I used a nice little hack to get around them – which I described in a post on Beyond3D – but you can always do it the “proper” way by just computing all of the split texture coordinates (N matrix-vector multiplies unfortunately, but that may not be critical) and their derivatives, and then predicating and using a derivative texture lookup.

    Also feel free to fire me an e-mail if you have any questions.

  19. Idetrorce Says:

    very interesting, but I don’t agree with you

  20. Marco Salvi Says:

    Hi Idetrorce,
    what are you referring to?

  21. pthiben Says:

    Hi Marco,

    Very interesting article. However I didn’t understand your remark:

    “Hi Aras! Thanks for your post.
    I’m not sure to follow you here cause what you wrote in your comment is exactly what I call predication (or a form of it)”.

    I believe he just uses the a0 register.

    Just to make things clear: I implemented a similar algorithm to Aras’s, which could be used in either a deferred or a forward shadowing system.
    Let’s say we work in a deferred system with 4 splits: you retrieve your per-pixel depth from the camera and deduce your world position in your PS. You send the 4 light matrices
    and use:
    float4 near = float4( zeye >= _LightSplitsNear );
    // counts the splits whose near plane is at or before zeye
    float index = dot(float4(1,1,1,1), near);
    Then you use index to choose among the 4 light matrices.
    So on a PS-SM3, this would add only a cmp and a dp4.

    Are you saying that performing multiple passes with this early out would be more efficient on “a certain platform” ?


  22. Marco Salvi Says:

    Hi pthiben,

    AFAIK SM3 doesn’t offer an index register in pixel shaders so it’s not possible to index a set of matrices as we are able to do with vertex shaders.
    I’ve just run a sanity check on DX documentation and I can’t see any reference to an indexing register in pixel shaders (SM3.0):

    ps_3_0 Registers

    The Cg/HLSL compiler might very well compile that code but in the end it will generate code to compute every possible outcome through predication.

    An early out might be more efficient on certain platforms, as the overhead associated with each pass can be very small.

  23. pthiben Says:

    Thanks for the information.
    My mistake about the relative addressing in SM 3.0: I was sure that it was present, but you’re right: it will probably be compiled through predication. Sorry about that.
