A peek into the future of interactive computer graphics

Ars Technica’s recently published interview with game developer extraordinaire Tim Sweeney has given me the perfect excuse to finally sit down and write a few thoughts on the future of GPUs and real-time graphics in general.

In his interview Mr. Sweeney makes some interesting points about the next generation of graphics hardware & software architectures:

  1. 3D APIs as we know them are a thing of the past and will soon die, replaced by more flexible software rendering pipelines implemented with CUDA/Compute Shaders/OpenCL or other languages.
  2. (Some) fixed-function, non-programmable hardware units will still make sense for the foreseeable future.
  3. DirectX 9 was the last important revolution in 3D APIs; nothing that followed, or that will come next, will have such a dramatic impact on the lives of computer graphics engineers and researchers.
  4. A good auto-vectorizing C++ compiler on all next-gen platforms is perhaps all we need; developers will take care of the rest.
  5. Next-gen consoles might be based on a single massively parallel IC with general-purpose computing capabilities plus some fixed-function hardware units to speed up certain graphics-related tasks such as texture mapping or rasterization.

3D APIs

Regarding the first point, I believe Tim Sweeney’s view is quite optimistic: many developers will be neither able to implement their own rendering pipeline nor interested in doing so. 3D APIs as we know them will perhaps slowly lose their relevance, though I don’t think they are going to die anytime soon.

There is some chance that all the present and future big players in the market (namely NVIDIA, AMD, Intel and Microsoft) will agree on a common way to ‘hijack’ the current 3D pipeline, allowing developers to add new stages and to bypass old ones. This might sound like a good option for anyone who wants to be creative without entirely losing the benefits of something which is proven, works well and can be efficiently re-used. It might even open a whole new world of possibilities for middleware developers.

Fixed Function What?

For those skilled in the art, point number two is a no-brainer. For example, the dedicated logic in TMUs performs tasks such as texture addressing, fetching, decompression and filtering; if you have ever written a software renderer you have experienced first hand how most of these operations don’t lend themselves to easy and efficient software implementations.
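To get a feel for why, here is a deliberately naive sketch of a single bilinear fetch written in plain C++ (the Texture type is made up purely for illustration); a real TMU also handles wrap modes, mipmap selection, block decompression and format conversion on top of this, all in dedicated logic at a throughput that software struggles to match:

#include <algorithm>
#include <cmath>

// Naive single-channel bilinear fetch from an uncompressed texture,
// clamp-to-edge addressing. Hypothetical types, for illustration only.
struct Texture { const float* texels; int width; int height; };

float sample_bilinear(const Texture& tex, float u, float v)
{
    // Texel-space coordinates, centered on texel centers.
    const float x = u * tex.width  - 0.5f;
    const float y = v * tex.height - 0.5f;

    const int x0 = std::clamp(static_cast<int>(std::floor(x)), 0, tex.width  - 1);
    const int y0 = std::clamp(static_cast<int>(std::floor(y)), 0, tex.height - 1);
    const int x1 = std::min(x0 + 1, tex.width  - 1);
    const int y1 = std::min(y0 + 1, tex.height - 1);

    const float fx = x - std::floor(x);
    const float fy = y - std::floor(y);

    // Four dependent memory reads per fetch...
    const float t00 = tex.texels[y0 * tex.width + x0];
    const float t10 = tex.texels[y0 * tex.width + x1];
    const float t01 = tex.texels[y1 * tex.width + x0];
    const float t11 = tex.texels[y1 * tex.width + x1];

    // ...plus three lerps, for every single texture fetch a shader issues.
    const float top    = t00 + (t10 - t00) * fx;
    const float bottom = t01 + (t11 - t01) * fx;
    return top + (bottom - top) * fy;
}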

Custom rasterization hardware won’t likely disappear that soon either. Even Intel, which will not employ dedicated rasterization logic on Larrabee, agrees that “rasterization is unquestionably more efficient in dedicated logic than in software when running at peak rates”. That’s why fixed-function hardware will likely stay with us for many years to come.

It’s interesting to notice how with Larrabee Intel got rid of a long list of dedicated hardware blocks that have been part of GPUs for a long time. Here’s a list of the most important ones:

  • input assembly (already implemented in software on some GPUs).
  • pre- and post-transform vertex caches.
  • primitive assembly, culling & setup.
  • hierarchical z-buffer.
  • rasterization.
  • attribute interpolation (partially implemented in software on NVIDIA GPUs).
  • all output merge stages: alpha/stencil/depth tests, blending, alpha to coverage, etc.
  • color, z and stencil compression.
  • a plethora of obscure and relatively small FIFOs and caches.

Intel has clearly made a bold move here. They are taking huge risks and only time (and competition from other companies) will tell whether they are right or not.

Their software renderer seems to be incredibly well architected and it’s a pity we had to wait so many years to see a big player adopt a tile-based deferred renderer. One of the few advantages of TBDRs over immediate-mode renderers is that they can be more efficient at using programmable hardware and memory bandwidth, making some dedicated logic unnecessary. Say goodbye to color and z compression, and don’t forget to commemorate the output merge stages (aka ROPs) for all the good work they have done over the last 15-20 years!

Unfortunately we all know that nothing comes for free, and the increased flexibility will come at a certain cost (these kinds of bills are usually paid in perf/mm2 and perf/watt). On the other hand, giving up a big chunk of often-idle dedicated logic is a great way to fit more programmable hardware on board, which is inherently less likely to be inactive at any given time. A simple picture of NVIDIA’s GT200 can give a rough idea of how much area is spent on fixed-function units: at least a fourth of the chip area is devoted to non-programmable hardware.

DirectX

DirectX 9 was a huge step in the right direction, and DirectX 10 is helping to consolidate that step with its new render states, driver model and unified shading model. In contrast, for a variety of reasons that go from “what am I supposed to do with this?” to “it’s not a very clean design”, I am not exactly enamored with DX10’s geometry shaders or DX11’s three brand new tessellation stages. I think these recent developments show us that as we enter partially uncharted territory we don’t yet know which direction to take.

That’s why, as we move towards more flexible and open rendering pipelines, computer graphics researchers and game developers will unleash their imagination and come up with interesting new ideas. We will certainly see old but high-profile graphics research brought back to life (A-buffer, anyone?) and used in real-time applications such as video games. Perhaps in ten years or more, after long and fruitful experimentation, we will settle on a new and specific rendering pipeline model and it will be “The Wheel of Reincarnation” all over again!

Is CUDA good enough?

We will soon have at least three different CUDA-like languages to play with: CUDA, OpenCL and DX11’s compute shaders, and each of them seems to be well suited to exploiting data-level parallelism (DLP). Sweeney thinks we can fully implement a modern rendering pipeline with languages like C++ or CUDA, though I have a couple of concerns about CUDA-like languages:

  • CUDA’s memory model is complex and tied to NVIDIA’s hardware. Will it scale well on future hardware?
  • Many algorithms map poorly to DLP.

Conversely, I expect CUDA and its younger siblings to evolve quite rapidly and embrace other forms of parallelism (it seems OpenCL will support some sort of thread-level parallelism..), and herein lies my hope of seeing some major innovation in this area. Speaking of which, Intel is also working on the Ct programming language, which promises to breathe new life into the nested data-parallel programming paradigm. Notice how all these new languages are based on dynamic, JIT-style compilers: a necessary step in order to abstract code from specific hardware quirks, to maintain compatibility across the board and to ensure scalability over future hardware generations.

Tim Sweeney also advocates the use of auto-vectorizing compilers, which, it occurs to me, tend to be effective only at exploiting DLP and not much else. That’s perfect for pixel shaders and the like, not so good for all sorts of tasks that don’t need to work on a zillion entities or that need some control over how threads are created, scheduled and destroyed (unless you are brave enough to manually manage dozens or even hundreds of threads).
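To make the distinction concrete, here is a rough C++ sketch of the kind of loop an auto-vectorizer handles well, pure DLP with no dependencies between iterations, versus the kind of work it cannot help with:

// Pure data-level parallelism: a good auto-vectorizer can turn this loop
// into SIMD MADDs (possibly after checking that 'out' and 'in' don't alias).
void scale_and_bias(float* out, const float* in, float scale, float bias, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * scale + bias;
}

// What it cannot do: decide how to split a scene traversal, a physics solver
// or an animation update across threads, schedule them and keep them
// synchronized; that still has to be expressed explicitly by the programmer.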

Can One Chip Rule Them All?

Following Mr. Sweeney’s suggestion: how likely is it that in a few years we will see a first game console entirely based on a single chip, or at least on a single massively parallel architecture? I don’t want to dig too much into this extremely interesting topic as I would like to discuss it at length in a future post, but let me say that what is in the realm of possibility is not always feasible (yet).

A Glimpse Of The Future

In this long post I have been talking extensively about a future where the rendering pipeline is more general, more flexible and less tied to a specific hardware implementation, so it is perhaps time to show what this all means in terms of real change. I don’t want to take into consideration particularly exotic and unproven stuff, as I believe there is a lot of cool work to be done without having to throw the metaphorical baby out with the bathwater!

For instance, it has occurred to me many times that there is nothing inherently special about a stencil buffer that differentiates it from a color buffer or a z-buffer, unless we take into consideration the status it assumes in the rendering pipeline thanks to the stencil test. While fifteen years ago it made perfect sense to have such a hardwired capability, now it feels more like an old gimmick that was never improved while the rest of the pipeline was getting more modern and flexible.

While we are at it, what about alpha test, alpha blending and alpha-to-coverage? Why does the stencil buffer use just 8 bits per pixel? Why is the set of operations it supports so limited? Why can’t I have my own special alpha blending operations? And most of all, do these old features still make sense?

Of course they do, I use them all the time! But, as happens to many other engineers and researchers, I find myself fighting them on a daily basis while trying to work around their awkward limitations. There is a clear lack of generality and orthogonality with respect to the rest of the pipeline, and that’s why I am convinced that the whole set of output merge stages needs to be re-architected. We know that as the hardware evolves it gets rid of fixed-function units, but these changes won’t automagically fix the software layers that sit on top of it.

It would be nice if we could remove these features:

  • stencil buffer & stencil test
  • alpha blending, alpha test and alpha to coverage

and replace them with generic shaders that:

  • can be invoked before and/or after fragment shading
  • can read from and write to all render targets
  • can kill a fragment and/or generate a coverage mask for it (to avoid aliasing..)

For example, a stencil buffer would be just another render target (don’t forget we have had support for multiple render targets for years now), and these shaders could be automatically linked by the driver to the main fragment shader, or kept separate and executed in multiple stages. I have to admit that while I’m writing these few lines I have something like Larrabee and its software renderer in mind, but I wouldn’t be surprised if two years from now the rest of the graphics hardware landscape ends up being much more similar to Larrabee than to current GPUs.
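To make the idea a bit more concrete, here is a purely hypothetical sketch of such a stage, written as plain C++ in the spirit of a Larrabee-style software renderer; every type and function name here is made up for illustration and does not correspond to any existing API or hardware design:

#include <cstdint>

// Hypothetical programmable output-merge stage: the stencil buffer is just
// another render target, and stencil test + blending become ordinary code.
struct RenderTargets {
    float*    color;    // 4 floats (RGBA) per pixel
    float*    depth;    // 1 float per pixel
    uint32_t* stencil;  // now a generic integer target, any width we like
    int       width;
};

struct Fragment {
    int   x, y;
    float r, g, b, a;
    float z;
};

// Invoked after fragment shading; returning false kills the fragment.
bool output_merge(RenderTargets& rt, const Fragment& f, uint32_t stencil_ref)
{
    const int idx = f.y * rt.width + f.x;

    // "Stencil test": any comparison we like, on as many bits as we like.
    if (rt.stencil[idx] != stencil_ref)
        return false;

    // Depth test, still plain code.
    if (f.z > rt.depth[idx])
        return false;
    rt.depth[idx] = f.z;

    // "Alpha blending": any custom operator, not just the fixed-function set.
    float* dst = &rt.color[idx * 4];
    dst[0] = f.r * f.a + dst[0] * (1.0f - f.a);
    dst[1] = f.g * f.a + dst[1] * (1.0f - f.a);
    dst[2] = f.b * f.a + dst[2] * (1.0f - f.a);
    dst[3] = f.a;
    return true;
}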

Final Words

Even barring unforeseen breakthroughs in display devices, I believe no one knows for sure how we will do graphics 10-15 years from now. That’s why it is hard to disagree with Mr. Sweeney when he notes that the next few years are going to be very exciting for engineers and researchers!

Another day, another HDR rendering trick and some hope for the future.

Today I’m going to talk about an idea I came up with in Boston, at Siggraph 2006, while attending a couple of very inspiring lectures given by Jason Mitchell, Gary McTaggart and Chris Green (Valve Software).

Over the years they have played with a few different HDR rendering schemes, and one of the key insights from their work is that we can happily de-couple exposure and tone mapping computations, deferring the latter to the next frame (actually this idea was first suggested to me by Simon Brown, but that’s another story..).

This simple concept allowed the Valve guys to remove the classic full-screen tone mapping pass and to embed it directly in their single-pass shaders, using as exposure a value computed in previous frames, thus completely eliminating the need to output HDR pixels.

Since at that time there was basically no hardware around that could handle MSAA on floating-point render targets (oh gosh, just a few years ago!), they also got their HDR rendering implementation running with MSAA on relatively old hardware! Moreover, their method executes tone mapping and MSAA resolve in the correct order (tone mapping first, followed by AA resolve) with no extra performance cost, something that a lot of modern games still can’t get right today.

If you were not aware of Valve’s method you are now probably asking yourself how they managed to compute an exposure value to be used in a tone mapping operator if no HDR data is ever dumped to the frame buffer. Through image segmentation techniques they ‘simply’ try to determine whether the previous frame was under- or over-exposed, and a new exposure value is chosen to compensate for the problems of the previous frame(s).

While this method is very clever I have some problems with it. For instance, many tone mapping operators require determining exposure by computing the average logarithmic luminance of a relevant portion of the image, but it’s not possible to reliably determine this value using Valve’s approach. The HDR data is lost, and while in theory we might be able to compute a plausible exposure value by performing a search over multiple frames, in practice this is not easy at all. We might need to change the search direction over the exposure space to get closer to the exposure value we are looking for, and this would make the overall image brightness swing back and forth for a few frames, like a pendulum around its rest position. Monotonic searches are possible too, but they can only get you so close to the value we are looking for, especially if the image content is constantly changing!
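For reference, the quantity I have in mind is the classic log-average (geometric mean) luminance used by operators such as Reinhard’s; a minimal CPU-side sketch, assuming one linear luminance value per pixel, would look like this:

#include <cmath>
#include <cstddef>

// Log-average (geometric mean) luminance of an image; the small delta
// avoids log(0) on pure black pixels.
float log_average_luminance(const float* luminance, std::size_t num_pixels)
{
    const double delta = 1e-4;
    double sum = 0.0;
    for (std::size_t i = 0; i < num_pixels; ++i)
        sum += std::log(delta + luminance[i]);
    return static_cast<float>(std::exp(sum / static_cast<double>(num_pixels)));
}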

Having debated this issue with current and former colleagues I know this is a controversial point: some agree with me, some think it is not a big deal (and who knows, maybe they are even right!). On the other hand, while playing Valve’s masterpieces (this method was first introduced in HL2: Lost Coast) I can’t stop noticing how sometimes portions of the image look flat and seem to have lost their color details, giving me an overall flat and over- or under-saturated feeling (again, this is just a very personal and subjective opinion, feel free to disagree with me). This problem might be caused by a poor/overly simplified tone mapping operator (Valve games run great on not-so-powerful hardware and trade-offs have to be made) and/or by an incorrect exposure (gotcha!).

After this long introduction I wouldn’t be surprised if you have already arrived at the same idea I did: get rid of the exposure search through previous-frame feedback and compute exposure the proper way!

The feedback/image segmentation method was adopted because no HDR data is available, but even without re-introducing a floating-point buffer (or some funky color-space technique; see Christer Ericson’s blog entry about some of the work I did on Heavenly Sword and his very clever take on it) we can still generate the data we need using destination alpha. The idea is simple: compute logarithmic luminance on a per-pixel basis, encode it in some special format and output it to the alpha channel.

If we decide to support a certain luminance range [2^-minLogLum, 2^maxLogLum] we can compress and encode logarithmic luminance in our single-pass shaders using some fairly simple math:

// Constants mapping log2 luminance from [-minLogLum, maxLogLum] to [0, 1]:
float invLogLumRange = 1 / (maxLogLum + minLogLum);
float logLumOffset = minLogLum * invLogLumRange;

// Per pixel: remap the fragment's log2 luminance into [0, 1] for destination alpha.
float log_luminance = get_log_luminance( HDR_color ) * invLogLumRange + logLumOffset;

invLogLumRange and logLumOffset are constants that can be precomputed, so we just need a 3-way dot product, a scalar MADD and a logarithm to evaluate this formula. Explicitly clamping this expression between 0.0 and 1.0 is not necessary as the ROPs will do it for free anyway.

Since we only applied an affine transform to encode our log luminance, it is still correct to compute its average with multiple reduction passes, as we do when generating a mip-map chain, down to a 1×1 render target, as long as we remember to invert the encoding to retrieve a proper average logarithmic luminance value. Actually, it’s a good idea to do this last step on the CPU (since this computation can be deferred by one or two frames we should be able to lock this specific resource and read it back with the CPU without stalling either processor), so that we can set our exposure for the next frame’s color pass as a pixel shader constant, removing any extra math and texture sampling of the 1×1 log-luminance texture.
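A minimal sketch of that CPU-side step, assuming get_log_luminance() returns a base-2 logarithm and using a hypothetical, artist-tweakable ‘key’ (middle grey) value to turn the average into an exposure constant:

#include <cmath>

// Decode the 1x1 reduction result (average encoded alpha in [0, 1]) back
// into an average log2 luminance, then into an exposure scale factor.
float exposure_from_avg_alpha(float avgEncodedAlpha,
                              float minLogLum, float maxLogLum,
                              float key = 0.18f)   // hypothetical middle-grey key
{
    // Invert the affine encoding used in the shader.
    const float avgLogLum = avgEncodedAlpha * (maxLogLum + minLogLum) - minLogLum;
    const float avgLum    = std::exp2(avgLogLum);   // geometric mean luminance

    // One common choice: scale so the average luminance maps to 'key'.
    return key / avgLum;
}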

Unfortunately, almost no trick comes for free: if we use destination alpha to encode logarithmic luminance we can’t use it for other useful operations such as alpha blending and alpha-to-coverage (alpha test is still doable as long as we implement it in our shaders by invoking kill() or discard()). I’m not particularly worried about alpha blending: we can simply compute our average logarithmic luminance before we render transparent objects; those won’t contribute to the exposure computation, but I suspect this is not a big deal in many cases. The same trick can be applied to alpha-to-coverage objects, though I wouldn’t advocate it if we know we are going to render a lot of alpha-to-coverage stuff on screen (for example think about lots of trees, it’s probably not going to work well if we are working on a Robin Hood game..)

Now we are free to implement a lot of different tone mapping operators in our single-pass shaders, even if we are working on a deferred renderer, as long as its architecture can shade an opaque pixel for an arbitrary number of lights in a single pass, like in the ingenious scheme proposed by Pål-Kristian Engstad at Naughty Dog.

One last note: while I love (and always will..) finding new and unexpected ways to use graphics hardware, it’s clear to me that things are going to change soon, very soon. Shaders allow us to do almost anything, but they are still encapsulated in a rendering pipeline that dates back to the late 80s and that has gone almost unchanged for the last twenty years. When I was a student I used to write my own rendering pipelines (my beloved Amiga didn’t have a GPU..) which weren’t always based on z-buffers and rasterization (though I wrote so many rasterizers I lost count of them..), and I’m glad of the cyclical nature of hardware development, as we are now about to go back to the future and once again develop our own custom rendering architectures on top of the advancements of recent years. Only this time it’s going to be even more fun!

Why GPUs are not (so) good at post processing images

Now I know the title of this post may sound controversial, but keep on reading; you might even end up agreeing with me.

Post-processing effects are often implemented via convolutions, reductions and more sophisticated filters such as bilateral filters. What is interesting here is that the vast majority of these techniques have something in common: the data needed to work on a pixel is often required to work on neighboring pixels as well. This is an important property, usually referred to as ‘data locality’, and GPUs exploit it through texture caches.

The pixel shader programming model has been designed to enable maximum parallelism. Each pixel is processed independently of any other pixel and no intercommunication is possible within a rendering pass. This model makes it possible to pack a high number of processors on a single chip without having to worry about their mutual interactions, as no data synchronization between them is necessary.. ever! This is also why the only atomic operations implicitly supported by current GPUs are related to alpha blending.

While this model has been incredibly successful, it’s not exactly optimal when we would like to share data and computations across many pixels. As a practical example, consider a simple NxN separable box filter, which can be implemented on a GPU in two passes; each pass requires fetching N samples and performing N-1 MADDs (or N-1 ADDs and one MUL) in order to compute a filtered pixel. This way of processing data is clearly sub-optimal: as we move over the image horizontally or vertically, each not-yet-filtered pixel shares N-1 neighboring samples with the previously filtered pixel. Why wouldn’t we want to re-use that information? Unfortunately we can’t, on a GPU..

..On the other hand, modern multi-core CPUs can easily outperform GPUs in this case. Each core can be assigned to non-overlapping areas of the image; moreover, a pixel can accumulate its filtered value in a register and pass it on to the next pixel.
At each iteration we don’t need to fetch N samples anymore; we can in fact re-use the value generated by the previous iteration. As the filter kernel moves horizontally or vertically over the image, at each new step an old pixel has to be removed from the accumulator and a new one added to it. Like in a FIFO buffer, we remove the first entry and add a new one (the FIFO can be managed via register rotation or using a modulo-N pointer into an array of values). At each step only one new value has to be read from memory (two if we implement it with a rotating pointer) and only one value has to be written to memory (or two, again..). An addition, a subtraction and a multiplication are all that is needed per iteration to filter a pixel, no matter how wide our box filter is! More complex separable filters such as tent or Gaussian filters can easily be implemented by running this pass over each column or row of the image two or three times; for large filter kernels this is going to be incredibly fast as, again, the computation time no longer depends on the filter width.
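Here is a minimal sketch of that idea for a single row, in plain C++ (borders are simply clamped and the kernel width N is assumed to be odd); running it over every row and then over every column of the result gives the full NxN separable box filter:

// Sliding-window 1D box filter over one row: one add, one subtract and one
// multiply per pixel, independent of the kernel width N.
void box_filter_row(const float* src, float* dst, int width, int N)
{
    const int   radius = N / 2;                       // N assumed odd
    const float scale  = 1.0f / static_cast<float>(N);

    // Clamp-to-edge addressing keeps the sketch short.
    auto sample = [&](int x) {
        if (x < 0)      return src[0];
        if (x >= width) return src[width - 1];
        return src[x];
    };

    // Prime the accumulator with the first window, centered on pixel 0.
    float acc = 0.0f;
    for (int x = -radius; x <= radius; ++x)
        acc += sample(x);

    for (int x = 0; x < width; ++x) {
        dst[x] = acc * scale;
        // Slide the window: drop the oldest sample, add the next one.
        acc += sample(x + radius + 1) - sample(x - radius);
    }
}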

On a processor like CELL a single SPU would probably be able to perform such a task on single-precision pixels at 4-5 cycles per pixel with fully pipelined and unrolled loops (data transfers from and to external memory are handled via asynchronous DMA operations). 6 SPUs could apply a box filter of ANY kernel size (even the whole screen!) to a 720p image over 40 times per frame, at 60 frames per second, while consuming only a relatively small amount of memory bandwidth, as most read and write operations would happen on chip. This is admittedly a best-case scenario, and something slightly more complex would be necessary to also get very good accuracy on the filtered data, but it’s clear that an architecture like CELL is very well suited to this kind of data processing. I hope to see more and more games in the future implementing their full post-processing pipeline on CELL, off-loading a ton of work from RSX, which could then spend more time doing what it does best (e.g. shading pixels which don’t share a great amount of data and computations!)
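As a rough sanity check of those numbers (assuming a 3.2 GHz SPU clock and taking the optimistic 4 cycles-per-pixel figure):

#include <cstdio>

int main()
{
    // Assumed figures: 3.2 GHz SPU clock, 4 cycles/pixel, 1280x720, 60 Hz.
    const double spu_clock_hz   = 3.2e9;
    const double cycles_per_pix = 4.0;
    const double pixels         = 1280.0 * 720.0;
    const double fps            = 60.0;
    const int    num_spus       = 6;

    // Full-image passes one SPU can do per 60 Hz frame (~14.5).
    const double passes_per_spu = spu_clock_hz / (cycles_per_pix * pixels * fps);
    // All six SPUs (~87 passes); a separable box filter needs two passes.
    const double filters_per_frame = passes_per_spu * num_spus / 2.0;
    std::printf("~%.0f separable box filters per frame\n", filters_per_frame); // ~43
    return 0;
}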

It’s interesting to note that the latest GPU architecture from NVIDIA (G8x) was also designed to be more efficient at tasks such as image post-processing. For the first time developers can leverage a small on-chip memory which is shared by a cluster of ALUs (each G8x GPU has a variable number of ALU clusters..); each cluster has its own shared on-chip memory, which is split over multiple banks. As each memory bank can only serve one ALU per clock cycle, it’s certainly not straightforward to make a group of processors use this memory without incurring memory stalls or synchronization issues; nonetheless, it looks like we are at least going in the right direction. Unfortunately, at the time I’m writing this I’m not aware of any way to use these capabilities via a graphics API such as D3D or OpenGL.

So.. do you still think GPUs rock at image post-processing? ;)
Feel free to post your own take on the subject, rants and corrections.