Ars Technica’s recently published interview with game developer extraordinaire Tim Sweeney has given me the perfect excuse to finally sit down and write a few thoughts on the future of GPUs and real-time graphics in general.
In his interview Mr. Sweeney makes some interesting points about the next generation of graphics hardware & software architectures:
- 3D APIs as we know them are a thing of the past and will soon die, replaced by more flexible software rendering pipelines implemented with CUDA/Compute Shaders/OpenCL or other languages.
- (Some) fixed function/not programmable hardware units will still make sense for the foreseeable future.
- DirectX 9 has been the last important revolution in 3D APIs, everything that followed or that will come next won’t have such a dramatic impact on computer graphics engineers and researchers’ life.
- A good auto-vectorizing C++ compiler on all next gen platforms is perhaps all we need, developers will take care of the rest.
- Next gen consoles might be based on a single massively parallel IC with general purpose computing capabilities plus some fixed function hardware units to speed up certain graphics related tasks such as texture mapping or rasterization.
Regarding the first point I believe Tim Sweeney’s view is quite optimistic, many developers will neither be able nor interested in implementing their own rendering pipeline. 3D APIs, as we know them, will perhaps slowly lose their relevance, though I don’t think their premature death is going to happen anytime soon.
There is some chance that all present and future big players in the market (namely NVIDIA, AMD, Intel and Microsoft) will agree on a common way to ‘hijack’ the current 3D pipeline, allowing developers to add new stages and to bypass old ones. This might sound like a good option for whoever wants to be creative without having to entirely lose the benefits of something which is proven, works well and can be efficiently re-used. It might even open a whole new world of possibilities for middle-ware developers.
Fixed Function What?
For all those skilled in the arts point number two is a no brainer. For example TMUs’ dedicated logic performs tasks such as texture addressing, fetching, de-compression and filtering; If you have ever written a software renderer then you have experienced first hand how most of these operations are not amenable to be easily and efficiently implemented in software.
Custom rasterization hardware won’t likely disappear that soon either. Even Intel, that will not employ rasterization logic on Larrabee, agrees that “rasterization is unquestionably more efficient in dedicated logic than in software when running at peak rates“. That’s why fixed function hardware will likely stay with us for many years to come.
It’s interesting to notice how with Larrabee Intel got rid of a long list of dedicated hardware blocks that have been part of GPUs for a long time. Here’s a list of the most important ones:
- input assembly (already implemented in software on some GPUs).
- pre and post transformed vertex caches.
- primitive assembly, culling & setup.
- hierarchical z-buffer.
- attributes interpolation (partially implemented in software on NVIDIA GPUs).
- all output merge stages: alpha/stencil/depth tests, blending, alpha to coverage, etc.
- color, z and stencil compression.
- a plethora of obscure and relatively small fifos and caches.
Intel has clearly made a bold move here. They are taking huge risks and only time (and competition from other companies) will tell whether they are right or not.
Their software renderer seems to be incredibly well architected and it’s a pity we had to wait so many years to see a big player adopting a tile based deferred renderer. One of the few advantages of TBDRs over immediate mode renderers is that they can be more efficient at using programmable hardware and memory bandwidth, making some dedicated logic unnecessary. Say goodbye to color and z compression, and don’t forget to commemorate output merge stages (aka ROPs) for all the good work they have done over the last 15-20 years!
Unfortunately we all know that nothing is for free and increased flexibility will come at a certain cost (this kind of bills are usually paid in perf/mm2 and perf/watt). On the other hand, giving up a big chunk of often idle dedicated logic is a great way to have more & more programmable hardware on board, which is inherently less likely to be inactive at any given time. A simple picture of NVIDIA GT200 can give a rough idea of how much area is spent on fixed function units, as you can see at least a fourth of the chip area is devoted to non programmable hardware.
DirectX 9 was a huge step in the right direction, and DirectX 10 is helping consolidating that step adopting new render states, driver and unified shading models. In contrast, for a variety of reasons that go from <what am I supposed to do with this?> to <it’s not a very clean design> I am not exactly enamored with DX10′s geometry shaders or DX11′s three brand new tessellation stages. I think these recent developments show us that as we enter in partially uncharted territory we don’t know yet which direction should be taken.
That’s why as we move towards more flexible and open rendering pipelines computer graphics researchers and game developers will unleash their imagination and come up with new interesting ideas. We will certainly see old but high profile graphics research brought back to life again (A-buffer anyone?) and used in real-time applications such as video games. Perhaps in ten years or more, after long and fruitful experimentation, we will settle down for a new and specific rendering pipeline model and it will be “The Wheel of Reincarnation” all over again!
Is CUDA good enough?
We will soon have at least three different CUDA-like languages to play with: CUDA, OpenCL and DX11′s compute shaders and each of them seem to be well versed in exploiting data level parallelism. Sweeney thinks we can fully implement a modern rendering pipeline with languages like C++ or CUDA, though I have a couple of concerns about CUDA-like languagues:
- CUDA memory model is complex and it’s tied to NVIDIA hardware. Will it scale well on future hardware?
- Many algorithms map poorly to DLP.
Conversely I expect CUDA and its younger siblings to evolve quite rapidly and embrace other forms of parallelism (it seems OpenCL will support some sort of thread level parallelism..), and here lies my hope to see some major innovation in this area. Speaking of which Intel is also working on the Ct programming language that promises to breathe new life into the nested data parallel programming paradigm. Notice how all these new languages are based on dynamic JIT-style compilers: a necessary step in order to abstract code from specific hardware quirks, to maintain compatibility across the board and ensure scalability over next hardware generations.
Tim Sweeney also advocates the use of auto-vectorizing compilers, which occurs to me tend to be effective only at exploiting DLP and not much else. That’s perfect for pixel shaders et similia, not so good for all sort of tasks that don’t need to work on a zillion entities or that need some sort of control on how threads are created, scheduled and destroyed (unless you are brave enough to manually manage dozens or even hundreds of threads).
Can One Chip Rule Them All?
Following Mr. Swenney’s suggestion: how likely is to have in a few years a first game console entirely based on a single chip or at least on a single massively parallel architecture? I don’t want to dig too much into this extremely interesting topic as I would like to discuss it at length in a future post, but let me say that what is in the realm of possibility is not always feasible (yet).
A Glimpse Of The Future
In this long post I have been talking extensively about a future where a rendering pipeline is more general, flexible and less tied to a specific hardware implementation, so it is perhaps time to show what this all means in terms of real change. I don’t want to take in consideration particularly exotic and unproven stuff, as I believe there is a lot of cool work to be done without having to throw the metaphorical baby out with the water!
For instance, it occurred to me many times that there is nothing inherently special in a stencil buffer that diversifies it from a color buffer or a z-buffer, unless we take in consideration the status it assumes in the rendering pipeline thanks to the stencil test. While fifteen years ago made perfect sense to have such an hardwired capability, now it feels more like an old gimmick that was not improved over the years while the rest of the pipeline was getting more modern and flexible.
Since we are at it what about alpha test, alpha blending and alpha-to-coverage? Why is the stencil buffer just using 8 bit per pixel? Why is the set of operations it supports so limited? Why can’t I have my own special alpha blending operations? And most of all do these old features still make sense?
Of course they do, I use them all the time! But as it happens to many other engineers and researchers I find myself fighting them on a daily basis while trying to bypass their awkward limitations. There is clear lack of generality and orthogonality with respect to the rest of the pipeline, and that’s why I am convinced that the whole set of output merge stages need to be re-architect-ed. We know that as the hardware evolves it gets rid of fixed function units, but these changes won’t automagically fix the software layers that go on top of it.
It would be nice if we could remove these features:
- stencil buffer & stencil test
- alpha blending, alpha test and alpha to coverage.
- can be invoked before and/or after fragment shading
- can read from and write to all render targets
- can kill a fragment and/or generate a coverage mask for it (to avoid aliasing..)
For example a stencil buffer would be just another render target (don’t forget we had support for multiple render targets for years now) and these shaders could be automatically linked by the driver to the main fragment shader or kept separate and executed in multiple stages. I have to admit that while I’m writing these few lines I’m having something like Larrabee and its software renderer in mind, but I wouldn’t be surprised if in two years from now the rest of the graphics hardware landscape ends up being much more similar to Larrabee than current GPUs.
Even barring incidental display devices breakthroughs I believe no one knows for sure how we will do graphics in 10-15 years from now. That’s why it is hard to disagree with Mr. Sweeney when he notes that the next few years are going to be very exciting for engineers and researchers!