Michael Abrash’s First Look at the Larrabee New Instructions (LRBni)

Dr. Dobb’s just published the first of a new series of detailed articles related to Larrabee hardware and software architecture written by Michael Abrash.

A First Look at the Larrabee New Instructions (LRBni)

Larrabee is an architecture, rather than a product, with three distinct aspects — many cores, many threads, and a new vector instruction set — that boost performance. This architecture will first be used in GPUs, and could be used in CPUs as well.


C++ implementation of Larrabee new instructions

Intel has recently released a C++ implementation of the Larrabee new instructions (LRBni). Grab it if you like to stay ahead of the curve and want to start writing software for LRB right now, or if you simply want a glimpse of the LRB architecture.

Prototype Primitives Guide



This .inl file provides a C++-implementation of the Larrabee new instructions.  It allows developers to experiment with developing Larrabee code without a Larrabee compiler and without Larrabee hardware. It does not attempt to match the Larrabee new instructions with respect to exceptions, flags, bit-precision, or memory alignment restrictions. Disclaimer: the exact syntax and semantics of the functions shown here are not guaranteed to be supported in future Larrabee hardware and software products.


A peek into the future of interactive computer graphics

Ars Technica’s recently published interview with game developer extraordinaire Tim Sweeney has given me the perfect excuse to finally sit down and write a few thoughts on the future of GPUs and real-time graphics in general.

In his interview Mr. Sweeney makes some interesting points about the next generation of graphics hardware & software architectures:

  1. 3D APIs as we know them are a thing of the past and will soon die, replaced by more flexible software rendering pipelines implemented with CUDA/Compute Shaders/OpenCL or other languages.
  2. (Some) fixed function/not programmable hardware units will still make sense for the foreseeable future.
  3. DirectX 9 has been the last important revolution in 3D APIs, everything that followed or that will come next won’t have such a dramatic impact on computer graphics engineers and researchers’ life.
  4. A good auto-vectorizing C++ compiler on all next gen platforms is perhaps all we need, developers will take care of the rest.
  5. Next gen consoles might be based on a single massively parallel IC with general purpose computing capabilities plus some fixed function hardware units to speed up certain graphics related tasks such as texture mapping or rasterization.


Regarding the first point, I believe Tim Sweeney’s view is quite optimistic: many developers will be neither able nor interested in implementing their own rendering pipeline. 3D APIs as we know them will perhaps slowly lose their relevance, though I don’t think their premature death is going to happen anytime soon.

There is some chance that all present and future big players in the market (namely NVIDIA, AMD, Intel and Microsoft) will agree on a common way to ‘hijack’ the current 3D pipeline, allowing developers to add new stages and to bypass old ones. This might sound like a good option for whoever wants to be creative without having to entirely lose the benefits of something which is proven, works well and can be efficiently re-used. It might even open a whole new world of possibilities for middle-ware developers.

Fixed Function What?

For all those skilled in the art, point number two is a no-brainer. For example, the dedicated logic in TMUs performs tasks such as texture addressing, fetching, decompression and filtering. If you have ever written a software renderer then you have experienced first hand how hard most of these operations are to implement efficiently in software.

Custom rasterization hardware is not likely to disappear that soon either. Even Intel, which will not employ rasterization logic on Larrabee, agrees that “rasterization is unquestionably more efficient in dedicated logic than in software when running at peak rates”. That’s why fixed function hardware will likely stay with us for many years to come.

It’s interesting to notice how with Larrabee Intel got rid of a long list of dedicated hardware blocks that have been part of GPUs for a long time. Here’s a list of the most important ones:

  • input assembly (already implemented in software on some GPUs).
  • pre and post transformed vertex caches.
  • primitive assembly, culling & setup.
  • hierarchical z-buffer.
  • rasterization.
  • attributes interpolation (partially implemented in software on NVIDIA GPUs).
  • all output merge stages: alpha/stencil/depth tests, blending, alpha to coverage, etc.
  • color, z and stencil compression.
  • a plethora of obscure and relatively small fifos and caches.

Intel has clearly made a bold move here. They are taking huge risks and only time (and competition from other companies) will tell whether they are right or not.

Their software renderer seems to be incredibly well architected, and it’s a pity we had to wait so many years to see a big player adopt a tile based deferred renderer. One of the few advantages of TBDRs over immediate mode renderers is that they can use programmable hardware and memory bandwidth more efficiently, making some dedicated logic unnecessary. Say goodbye to color and z compression, and don’t forget to commemorate output merge stages (aka ROPs) for all the good work they have done over the last 15-20 years!

Unfortunately we all know that nothing comes for free, and increased flexibility will come at a certain cost (these kinds of bills are usually paid in perf/mm² and perf/watt). On the other hand, giving up a big chunk of often idle dedicated logic is a great way to have more and more programmable hardware on board, which is inherently less likely to be inactive at any given time. A simple picture of NVIDIA’s GT200 can give a rough idea of how much area is spent on fixed function units: as you can see, at least a fourth of the chip area is devoted to non-programmable hardware.


DirectX 9 was a huge step in the right direction, and DirectX 10 is helping to consolidate that step by adopting new render state, driver and unified shading models. In contrast, for a variety of reasons that range from <what am I supposed to do with this?> to <it’s not a very clean design>, I am not exactly enamored with DX10’s geometry shaders or DX11’s three brand new tessellation stages. I think these recent developments show us that, as we enter partially uncharted territory, we don’t know yet which direction should be taken.

That’s why as we move towards more flexible and open rendering pipelines computer graphics researchers and game developers will unleash their imagination and come up with new interesting ideas. We will certainly see old but high profile graphics research brought back to life again (A-buffer anyone?) and used in real-time applications such as video games. Perhaps in ten years or more, after long and fruitful experimentation, we will settle down for a new and specific rendering pipeline model and it will be “The Wheel of Reincarnation” all over again!

Is CUDA good enough?

We will soon have at least three different CUDA-like languages to play with: CUDA, OpenCL and DX11’s compute shaders, and each of them seems to be well versed in exploiting data level parallelism (DLP). Sweeney thinks we can fully implement a modern rendering pipeline with languages like C++ or CUDA, though I have a couple of concerns about CUDA-like languages:

  • CUDA memory model is complex and it’s tied to NVIDIA hardware. Will it scale well on future hardware?
  • Many algorithms map poorly to DLP.

Conversely, I expect CUDA and its younger siblings to evolve quite rapidly and embrace other forms of parallelism (it seems OpenCL will support some sort of thread level parallelism), and here lies my hope to see some major innovation in this area. Speaking of which, Intel is also working on the Ct programming language, which promises to breathe new life into the nested data parallel programming paradigm. Notice how all these new languages are based on dynamic JIT-style compilers: a necessary step in order to abstract code from specific hardware quirks, to maintain compatibility across the board and to ensure scalability across future hardware generations.

Tim Sweeney also advocates the use of auto-vectorizing compilers, which, it occurs to me, tend to be effective only at exploiting DLP and not much else. That’s perfect for pixel shaders et similia, not so good for all sorts of tasks that don’t need to work on a zillion entities or that need some sort of control over how threads are created, scheduled and destroyed (unless you are brave enough to manually manage dozens or even hundreds of threads).

Can One Chip Rule Them All?

Following Mr. Sweeney’s suggestion: how likely is it that in a few years we will have a first game console entirely based on a single chip, or at least on a single massively parallel architecture? I don’t want to dig too much into this extremely interesting topic as I would like to discuss it at length in a future post, but let me say that what is in the realm of possibility is not always feasible (yet).

A Glimpse Of The Future

In this long post I have been talking extensively about a future where the rendering pipeline is more general, more flexible and less tied to a specific hardware implementation, so it is perhaps time to show what this all means in terms of real change. I don’t want to take into consideration particularly exotic and unproven stuff, as I believe there is a lot of cool work to be done without having to throw the metaphorical baby out with the bathwater!

For instance, it has occurred to me many times that there is nothing inherently special in a stencil buffer that distinguishes it from a color buffer or a z-buffer, unless we take into consideration the status it assumes in the rendering pipeline thanks to the stencil test. While fifteen years ago it made perfect sense to have such a hardwired capability, now it feels more like an old gimmick that was not improved over the years while the rest of the pipeline was getting more modern and flexible.

Since we are at it, what about alpha test, alpha blending and alpha-to-coverage? Why does the stencil buffer use just 8 bits per pixel? Why is the set of operations it supports so limited? Why can’t I have my own special alpha blending operations? And most of all, do these old features still make sense?

Of course they do, I use them all the time! But like many other engineers and researchers I find myself fighting them on a daily basis while trying to work around their awkward limitations. There is a clear lack of generality and orthogonality with respect to the rest of the pipeline, and that’s why I am convinced that the whole set of output merge stages needs to be re-architected. We know that as the hardware evolves it gets rid of fixed function units, but these changes won’t automagically fix the software layers that sit on top of it.

It would be nice if we could remove these features:

  • stencil buffer & stencil test
  • alpha blending, alpha test and alpha to coverage.

and replace them with generic shaders that:

  • can be invoked before and/or after fragment shading
  • can read from and write to all render targets
  • can kill a fragment and/or generate a coverage mask for it (to avoid aliasing..)

For example a stencil buffer would be just another render target (don’t forget we have had support for multiple render targets for years now), and these shaders could be automatically linked by the driver to the main fragment shader or kept separate and executed in multiple stages. I have to admit that while I’m writing these few lines I have something like Larrabee and its software renderer in mind, but I wouldn’t be surprised if two years from now the rest of the graphics hardware landscape ends up being much more similar to Larrabee than to current GPUs.
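To make the proposal a bit more tangible, here is a toy CPU-side sketch in Python of what such generic stages could look like. Everything in it is hypothetical: the `shade_fragment` driver, the `pre`/`post` hooks and the dictionary of render targets are names I made up for illustration, not part of any real API. The pre-shading stage emulates a stencil "equal" test by reading an ordinary render target, while the post-shading stage implements a custom "max" blend and a stencil-style increment.

```python
def stencil_equal_pre_pass(targets, x, y, ref=1):
    """Pre-shading stage: emulates a stencil 'equal' test on an ordinary
    render target. Returning False kills the fragment."""
    return targets["stencil"][y][x] == ref


def max_blend_post_pass(targets, x, y, color):
    """Post-shading stage: a custom per-channel 'max' blend, plus a
    stencil-style increment written to the same ordinary render target."""
    dst = targets["color"][y][x]
    targets["color"][y][x] = tuple(max(s, d) for s, d in zip(color, dst))
    targets["stencil"][y][x] += 1


def shade_fragment(targets, x, y, fragment_shader, pre=None, post=None):
    """Hypothetical driver loop for one fragment: optional pre stage,
    the main fragment shader, then an optional post stage."""
    if pre is not None and not pre(targets, x, y):
        return False  # fragment killed before shading
    color = fragment_shader(x, y)
    if post is not None:
        post(targets, x, y, color)
    else:
        targets["color"][y][x] = color  # default: plain overwrite
    return True


# Usage: a 2x2 frame buffer where only pixel (0, 0) is marked in the mask.
targets = {
    "color": [[(0.0, 0.0, 0.0)] * 2 for _ in range(2)],
    "stencil": [[1, 0], [0, 0]],
}
shader = lambda x, y: (0.5, 0.25, 1.0)
drawn = shade_fragment(targets, 0, 0, shader,
                       pre=stencil_equal_pre_pass, post=max_blend_post_pass)
killed = shade_fragment(targets, 1, 0, shader,
                        pre=stencil_equal_pre_pass, post=max_blend_post_pass)
```

Nothing here distinguishes the "stencil" target from the color one except the code that reads and writes it, which is the whole point.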

Final Words

Even barring incidental breakthroughs in display devices, I believe no one knows for sure how we will do graphics 10-15 years from now. That’s why it is hard to disagree with Mr. Sweeney when he notes that the next few years are going to be very exciting for engineers and researchers!

Another day, another HDR rendering trick and some hope for the future.

Today I’m going to talk about an idea I came up with in Boston, at Siggraph 2006, while attending a couple of very inspiring lectures given by Jason Mitchell, Gary McTaggart and Chris Green (Valve Software).

They have played over the years with a few different HDR rendering schemes and one of the key insights from their work is that we can happily de-couple exposure and tone mapping computations, deferring the latter to the next frame (actually this idea was first suggested to me by Simon Brown, that’s another story though..).

This simple concept allowed Valve guys to remove the classic full screen tone mapping rendering pass and to embed it directly in their single pass shaders using as exposure a value computed in the previous frames, thus completely eliminating the need to output HDR pixels.

Since at that time there was basically no hardware around that could handle MSAA on floating point render targets (oh gosh, just a few years ago!) they also got their HDR rendering implementation running with MSAA on relatively old hardware! Moreover their method executes tone mapping and MSAA resolve in the correct order (tone mapping first, followed by AA resolve) with no extra performance cost, something that a lot of modern games still can’t get right today.
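A tiny worked example shows why the order matters. This is only a sketch in Python: the simple Reinhard operator x / (1 + x) stands in for whatever tone mapping a game actually uses, the resolve is a plain box filter, and the sample values are made up.

```python
def reinhard(x):
    """Simple Reinhard tone mapping operator: maps [0, inf) into [0, 1)."""
    return x / (1.0 + x)


def resolve(samples):
    """Box-filter MSAA resolve: plain average of the sub-samples."""
    return sum(samples) / len(samples)


# Hypothetical 4x MSAA edge pixel: two samples land on a bright light
# source, the other two on a dark wall behind it.
samples = [16.0, 16.0, 0.25, 0.25]

# Correct order: tone map every sample, then average the displayable values.
tone_map_then_resolve = resolve([reinhard(s) for s in samples])

# Wrong order: average the HDR values first, then tone map the result.
resolve_then_tone_map = reinhard(resolve(samples))
```

With these numbers the first ordering gives roughly 0.57, a believable half-covered edge between 0.94 (fully lit) and 0.2 (wall). The second gives roughly 0.89, almost as bright as a fully covered pixel, so the edge still looks aliased even though MSAA is on.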

If you were not aware of Valve’s method you are now probably asking yourself how they managed to compute an exposure value to be used in a tone mapping operator if no HDR data is ever dumped to the frame buffer. Through image segmentation techniques they ‘simply’ try to determine if the previous frame has been under or over exposed and a new exposure value is adjusted to compensate for problems with the previous frame(s).

While this method is very clever I have some problems with it. For instance many tone mapping operators determine exposure by computing the average logarithmic luminance of a relevant portion of the image, but it’s not possible to reliably determine this value using Valve’s approach. HDR data is lost, and while in theory we might be able to compute a plausible exposure value by performing a search over multiple frames, in practice this is not easy at all. We might need to change the search direction over the exposure space to get closer to the exposure value we are looking for, and this would make the overall brightness of our image swing back and forth for a few frames, like a pendulum around its rest position. Monotonic searches are possible too, but they can get us only so close to the value we are looking for, especially if the image content is constantly changing!
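To make the "average logarithmic luminance" requirement concrete, here is a minimal Python sketch of how a Reinhard-style operator derives exposure from it. The key value of 0.18 (middle grey) and the small `delta` that guards against log(0) are conventional choices I am assuming here, and the function names are mine.

```python
import math


def average_log_luminance(luminances, delta=1e-4):
    """Mean of log2(luminance) over the pixels of interest.
    delta avoids taking the logarithm of zero on black pixels."""
    return sum(math.log2(delta + lum) for lum in luminances) / len(luminances)


def exposure_from_average(avg_log2_lum, key=0.18):
    """Scale factor that maps the log-average luminance to the key value."""
    return key / (2.0 ** avg_log2_lum)


# Usage: four pixels spanning six stops.
pixels = [0.125, 0.5, 2.0, 8.0]
avg = average_log_luminance(pixels, delta=0.0)
exposure = exposure_from_average(avg)
```

Here the log-average works out to 2^0 = 1.0, so the exposure is simply the key value, whereas the arithmetic mean of the same pixels is about 2.66 and would be dominated by the brightest one. This is exactly the quantity that is hard to recover once HDR data has been thrown away.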

Having debated this issue with current and former colleagues I know this is a controversial point: some agree with me, some think it is not a big deal (and who knows, maybe they are even right!). On the other hand, while playing Valve’s masterpieces (this method was first introduced in HL2 Lost Coast) I can’t stop noticing how sometimes portions of the image are flat and seem to have lost their color detail, giving me an overall flat and over- or under-saturated feeling (again, this is just a very personal and subjective opinion, feel free to disagree with me). This problem might be caused by a poor/overly simplified tone mapping operator (Valve games run great on not so powerful hardware and trade-offs have to be made) and/or by an incorrect exposure (gotcha!).

After this long introduction I wouldn’t be surprised if you have already had the same idea as me: get rid of the exposure search based on feedback from previous frames and compute it the proper way!

The feedback/image segmentation method has been adopted because no HDR data is available, but even without re-introducing a floating point buffer (or some funky color space technique, see Christer Ericson’s blog entry about some of the work I did on Heavenly Sword and his very clever take on it) we can still generate the data we need using destination alpha. The idea is simple: compute logarithmic luminance on a per pixel basis, encode it in some special format and output it to the alpha channel. 

If we decide to support a certain luminance range [2^-minLogLum, 2^maxLogLum] we can compress and encode logarithmic luminance in our single pass shaders using some fairly simple math:

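// Precomputed constants: map log luminance from [-minLogLum, maxLogLum] to [0, 1].
float invLogLumRange = 1.0 / (maxLogLum + minLogLum);
float logLumOffset = minLogLum * invLogLumRange;
// Per pixel: a single scalar MADD applied to the log luminance.
float log_luminance = get_log_luminance( HDR_color ) * invLogLumRange + logLumOffset;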

invLogLumRange and logLumOffset are constants that can be precomputed, so we just need a 3-way dot product, a scalar MADD and a logarithm to evaluate this formula. Explicitly clamping this expression between 0.0 and 1.0 is not necessary as the ROPs will do it for free anyway.
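For those who want to play with the numbers on the CPU, here is a rough Python model of what ends up in the alpha channel. It is a sketch under several assumptions of mine: a 16-stop range (minLogLum = maxLogLum = 8), Rec. 709 luminance weights for the dot product, a base-2 logarithm, and an 8-bit alpha channel whose clamp and quantization I emulate by hand.

```python
import math

MIN_LOG_LUM = 8.0  # assumed: darkest supported luminance is 2^-8
MAX_LOG_LUM = 8.0  # assumed: brightest supported luminance is 2^8

# Constants precomputed once, as in the shader version.
INV_LOG_LUM_RANGE = 1.0 / (MAX_LOG_LUM + MIN_LOG_LUM)
LOG_LUM_OFFSET = MIN_LOG_LUM * INV_LOG_LUM_RANGE


def get_log_luminance(rgb):
    """3-way dot product (Rec. 709 weights, an assumption) followed by log2."""
    r, g, b = rgb
    lum = 0.2126 * r + 0.7152 * g + 0.0722 * b
    # Python raises on log2(0), so clamp to the darkest supported value;
    # on the GPU the output clamp would hide this case.
    return math.log2(max(lum, 2.0 ** -MIN_LOG_LUM))


def encode_alpha(rgb):
    """Encoded log luminance as it would land in an 8-bit alpha channel."""
    v = get_log_luminance(rgb) * INV_LOG_LUM_RANGE + LOG_LUM_OFFSET  # scalar MADD
    v = min(max(v, 0.0), 1.0)   # clamp to [0, 1], done for free by the ROPs
    return int(v * 255.0 + 0.5)  # round to the nearest 8-bit code


# Usage: black, a colour two stops above 1.0, and something out of range.
codes = [encode_alpha(c) for c in [(0.0, 0.0, 0.0), (4.0, 4.0, 4.0), (1000.0, 1000.0, 1000.0)]]
```

With these particular constants one 8-bit step corresponds to 16/255, or about 0.063 stops, which gives an idea of the precision available before any averaging takes place.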

Since we only applied an affine transform to encode our log luminance, it is still correct to compute its average with multiple reduction passes, as we do when we generate a mip map chain, down to a 1×1 render target, as long as we remember to invert the encoding to retrieve a proper average logarithmic luminance value. Actually it’s a good idea to do this last step on the CPU (since this computation can be deferred by one or two frames we should be able to lock this specific resource and read it back with the CPU without stalling either processor) so that we can set our exposure for the next frame’s color pass as a pixel shader constant, removing any extra math and the texture sampling from the 1×1 log luminance texture.
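The reason this works is that averaging commutes with an affine map: the mean of a·x + b is a·mean(x) + b. Below is a small Python sketch that mimics the reduction passes on the CPU and then inverts the encoding. It assumes a square, power-of-two image stored as nested lists, ignores 8-bit quantization, and uses helper names of my own invention.

```python
def downsample_2x2(image):
    """One reduction pass: average every 2x2 block, like one mip level."""
    h, w = len(image), len(image[0])
    return [[(image[y][x] + image[y][x + 1] +
              image[y + 1][x] + image[y + 1][x + 1]) * 0.25
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def reduce_to_1x1(image):
    """Repeat the reduction until a single value is left.
    Assumes a square image with power-of-two dimensions."""
    while len(image) > 1:
        image = downsample_2x2(image)
    return image[0][0]


def decode_log_luminance(encoded, min_log_lum, max_log_lum):
    """Inverse of the affine encoding: back to log luminance."""
    return encoded * (max_log_lum + min_log_lum) - min_log_lum


# Usage: encode a 4x4 block of log luminance values, reduce, then decode.
MIN_LOG_LUM, MAX_LOG_LUM = 8.0, 8.0
log_lums = [[-2.0, 0.0, 1.0, 3.0],
            [4.0, -1.0, 0.5, 2.0],
            [-3.0, 0.0, 0.0, 1.0],
            [6.0, -4.0, 2.5, -2.0]]
encoded = [[(v + MIN_LOG_LUM) / (MAX_LOG_LUM + MIN_LOG_LUM) for v in row]
           for row in log_lums]
avg_log_lum = decode_log_luminance(reduce_to_1x1(encoded), MIN_LOG_LUM, MAX_LOG_LUM)
```

The decoded value matches the plain mean of the original log luminance values (0.5 for this block), so the single decode at the end is all the CPU has to do before turning it into an exposure constant.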

Unfortunately almost no trick comes for free: if we use destination alpha to encode logarithmic luminance we can’t use it for other useful operations such as alpha blending and alpha to coverage (alpha test is still doable as long as we implement it in our shaders by invoking kill() or discard()). I’m not particularly worried about alpha blending: we can simply compute our average logarithmic luminance before we render transparent objects. Those won’t contribute to the exposure computation, but I suspect this is not a big deal in many cases. The same trick can be applied to alpha to coverage objects, though I wouldn’t advocate it if we know we are going to render a lot of alpha to coverage stuff on screen (for example think about lots of trees, it’s probably not going to work well if we are working on a Robin Hood game..)

Now we are free to implement a lot of different tone mapping operators in our single pass shaders, even if we are working on a deferred renderer, as long as its architecture can shade an opaque pixel for an arbitrary number of lights in a single pass, like in the ingenious scheme proposed by Pål-Kristian Engstad at Naughty Dog.

One last note: while I love (and I always will..) finding new and unexpected ways to use graphics hardware, it’s clear to me things are going to change soon, very soon. Shaders allow us to do almost anything, but they are still encapsulated in a rendering pipeline that dates back to the late 80s and that has gone almost unchanged for the last twenty years. When I was a student I used to write my own rendering pipelines (my beloved Amiga didn’t have a GPU..), which weren’t always based on z-buffer and rasterization (though I wrote so many rasterizers I lost count of them..), and I’m glad of the cyclical nature of hardware development, as we are now about to go back to the future and once again develop our own custom rendering architectures on top of the advancements of recent years. Only this time it is going to be even more fun!