<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Why GPUs are not (so) good at post processing images</title>
	<atom:link href="http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/feed/" rel="self" type="application/rss+xml" />
	<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/</link>
	<description></description>
	<lastBuildDate>Mon, 12 Oct 2009 22:17:15 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Marco Salvi</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-118</link>
		<dc:creator>Marco Salvi</dc:creator>
		<pubDate>Sat, 22 Mar 2008 14:58:03 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-118</guid>
		<description>Hi Cristina,

It&#039;s not easy to find public documentation or a paper that shows how a texture cache works in a GPU; most of the material I have read so far is covered by an NDA.
As far as I understand, this work:
http://portal.acm.org/citation.cfm?id=264152&amp;dl=GUIDE&amp;dl=ACM
describes a cache architecture which is kind of similar to what you can find in modern GPUs.

Marco</description>
		<content:encoded><![CDATA[<p>Hi Cristina,</p>
<p>It&#8217;s not easy to find public documentation or a paper that shows how a texture cache works in a GPU; most of the material I have read so far is covered by an NDA.<br />
As far as I understand, this work:<br />
<a href="http://portal.acm.org/citation.cfm?id=264152&amp;dl=GUIDE&amp;dl=ACM" rel="nofollow">http://portal.acm.org/citation.cfm?id=264152&amp;dl=GUIDE&amp;dl=ACM</a><br />
describes a cache architecture which is kind of similar to what you can find in modern GPUs.</p>
<p>Marco</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cristina</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-117</link>
		<dc:creator>Cristina</dc:creator>
		<pubDate>Mon, 17 Mar 2008 17:08:11 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-117</guid>
		<description>Dear Mr.,

I am looking for good material about texture caching on GPUs. I am unsure how it is actually done: by lines/columns or by regions. Can you point me to some?

Thanks for your help,
Cristina</description>
		<content:encoded><![CDATA[<p>Dear Mr.,</p>
<p>I am looking for good material about texture caching on GPUs. I am unsure how it is actually done: by lines/columns or by regions. Can you point me to some?</p>
<p>Thanks for your help,<br />
Cristina</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Manny Ko</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-115</link>
		<dc:creator>Manny Ko</dc:creator>
		<pubDate>Thu, 06 Mar 2008 00:00:27 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-115</guid>
		<description>I cannot agree with you more, Marco.</description>
		<content:encoded><![CDATA[<p>I cannot agree with you more, Marco.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: purpledog</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-60</link>
		<dc:creator>purpledog</dc:creator>
		<pubDate>Mon, 24 Sep 2007 10:50:36 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-60</guid>
		<description>I&#039;d like to see some numbers for non-separable kernels as well... The whole explanation of the &quot;apron&quot; is really very interesting (this guy knows how to make clear figures), but I was kind of disappointed to realize that the code was actually 1D (convolutionSeparable.pdf, page 9).</description>
		<content:encoded><![CDATA[<p>I&#8217;d like to see some numbers for non-separable kernels as well&#8230; The whole explanation of the &#8220;apron&#8221; is really very interesting (this guy knows how to make clear figures), but I was kind of disappointed to realize that the code was actually 1D (convolutionSeparable.pdf, page 9).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ignacio Castano</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-52</link>
		<dc:creator>Ignacio Castano</dc:creator>
		<pubDate>Sat, 22 Sep 2007 01:02:49 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-52</guid>
		<description>Andrew: I wouldn&#039;t say that NVIDIA&#039;s CUDA is a &quot;low-level&quot; API; quite the opposite. It lets you use a high-level language derived from C. There are just some extensions to indicate whether functions are compiled for the device or the host, and to tell the compiler in which memory to allocate variables.

It&#039;s possible to use shared memory to optimize image processing operations the way Marco describes; in fact, we have a few SDK examples showing that:

http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

In particular, the separable convolution example comes with a white paper that shows how to use shared memory to save bandwidth by loading elements shared by different pixels only once:

http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf</description>
		<content:encoded><![CDATA[<p>Andrew: I wouldn&#8217;t say that NVIDIA&#8217;s CUDA is a &#8220;low-level&#8221; API; quite the opposite. It lets you use a high-level language derived from C. There are just some extensions to indicate whether functions are compiled for the device or the host, and to tell the compiler in which memory to allocate variables.</p>
<p>It&#8217;s possible to use shared memory to optimize image processing operations the way Marco describes; in fact, we have a few SDK examples showing that:</p>
<p><a href="http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html" rel="nofollow">http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html</a></p>
<p>In particular, the separable convolution example comes with a white paper that shows how to use shared memory to save bandwidth by loading elements shared by different pixels only once:</p>
<p><a href="http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf" rel="nofollow">http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Marco Salvi</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-51</link>
		<dc:creator>Marco Salvi</dc:creator>
		<pubDate>Sat, 22 Sep 2007 00:57:52 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-51</guid>
		<description>Hello Andy, thanks again for your insights!
Until the other day I didn&#039;t know much about RapidMind, but I recently had the chance to attend a small presentation about the technology you work on and I must say I was really impressed. It seems a lot of scientists at RapidMind do like CELL :) 
I&#039;m not sure if what I saw/heard is covered by some NDA, so I won&#039;t talk about it here, but AFAIK all the other companies working in the same field (e.g. Peakstream, Codeplay, etc.) are way behind you guys.
I totally agree with you that there&#039;s really no reason to expose CELL hw in the current &#039;naive&#039; way; we need far better abstractions for it (and not just because a lot of programmers, at least in my industry, just don&#039;t get it, no matter how hard you try to explain to them how it works...).</description>
		<content:encoded><![CDATA[<p>Hello Andy, thanks again for your insights!<br />
Until the other day I didn&#8217;t know much about RapidMind, but I recently had the chance to attend a small presentation about the technology you work on and I must say I was really impressed. It seems a lot of scientists at RapidMind do like CELL <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /><br />
I&#8217;m not sure if what I saw/heard is covered by some NDA, so I won&#8217;t talk about it here, but AFAIK all the other companies working in the same field (e.g. Peakstream, Codeplay, etc.) are way behind you guys.<br />
I totally agree with you that there&#8217;s really no reason to expose CELL hw in the current &#8216;naive&#8217; way; we need far better abstractions for it (and not just because a lot of programmers, at least in my industry, just don&#8217;t get it, no matter how hard you try to explain to them how it works&#8230;).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Lauritzen</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-45</link>
		<dc:creator>Andrew Lauritzen</dc:creator>
		<pubDate>Mon, 17 Sep 2007 15:18:24 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-45</guid>
		<description>Arun: Certainly the overhead of using a cache is something to be considered, but as I&#039;m not a hardware guy I can&#039;t really discuss the tradeoffs there. On G80 there definitely is a cost as you note to texture lookups even if they are cached (I&#039;ve actually found that cost to be even higher on ATI cards, but I&#039;ve not had a lot of time to play with the R600). It shouldn&#039;t be overstated though as we routinely see programs that access hundreds (and sometimes thousands!) of different elements, and the G80 still crunches through that fairly efficiently.

Marco: The person I spoke to may have been one of the authors on that paper, but unfortunately I don&#039;t remember his name :(

As I mentioned, it&#039;s easy to construct/observe cases (even simple convolution) where it&#039;s possible to use software-managed local memories to get speed increases (usually only on the order of 1-3 times in my experience, but that can be significant). Conversely, however, it&#039;s also easy to construct cases that perform poorly on Cell in particular. For example, programs with random-access reads of global memory - particularly within tight control flow - perform poorly compared to GPUs, even with a software texture cache in local memory. Thus things like (fragment) shading on the SPUs are actually fairly inefficient, although certainly GPUs have been optimized for this case :)

It&#039;s true that box filters used recursively can do many cool things (approaching Gaussian kernels), but actually the &quot;specialness&quot; of that case has one other implication: there are better ways to evaluate medium-to-large box filters than &quot;brute force&quot;. I&#039;d argue that the &quot;equivalent&quot; way (to your optimized local memories example) to evaluate a box filter on GPUs, for instance, is to build a summed-area table (separable parallel scan -&gt; O(N) for NxN elements) and then use it to evaluate arbitrarily large filters (O(1) per element). Indeed this implementation will probably surpass the speed of even the optimized &quot;local memories&quot; version for sufficiently large filters, and boasts excellent memory coherence for the case of constant-sized filters.

That said, the most efficient way to implement &quot;scan&quot; on GPUs right now (at least in CUDA) involves using the local memories :)

Anyways I&#039;m still a fan of software-managed caches (particularly at RapidMind we often *know* what data is coming up next, and thus can use - for example - Cell hardware very efficiently while we just have to trust GPUs to &quot;do the right thing&quot;), but it&#039;s not clear to me that they are necessarily superior on a theoretical level. They may be, but I need more convincing :)

The other orthogonal question is one of programming models. As I alluded to in the previous reply, there&#039;s no reason why local memories have to be exposed as they are on Cell. Indeed there are nicer programming abstractions that can very effectively scale to N-level memory hierarchies without any application-programmer intervention. It may yet be another case where a higher-level abstraction will eventually lead to more efficient code, once the compilers are sufficiently mature.

Anyways, interesting stuff :) Your blog so far has a 100% hit ratio on extremely relevant and thought-provoking material... keep it up!</description>
		<content:encoded><![CDATA[<p>Arun: Certainly the overhead of using a cache is something to be considered, but as I&#8217;m not a hardware guy I can&#8217;t really discuss the tradeoffs there. On G80 there definitely is a cost as you note to texture lookups even if they are cached (I&#8217;ve actually found that cost to be even higher on ATI cards, but I&#8217;ve not had a lot of time to play with the R600). It shouldn&#8217;t be overstated though as we routinely see programs that access hundreds (and sometimes thousands!) of different elements, and the G80 still crunches through that fairly efficiently.</p>
<p>Marco: The person I spoke to may have been one of the authors on that paper, but unfortunately I don&#8217;t remember his name <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p>As I mentioned, it&#8217;s easy to construct/observe cases (even simple convolution) where it&#8217;s possible to use software-managed local memories to get speed increases (usually only on the order of 1-3 times in my experience, but that can be significant). Conversely, however, it&#8217;s also easy to construct cases that perform poorly on Cell in particular. For example, programs with random-access reads of global memory &#8211; particularly within tight control flow &#8211; perform poorly compared to GPUs, even with a software texture cache in local memory. Thus things like (fragment) shading on the SPUs are actually fairly inefficient, although certainly GPUs have been optimized for this case <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>It&#8217;s true that box filters used recursively can do many cool things (approaching Gaussian kernels), but actually the &#8220;specialness&#8221; of that case has one other implication: there are better ways to evaluate medium-to-large box filters than &#8220;brute force&#8221;. I&#8217;d argue that the &#8220;equivalent&#8221; way (to your optimized local memories example) to evaluate a box filter on GPUs, for instance, is to build a summed-area table (separable parallel scan -&gt; O(N) for NxN elements) and then use it to evaluate arbitrarily large filters (O(1) per element). Indeed this implementation will probably surpass the speed of even the optimized &#8220;local memories&#8221; version for sufficiently large filters, and boasts excellent memory coherence for the case of constant-sized filters.</p>
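<p>A minimal sketch of the summed-area-table idea mentioned above, written as plain serial C++ rather than as a GPU scan. The names <code>SummedAreaTable</code>, <code>build_sat</code> and <code>box_sum</code> are hypothetical and not taken from any SDK; the sketch only illustrates that, once the table is built, each box sum costs four lookups regardless of the filter size.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: inclusive summed-area table for a
// w x h single-channel image stored in row-major order.
struct SummedAreaTable {
    int w = 0;
    int h = 0;
    std::vector<double> sums;  // entry (x, y) = sum of the image over [0..x] x [0..y]

    double at(int x, int y) const {
        // Anything left of or above the image contributes nothing.
        if (x < 0 || y < 0) return 0.0;
        return sums[static_cast<std::size_t>(y) * w + x];
    }
};

// Build the table with a running sum along each row, then along each
// column. These are the two separable passes; on a GPU each pass
// would be a parallel scan instead of a serial loop.
SummedAreaTable build_sat(const std::vector<double>& img, int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    SummedAreaTable sat;
    sat.w = w;
    sat.h = h;
    sat.sums = img;
    for (int y = 0; y < h; ++y) {
        for (int x = 1; x < w; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * w + x;
            sat.sums[i] += sat.sums[i - 1];
        }
    }
    for (int y = 1; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * w + x;
            sat.sums[i] += sat.sums[i - w];
        }
    }
    return sat;
}

// Sum of the image over the inclusive rectangle [x0..x1] x [y0..y1],
// computed from four table lookups.
double box_sum(const SummedAreaTable& sat, int x0, int y0, int x1, int y1) {
    assert(0 <= x0 && x0 <= x1 && x1 < sat.w);
    assert(0 <= y0 && y0 <= y1 && y1 < sat.h);
    return sat.at(x1, y1) - sat.at(x0 - 1, y1)
         - sat.at(x1, y0 - 1) + sat.at(x0 - 1, y0 - 1);
}
```

<p>For a 3x3 image holding the values 1 to 9 in row-major order, <code>box_sum(sat, 1, 1, 2, 2)</code> gives 28 (5 + 6 + 8 + 9); dividing by the box area turns the sum into a box-filter mean.</p>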
<p>That said, the most efficient way to implement &#8220;scan&#8221; on GPUs right now (at least in CUDA) involves using the local memories <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Anyways I&#8217;m still a fan of software-managed caches (particularly at RapidMind we often *know* what data is coming up next, and thus can use &#8211; for example &#8211; Cell hardware very efficiently while we just have to trust GPUs to &#8220;do the right thing&#8221;), but it&#8217;s not clear to me that they are necessarily superior on a theoretical level. They may be, but I need more convincing <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>The other orthogonal question is one of programming models. As I alluded to in the previous reply, there&#8217;s no reason why local memories have to be exposed as they are on Cell. Indeed there are nicer programming abstractions that can very effectively scale to N-level memory hierarchies without any application-programmer intervention. It may yet be another case where a higher-level abstraction will eventually lead to more efficient code, once the compilers are sufficiently mature.</p>
<p>Anyways, interesting stuff <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Your blog so far has a 100% hit ratio on extremely relevant and thought-provoking material&#8230; keep it up!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Marco Salvi</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-44</link>
		<dc:creator>Marco Salvi</dc:creator>
		<pubDate>Sun, 16 Sep 2007 06:20:41 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-44</guid>
		<description>to Andrew: Any chance the researcher you spoke to co-authored this &lt;a href=&quot;http://csl.stanford.edu/~christos/publications/2007.pmarch.isca.pdf&quot; rel=&quot;nofollow&quot;&gt;paper&lt;/a&gt;? We briefly discussed it on &lt;a href=&quot;http://forum.beyond3d.com/showthread.php?t=43997&quot; rel=&quot;nofollow&quot;&gt;Beyond3D&lt;/a&gt;, and the consensus was that it&#039;s not entirely clear whether the methodology they adopted for their tests makes complete sense (a cache should be less dense than a &#039;standard&#039; memory, but it seems they didn&#039;t account for that..)
While I agree that (wisely) using a cache instead of a programmer-managed memory is way simpler and effective (e.g. you quickly reach some decent performance), I find it difficult to believe that it can be as fast as a local store in the general case.

Moreover I have to agree with Uttar here: on G80 (and even more on R600..) your bottleneck is likely to be in the texture caches, while on CELL you can load a full vec4 per clock cycle and do some work on it knowing that no other event in the system can stop you from doing what you&#039;re doing. With G80 and its local-shared memory things are much more complicated if you don&#039;t want to find your ALUs stepping on each other&#039;s feet all the time. (yeah, ALUs do have feet :) )
Anyway given the company you work for I guess you know this stuff much better than me :)

I understand your point of view, but as a console programmer I&#039;m not scared of devoting a lot of time to fine-tune my algorithms for a specific game and platform. At the same time I can appreciate that this model doesn&#039;t work very well in many other industries.. thus the need for having solutions that might be less efficient but much more effective when real world constraints are applied. (That&#039;s why I think Intel is on to something with Larrabee and its programming model..)

to lycium: you are right, what I wrote only applies to box filtering algorithms, but I was obviously not trying to imply it works with everything. Still, current GPUs can&#039;t implement anything like that efficiently in one rendering pass, and we should also remember that box filters used recursively let us do a lot of &lt;a href=&quot;http://www.cs.cmu.edu/~ph/rif.ps.gz&quot; rel=&quot;nofollow&quot;&gt;nice things :)&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>to Andrew: Any chance the researcher you spoke to co-authored this <a href="http://csl.stanford.edu/~christos/publications/2007.pmarch.isca.pdf" rel="nofollow">paper</a>? We briefly discussed it on <a href="http://forum.beyond3d.com/showthread.php?t=43997" rel="nofollow">Beyond3D</a>, and the consensus was that it&#8217;s not entirely clear whether the methodology they adopted for their tests makes complete sense (a cache should be less dense than a &#8216;standard&#8217; memory, but it seems they didn&#8217;t account for that..)<br />
While I agree that (wisely) using a cache instead of a programmer-managed memory is way simpler and effective (e.g. you quickly reach some decent performance), I find it difficult to believe that it can be as fast as a local store in the general case.</p>
<p>Moreover I have to agree with Uttar here: on G80 (and even more on R600..) your bottleneck is likely to be in the texture caches, while on CELL you can load a full vec4 per clock cycle and do some work on it knowing that no other event in the system can stop you from doing what you&#8217;re doing. With G80 and its local-shared memory things are much more complicated if you don&#8217;t want to find your ALUs stepping on each other&#8217;s feet all the time. (yeah, ALUs do have feet <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  )<br />
Anyway given the company you work for I guess you know this stuff much better than me <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>I understand your point of view, but as a console programmer I&#8217;m not scared of devoting a lot of time to fine-tune my algorithms for a specific game and platform. At the same time I can appreciate that this model doesn&#8217;t work very well in many other industries.. thus the need for having solutions that might be less efficient but much more effective when real world constraints are applied. (That&#8217;s why I think Intel is on to something with Larrabee and its programming model..)</p>
<p>to lycium: you are right, what I wrote only applies to box filtering algorithms, but I was obviously not trying to imply it works with everything. Still, current GPUs can&#8217;t implement anything like that efficiently in one rendering pass, and we should also remember that box filters used recursively let us do a lot of <a href="http://www.cs.cmu.edu/~ph/rif.ps.gz" rel="nofollow">nice things <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lycium</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-42</link>
		<dc:creator>lycium</dc:creator>
		<pubDate>Sun, 16 Sep 2007 03:06:20 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-42</guid>
		<description>this analysis only holds true for the very specialised O(1) box filtering algorithm.</description>
		<content:encoded><![CDATA[<p>this analysis only holds true for the very specialised O(1) box filtering algorithm.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Arun Demeure</title>
		<link>http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-41</link>
		<dc:creator>Arun Demeure</dc:creator>
		<pubDate>Fri, 14 Sep 2007 16:09:13 +0000</pubDate>
		<guid isPermaLink="false">http://pixelstoomany.wordpress.com/2007/09/13/why-gpus-are-not-so-good-at-post-processing-images/#comment-41</guid>
		<description>Andrew: One worry I have with letting the GPU texture cache do its thing is that the problem is not just access patterns; it&#039;s also datapaths/number of requests per cycle.

With scratchpads like on G80 and CELL, you can retrieve a lot of data per cycle to your ALUs. It&#039;s practically as fast as the register file on G80... On the other hand, for every point sampled RGBA8 request on the 8800 GTX, you have the time to do 9.4 scalar operations (excluding the MUL). For scalar FP32, this is doubled, for reasons I will not go into here.

This is a topic that came up with Marco when I was discussing his shadowing algorithms with him. Even if he had 100% cache hits, my conclusion was that point sampling would be the bottleneck, not the ALUs.

I just looked at the CUDA convolution example, and indeed, this confirms my suspicion:
#if 0
            // try this to see the benefit of using shared memory
            int pixel = getPixel(g_data, x+dx, y+dy, imgw, imgh);
#else
            int pixel = SMEM(r+tx+dx, r+ty+dy);
#endif

            // only sum pixels within disc-shaped kernel
            float l = dx*dx + dy*dy;
            if (l &lt;= r*r) {
                float r = float(pixel&amp;0xff);
                float g = float((pixel&gt;&gt;8)&amp;0xff);
                float b = float((pixel&gt;&gt;16)&amp;0xff);
#if 1
                // brighten highlights
                float lum = (r + g + b) / (255*3);
                if (lum &gt; threshold) {
                    r *= highlight;
                    g *= highlight;
                    b *= highlight;
                }
#endif
                rsum += r;
                gsum += g;
                bsum += b;
                samples += 1.0;
            }</description>
		<content:encoded><![CDATA[<p>Andrew: One worry I have with letting the GPU texture cache do its thing is that the problem is not just access patterns; it&#8217;s also datapaths/number of requests per cycle.</p>
<p>With scratchpads like on G80 and CELL, you can retrieve a lot of data per cycle to your ALUs. It&#8217;s practically as fast as the register file on G80&#8230; On the other hand, for every point sampled RGBA8 request on the 8800 GTX, you have the time to do 9.4 scalar operations (excluding the MUL). For scalar FP32, this is doubled, for reasons I will not go into here.</p>
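<p>One plausible reading of the 9.4 figure, assuming the commonly published 8800 GTX specifications (128 scalar ALUs at a 1.35 GHz shader clock, and 32 texture address units at a 575 MHz core clock). These numbers are an assumption added for illustration and are not stated in the comment itself:</p>

```latex
\[
\frac{\text{scalar ops per second}}{\text{point-sampled requests per second}}
  = \frac{128 \times 1.35\ \text{GHz}}{32 \times 0.575\ \text{GHz}}
  = \frac{172.8}{18.4}
  \approx 9.4
\]
```

<p>Under that assumption, each ALU is counted as issuing one scalar operation per shader clock, which matches the "excluding the MUL" remark.</p>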
<p>This is a topic that came up with Marco when I was discussing his shadowing algorithms with him. Even if he had 100% cache hits, my conclusion was that point sampling would be the bottleneck, not the ALUs.</p>
<p>I just looked at the CUDA convolution example, and indeed, this confirms my suspicion:<br />
#if 0<br />
            // try this to see the benefit of using shared memory<br />
            int pixel = getPixel(g_data, x+dx, y+dy, imgw, imgh);<br />
#else<br />
            int pixel = SMEM(r+tx+dx, r+ty+dy);<br />
#endif</p>
<p>            // only sum pixels within disc-shaped kernel<br />
            float l = dx*dx + dy*dy;<br />
            if (l &lt;= r*r) {<br />
                float r = float(pixel&amp;0xff);<br />
                float g = float((pixel&gt;&gt;8)&amp;0xff);<br />
                float b = float((pixel&gt;&gt;16)&amp;0xff);<br />
#if 1<br />
                // brighten highlights<br />
                float lum = (r + g + b) / (255*3);<br />
                if (lum &gt; threshold) {<br />
                    r *= highlight;<br />
                    g *= highlight;<br />
                    b *= highlight;<br />
                }<br />
#endif<br />
                rsum += r;<br />
                gsum += g;<br />
                bsum += b;<br />
                samples += 1.0;<br />
            }</p>
]]></content:encoded>
	</item>
</channel>
</rss>
