Home > 计算机与 Internet, 未分类 > Inside AMD 7900 ‘Leo’ demo: A bit more about their tile based forward lighting

Inside AMD 7900 ‘Leo’ demo: A bit more about their tile based forward lighting

After some unzipping of their demo, and debugging with GPUPerfStudio, I found that my previous understanding of their method is quite close to the truth :). But their implementation, in terms of light culling, looks much more efficient (<0.1 ms in the demo). and here is the key difference. Note that the demo team did a lot of experimental branch code path, but many of them are not used at last. So I’m going to talk only about what is actually running in the final demo.

  • There are actually two Compute Shaders invoked across the frame. One for computing frustum for each tile, the other for culling. While in Intel’s demo they do Z reduction, frustum computing, and lighting in one shader.
  • Difference with thread dispatching:
    – 32 * 32 pixels for each tile, (vs 16 * 16);
    – 64 thread groups for each tile, (vs 16 * 16);
    – Dispatched groups is in 1D ( TileCount, 1, 1), (vs 2D ( TileCol, TileRow, 1 ))
  • They are using a 1D buffer for storing light indices.
  • They used two buffer for light properties, one for position and range, one for color. So there are less data to bind when doing culling pass.
  • They are not culling with Z-reduction. (There is actually a Z-pre pass ran for the demo, and they have a branch to do the Z reduction, but it is not used, at least for this release. Maybe they are going to provide some options to tweak it on off in the next releases.).

I tried with these hints in my test program and the result is obviously improved. I still kept depth reduction in the culling shader. Since now there are 64 threads for 32 * 32 pixels, I can calculate a minZ/maxZ for the 16 pixels within a thread, then do the interlocked min/max operation to write to group shared memory. I assume this is helpful for performance. The same could apply when writing light indices to group shared memory.

Here is a scene with 1024 big lights, and fairly depth complexity, run at around 200 fps, and light culling consumed 0.9ms according to GPUPerfStudio:

Thanks to the Rock model exported from UDK used in the test, copyright belongs to original author. And thanks to AMD’s demo team with this great implementation. As metioned in the last post, I believe the true power of this way is the  ability to use much more complicated shading models, and hope this could be “the way DX11’s meant to be played”…

  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: