Posts Tagged ‘AMD’

Inside AMD 7900 ‘Leo’ demo: A bit more about their tile based forward lighting

February 6, 2012

After unpacking their demo and debugging it with GPUPerfStudio, I found that my previous understanding of their method was quite close to the truth :). Their implementation of light culling, however, looks much more efficient (<0.1 ms in the demo), and here are the key differences. Note that the demo team wrote a lot of experimental code paths, but many of them ended up unused, so I’m only going to talk about what is actually running in the final demo.

  • There are actually two compute shaders invoked per frame: one computes the frustum for each tile, the other does the culling. Intel’s demo, in contrast, does Z reduction, frustum computation, and lighting in one shader.
  • Differences in thread dispatching:
    – 32 * 32 pixels for each tile (vs 16 * 16);
    – 64 threads in the thread group for each tile (vs 16 * 16 threads);
    – Groups are dispatched in 1D ( TileCount, 1, 1 ) (vs 2D ( TileCol, TileRow, 1 )).
  • They use a 1D buffer for storing light indices.
  • They use two buffers for light properties, one for position and range, one for color, so there is less data to bind for the culling pass.
  • They are not culling with Z reduction. (There is actually a Z pre-pass run in the demo, and they have a branch that does the Z reduction, but it is not used, at least in this release. Maybe they are going to provide an option to toggle it in a later release.)
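
To make the 1D dispatch concrete, here is a minimal CPU-side sketch in Python (not the demo's shader code) of how a 1D group index can be mapped back to a tile position. The names `tile_grid` and `tile_origin` are hypothetical, and the row-major tile order is my assumption; the demo may number its tiles differently.

```python
TILE_SIZE = 32  # pixels per tile edge, as observed in the demo


def tile_grid(width, height, tile_size=TILE_SIZE):
    """Return (columns, rows) of tiles, rounding up to cover partial edge tiles."""
    cols = (width + tile_size - 1) // tile_size
    rows = (height + tile_size - 1) // tile_size
    return cols, rows


def tile_origin(group_id, width, height, tile_size=TILE_SIZE):
    """Map a 1D thread-group index to the top-left pixel of its tile.

    Row-major ordering is an assumption made for this sketch.
    """
    cols, _ = tile_grid(width, height, tile_size)
    return (group_id % cols) * tile_size, (group_id // cols) * tile_size


cols, rows = tile_grid(1280, 720)
tile_count = cols * rows  # groups to dispatch as (TileCount, 1, 1)
print(tile_count, tile_origin(41, 1280, 720))
```

With 32 * 32 tiles a 720p frame needs a quarter as many groups as with 16 * 16 tiles, which may be part of why the culling pass is so cheap.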

I tried these hints in my test program and the result clearly improved. I still kept depth reduction in the culling shader. Since there are now 64 threads for 32 * 32 pixels, each thread can first calculate a minZ/maxZ over its own 16 pixels, and only then do the interlocked min/max operations that write to group shared memory. I assume this helps performance, since it means far fewer atomic operations. The same idea could apply when writing light indices to group shared memory.
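
The two-level reduction can be sketched on the CPU like this. It is a serial Python simulation, not shader code: `tile_depth_bounds` is a hypothetical name, the strided pixel-to-thread assignment is an assumption, and the plain `min`/`max` on the group variables stands in for the interlocked operations on group shared memory.

```python
TILE_SIZE = 32
THREADS_PER_GROUP = 64
PIXELS_PER_THREAD = TILE_SIZE * TILE_SIZE // THREADS_PER_GROUP  # 16


def tile_depth_bounds(tile_depths):
    """Return (minZ, maxZ) for one tile given its 32 * 32 depth values as a flat list."""
    assert len(tile_depths) == TILE_SIZE * TILE_SIZE
    # These two values play the role of group shared memory.
    group_min, group_max = float("inf"), float("-inf")
    for thread in range(THREADS_PER_GROUP):
        # Each thread first reduces its own 16 pixels locally
        # (strided assignment is an assumption of this sketch).
        own = tile_depths[thread::THREADS_PER_GROUP]
        local_min, local_max = min(own), max(own)
        # One interlocked min/max pair per thread instead of one per pixel.
        group_min = min(group_min, local_min)
        group_max = max(group_max, local_max)
    return group_min, group_max


depths = [i / 1024 for i in range(TILE_SIZE * TILE_SIZE)]
print(tile_depth_bounds(depths))
```

This way a tile issues 64 atomic min/max pairs rather than 1024.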

Here is a scene with 1024 big lights and fair depth complexity, running at around 200 fps, with light culling taking 0.9 ms according to GPUPerfStudio:

Thanks for the Rock model exported from UDK and used in the test; copyright belongs to the original author. And thanks to AMD’s demo team for this great implementation. As mentioned in the last post, I believe the true power of this approach is the ability to use much more complicated shading models, and I hope this could be “the way DX11’s meant to be played”…

A new era of forward shading is coming?

January 30, 2012

A new era of forward shading is coming?! A glance at forward shading for massive lights with compute shader light culling, from the HD 7970 demo ‘Leo’.

The past few years have been dominated by deferred shading. It seems everyone likes lights, massive lights: from STALKER to Killzone and StarCraft II. Crytek said they are “a bit more deferred” in CE3, and even Unreal is doing deferred shading in their ‘Samaritan’ demo. DICE, the explorers, did a good job of doing ‘tile-based’ deferred shading with compute shaders in their Frostbite 2 engine. However, they didn’t realize that this could bring an end to the era of deferred shading.

In the recently released demo for their new HD 7970 card, AMD managed to do forward shading with massive lights, using a compute shader. After downloading their demo and experimenting with an existing piece of code from Intel, I think I get the point, so let’s share the idea. For those who aren’t familiar with compute shader tile-based deferred shading, I’d suggest looking at Intel’s article first. My test is based on their code too.

My implementation is quite similar to DICE’s tile-based deferred shading. However, instead of writing a G-buffer in the geometry pass, I just disable color writes and do a Z-only pass. Then I run the tile-based compute shader, which performs these steps:

1. Sample the Z buffer.

2. Calculate the min-Z and max-Z for each 16*16 tile (one thread group) with interlocked operations, and store them in group-shared memory. (However, you could just use the near and far clip planes instead of the steps above, for a purely ‘forward’ approach.)

3. For each light, test whether the light intersects the frustum of the current tile. (This runs in parallel across the threads in the group, so if you have 1024 lights, each of the 256 threads only does 4 intersection tests.)

(So far this is the same as the original tile-based deferred shading.)

4. Atomically increment the group-shared light counter, and write the light index to LightIndexUAV (an unordered access view) at the location calculated from the light counter. The LightIndexUAV is the same size as the viewport (it probably needs padding at the edges), with format R16_UINT, so there are 16*16 texels per tile in which to store light indices.
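
Steps 3 and 4 can be sketched on the CPU as follows. This is a serial Python simulation under my own assumptions, not the actual shader: lights are treated as spheres in view space, the camera looks down +Z, the tile is given by its bounds in normalized device coordinates, and `p00`/`p11` are the two diagonal scale terms of a symmetric perspective projection. All function names are hypothetical.

```python
import math

GROUP_THREADS = 256        # 16 * 16 threads per tile
MAX_LIGHTS_PER_TILE = 256  # 16 * 16 R16_UINT texels per tile
END_OF_LIST = 0xFFFF       # clear value that terminates a tile's list


def tile_planes(x0, x1, y0, y1, p00, p11):
    """Four inward-facing side planes (all through the eye) for a tile's NDC bounds."""
    raw = [(p00, 0.0, -x0), (-p00, 0.0, x1), (0.0, p11, -y0), (0.0, -p11, y1)]
    planes = []
    for n in raw:
        length = math.sqrt(sum(c * c for c in n))
        planes.append(tuple(c / length for c in n))
    return planes


def sphere_hits_tile(center, radius, planes, min_z, max_z):
    """Conservative sphere vs. tile-frustum test (may accept some near misses)."""
    if center[2] + radius < min_z or center[2] - radius > max_z:
        return False
    return all(sum(n[i] * center[i] for i in range(3)) >= -radius for n in planes)


def cull_tile(lights, planes, min_z, max_z):
    """lights: list of (center, radius). Returns the tile's light index list."""
    tile_list = [END_OF_LIST] * MAX_LIGHTS_PER_TILE
    counter = 0  # stands in for the group-shared light counter
    for thread in range(GROUP_THREADS):
        # Thread t tests lights t, t + 256, t + 512, ...
        for index in range(thread, len(lights), GROUP_THREADS):
            center, radius = lights[index]
            if sphere_hits_tile(center, radius, planes, min_z, max_z):
                if counter < MAX_LIGHTS_PER_TILE:  # drop lights beyond capacity
                    tile_list[counter] = index
                counter += 1
    return tile_list


planes = tile_planes(0.0, 0.5, 0.0, 0.5, p00=1.0, p11=1.0)
lights = [((1.0, 1.0, 4.0), 0.5), ((-3.0, 0.0, 4.0), 0.5), ((1.0, 1.0, 20.0), 1.0)]
result = cull_tile(lights, planes, min_z=1.0, max_z=10.0)
print(result[:3])
```

On the GPU the order of indices within a tile depends on which thread wins the atomic increment, so the list order is not deterministic there; the serial loop above hides that.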

Then unbind the UAV, bind it as an SRV, and run forward shading for each renderable object. This time we keep sampling from the LightIndexSRV for the tile the current pixel lies in until we reach the clear value (0xffff in my case). For each sampled light index, we fetch the actual light data from the light buffer and accumulate the lighting as in normal forward shading.
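
The per-pixel loop in the forward pass looks roughly like this. Again this is a Python sketch rather than the pixel shader: `shade_pixel` is a hypothetical name, and the diffuse term with linear falloff is a placeholder I chose for illustration, not the shading model used in either demo.

```python
import math

END_OF_LIST = 0xFFFF
MAX_LIGHTS_PER_TILE = 256


def shade_pixel(position, normal, tile_list, lights):
    """Accumulate lighting from the lights listed for this pixel's tile.

    lights: list of (light_position, light_range, light_color) tuples.
    """
    result = [0.0, 0.0, 0.0]
    for index in tile_list[:MAX_LIGHTS_PER_TILE]:
        if index == END_OF_LIST:  # reached the clear value: no more lights
            break
        light_pos, light_range, color = lights[index]
        to_light = [light_pos[i] - position[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in to_light))
        if dist == 0.0 or dist >= light_range:
            continue
        n_dot_l = max(0.0, sum(normal[i] * to_light[i] for i in range(3)) / dist)
        falloff = 1.0 - dist / light_range  # placeholder attenuation
        for c in range(3):
            result[c] += color[c] * n_dot_l * falloff
    return result


lights = [
    ((0.0, 2.0, 0.0), 4.0, (1.0, 0.5, 0.0)),
    ((0.0, 1.0, 0.0), 4.0, (0.0, 0.0, 1.0)),  # culled for this tile, so never read
]
tile_list = [0, END_OF_LIST]
print(shade_pixel((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), tile_list, lights))
```

Because the loop body is ordinary forward shading, any material model can be dropped in here without changing the culling pass.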

Finally we get a shaded scene (there’s no difference, right? But notice the ‘No cull forward’):

Performance (in fps; Power Plant scene, camera at the initial position, 720p, 1024 lights, on a Radeon HD 6870):

MSAA        off     2x      4x      8x
Forward     161     139     122     91
Deferred    262     159     123     69

One thing that makes me unhappy is that the performance does not beat deferred shading with no or low MSAA: forward only pulls level at 4x and ahead at 8x. The approach still needs optimization. However, it does give us the possibility of using native MSAA and more complex shading models, right?
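
Frame rates are hard to compare directly, so here is a small Python snippet that converts the table above into frame times. It only uses the measured fps values from the table; `frame_ms` is just a helper name for this sketch.

```python
fps = {
    "Forward":  {"off": 161, "2x": 139, "4x": 122, "8x": 91},
    "Deferred": {"off": 262, "2x": 159, "4x": 123, "8x": 69},
}


def frame_ms(frames_per_second):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / frames_per_second


for msaa in ("off", "2x", "4x", "8x"):
    forward = frame_ms(fps["Forward"][msaa])
    deferred = frame_ms(fps["Deferred"][msaa])
    print(f"{msaa:>3}: forward {forward:5.2f} ms, deferred {deferred:5.2f} ms, "
          f"difference {forward - deferred:+5.2f} ms")
```

A positive difference means forward is slower at that MSAA level.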

The code is attached here if you want to try it or help me with optimization (please rename it to .zip). You need to download Intel’s demo and replace these files. The media files are just too big to upload here, sorry.

By the way, this is not meant to start a war between deferred and forward. I am a fan of deferred shading myself, LOL.

Finally, is the ‘Leo’ demo making fun of somebody?