Archive

Posts Tagged ‘Direct3D’

Geometry-Shader-Free Bokeh Depth of Field Rendering

2012年08月1日 Leave a comment

Geometry-Shader-Free Bokeh Depth of Field Rendering

Depth Of Field is a popular post-process effect in real-time renderings. There have been many researches to improve the visual effects and efficiency of DOF effect in recent years. The most fast and widely used methods are still based on Gathering due to the nature of graphics hardware. There have also been Scattering based methods like Heat-diffusion DOF (Used in Metro 2033), and textured-Quad based Bokeh DOF (Used in Unreal Samaritan Demo and Crysis 2). The latter method gives very nice view of artist tweak-able shaped Bokeh, as well as proper foreground leaking, which are difficult to achieve with other methods. However this method consumes high fillrate, also it requires geometry shader to expand the bokeh quads, which might be slow in some of the hardware (The Crytek guys mentioned “Future: avoid geometry shader usage, and I suppose that people as smart as them have already done that). In this post I’m going to put my own naive implementation of the Non-geometry-shader Bokeh DOF method, wishing to get your better ideas upon this topic.

The implementation is based on UAV atomic internal counter. We uses compute shader to collect the bokeh data from the scene depth/color render targets, and emit the data into an Unordered access buffer. Then read this buffer as input to bokeh vertex shader, accumulate bokeh color with pixel shader, with front/back layer rendered into different areas of the , at last resolve and combine to the final scene.

Resource used:

1. RWStructuredBuffer<BokehData> BokehDataUAV;

This structured buffer is used to store the bokeh information, including pixel position and color. The UAV needs to be create with D3D11_BUFFER_UAV_FLAG_COUNTER flag.

BokehData need keep two float4 values for both position and color, but we do not require 32bit precision. So we encode float4 into int2 with f32tof16() instrinc and decode in the latter vertex shader with f16tof32. Note that half4 doesn’t help here (although it will compile, but the value read by vertex shader is not correct, wondering if it is a bug that the compiler didn’t report the problem while they say “this data type is only provided for language compatibility” in the SDK document).

2. RWBuffer<uint> DrawArgsUAV;

This buffer is used to store the counter from the above bokeh data buffer, to be consumed by the latter draw call ( DrawInstancedIndirect ). This way we prevent the stall to read-back the counter data back to CPU.

3. Texture2D<float4> AccumulationRT;

Used to accumulate the quad colors. It covers two viewports, one for front layer, another for back layer, so its size should be ( HalfViewportWidth, HalfViewportHeight * 2 + MAX_COC_IN_PIXELS ).

Rendering:

The rendering process is fairly straight forward:

1. Compute shader stage 1: For each pixel in the input depth/color texture, compute COC from depth, then decide which layer it should lay in. For the front/back layers ( that abs(COC) > COC_THRESHOLD ) we emit them into the BokehDataUAV, and increase the buffer’s count:

void EmitBokehVertex( float4 colorcoc, float fDepth, uint nSamples, uint2 loc )

{

    uint count = BokehDataUAV.IncrementCounter();

    BokehData data;

    data.PositionEncoded = EncodeFloat4ToInt2(float4(

            float2(2, -2) * (loc * RcpFullViewportWH – 0.5f),

            fDepth,

            colorcoc.a));

    data.ColorEncoded = EncodeFloat4ToInt2(float4(colorcoc.rgb, 1.0f));

    BokehDataUAV[count] = data;

}

2. Compute shader stage 2: This is a very simple compute shader that copies the counter into the args buffer:

[numthreads(1,1,1)]

void cs_copyCounter()

{

    BokehCountUAV[0] = 4;

    BokehCountUAV[1] = BokehDataUAV.IncrementCounter();

    BokehCountUAV[2] = 0;

    BokehCountUAV[3] = 0;

}

3. Draw quads on to AccumulationRT. Draw with DrawInstancedIndirect(DrawArgsBuffer, 0), each instance for each quad. Using a simple 4-vertices quad VB as stream input, and the shader resource view of BokehData buffer as shader resource input, indexed by the SV_InstanceID. The vertex shader expands vertices with COC size, and put it into the proper viewport area. Since there’s no SV_ViewportArrayIndex writeable from vertex shader, we just use an MAD to move it with proper offset along the Y axis. (See image below, top for front layer, bottom for back layer.

void vs_bokeh(

    float4 vPos : POSITION,

    uint instanceId : SV_InstanceId,

    out float4 vPosOut : SV_Position,

    out float4 vColorOut : TEXCOORD0,

    out float3 vTCOut : TEXCOORD1)

{

    BokehData dataEncoded = BokehDataSRV[instanceId];

    float4 dataPosition = DecodeFloat4FromInt2(dataEncoded.PositionEncoded);

    float4 dataColor = DecodeFloat4FromInt2(dataEncoded.ColorEncoded);

    vPosOut.xy = vPos.xy * abs(dataPosition.w) * RcpFullViewportWH / 2 + dataPosition.xy;

    if(dataPosition.w < 0)

    {

        vPosOut.y = vPosOut.y * DualViewportOffset.x + DualViewportOffset.y;

    }

    else

    {

        vPosOut.y = vPosOut.y * DualViewportOffset.x – DualViewportOffset.y;

    }

    vPosOut.z = 0.5f;

    vPosOut.w = 1.0f;

    vColorOut = dataColor;

    vTCOut = float3((vPos.xy + 1.0f) * 0.5f, dataPosition.z);

}

accumbuf-dualviewport

4. Resolve the texture into final RT.

Performance:

The method removed the dependency on geometry shader, and it reduces the redundant vertex shader work. The compute shader stage is rather fast and is fill-rate bounded. With maximum 64 * 64 size across entire screen, it runs at 90 fps at 1080p (0.3 ms for CS, 8 ms for drawing quads), and 700 fps at 480p on my Radeon 6870. It seems that the RWBuffer’s internal counter ( with “IncreamentCounter” ) is much faster than atomic operations directly on the unordered access view ( with “InterlockedAdd”), I tried use Interlocked to ArgsBuffer directly add first, but it ends up with almose 30 ms to execute the compute shader.

Future work:

The composition of the 3 layers seems problematic to me, result in some artifacts that the back layer leaking or ugly front layer hard edges. The naïve blending method need to be improved. Also I want to implement the geometry shader version and compare the performance differences. Since this is not a scientific article, I’m just post the results got so far here to share the idea and would like to discuss with anyone interested in the topic.

ps: The SIGGRAPH 2012 is near! Wish to see their talented ideas about DOF and all the interesting things in real-time and unreal time renderings soon. \o/

Advertisements