Author Archive

Geometry-Shader-Free Bokeh Depth of Field Rendering

August 1, 2012


Depth of field (DOF) is a popular post-process effect in real-time rendering, and a lot of research in recent years has gone into improving its visual quality and efficiency. The fastest and most widely used methods are still gather-based, due to the nature of graphics hardware. There are also scatter-based methods, such as heat-diffusion DOF (used in Metro 2033) and textured-quad bokeh DOF (used in the Unreal Samaritan demo and Crysis 2). The latter gives very nice, artist-tweakable bokeh shapes as well as proper foreground leaking, both of which are difficult to achieve with other methods. However, it consumes a lot of fill rate, and it requires a geometry shader to expand the bokeh quads, which can be slow on some hardware. (The Crytek guys mentioned “Future: avoid geometry shader usage”, and I suppose people as smart as them have already done that.) In this post I present my own naive implementation of a bokeh DOF method without the geometry shader, hoping to hear your better ideas on the topic.

The implementation is based on a UAV’s internal atomic counter. We use a compute shader to collect the bokeh data from the scene depth/color render targets and emit it into an unordered access buffer. This buffer is then read as input by the bokeh vertex shader, and the pixel shader accumulates the bokeh colors, with the front and back layers rendered into different areas of the accumulation render target. Finally, the result is resolved and combined into the final scene.

Resources used:

1. RWStructuredBuffer<BokehData> BokehDataUAV;

This structured buffer stores the bokeh information, including pixel position and color. The UAV needs to be created with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.

BokehData needs to keep two float4 values, one for position and one for color, but we do not need 32-bit precision. So we encode each float4 into an int2 with the f32tof16() intrinsic and decode it later in the vertex shader with f16tof32(). Note that half4 doesn’t help here: it compiles, but the values read by the vertex shader are not correct. I wonder whether it is a bug that the compiler doesn’t report the problem, given that the SDK documentation says “this data type is only provided for language compatibility”.
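To make the packing concrete, here is a minimal CPU-side sketch in C++ of how two 16-bit halves can share one 32-bit integer. The names EncodeFloat4ToInt2 and DecodeFloat4FromInt2 mirror the shader helpers used in this post, but their bodies are not shown here, so the component layout (x and y in the first integer, z and w in the second) is an assumption, and FloatToHalf and HalfToFloat are hypothetical helpers. The conversion is deliberately simplified (mantissa truncated, denormals flushed to zero), so its rounding may differ from f32tof16().

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Simplified float -> half conversion (assumption: truncation, denormals
// flushed to zero, NaN collapsed to infinity).
inline std::uint16_t FloatToHalf(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::int32_t exponent =
        static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const std::uint32_t mantissa = bits & 0x007FFFFFu;
    if (exponent <= 0) return static_cast<std::uint16_t>(sign);             // too small: signed zero
    if (exponent >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);  // too large: infinity
    return static_cast<std::uint16_t>(
        sign | (static_cast<std::uint32_t>(exponent) << 10) | (mantissa >> 13));
}

inline float HalfToFloat(std::uint16_t h) {
    const std::uint32_t sign = (h & 0x8000u) << 16;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    const std::uint32_t mantissa = h & 0x03FFu;
    std::uint32_t bits = sign;  // zero (half denormals are treated as zero)
    if (exponent == 31) {
        bits = sign | 0x7F800000u | (mantissa << 13);               // infinity / NaN
    } else if (exponent != 0) {
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);  // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

using Float4 = std::array<float, 4>;
using Int2 = std::array<std::uint32_t, 2>;

// Assumed layout: low 16 bits hold the first component, high 16 bits the second.
inline Int2 EncodeFloat4ToInt2(const Float4& v) {
    return { FloatToHalf(v[0]) | (static_cast<std::uint32_t>(FloatToHalf(v[1])) << 16),
             FloatToHalf(v[2]) | (static_cast<std::uint32_t>(FloatToHalf(v[3])) << 16) };
}

inline Float4 DecodeFloat4FromInt2(const Int2& p) {
    return { HalfToFloat(static_cast<std::uint16_t>(p[0] & 0xFFFFu)),
             HalfToFloat(static_cast<std::uint16_t>(p[0] >> 16)),
             HalfToFloat(static_cast<std::uint16_t>(p[1] & 0xFFFFu)),
             HalfToFloat(static_cast<std::uint16_t>(p[1] >> 16)) };
}
```

Values that are exactly representable as halves, such as 0.5 or -64, survive the round trip unchanged; anything else loses the low 13 mantissa bits, which is the precision trade-off accepted above.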

2. RWBuffer<uint> DrawArgsUAV;

This buffer stores the counter value from the bokeh data buffer above, to be consumed by the later draw call (DrawInstancedIndirect). This way we avoid the stall of reading the counter back to the CPU.

3. Texture2D<float4> AccumulationRT;

Used to accumulate the quad colors. It covers two viewports, one for the front layer and one for the back layer, so its size should be ( HalfViewportWidth, HalfViewportHeight * 2 + MAX_COC_IN_PIXELS ).


The rendering process is fairly straightforward:

1. Compute shader stage 1: For each pixel in the input depth/color texture, compute the COC from depth, then decide which layer it belongs to. Pixels in the front/back layers (those with abs(COC) > COC_THRESHOLD) are emitted into BokehDataUAV, incrementing the buffer’s counter:

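void EmitBokehVertex( float4 colorcoc, float fDepth, uint nSamples, uint2 loc )
{
    // Reserve a slot; IncrementCounter() returns the value before the increment.
    uint count = BokehDataUAV.IncrementCounter();

    BokehData data;
    // xy: pixel location mapped to clip space [-1, 1], with Y flipped.
    data.PositionEncoded = EncodeFloat4ToInt2(float4(
            float2(2, -2) * (loc * RcpFullViewportWH - 0.5f),
            ... ));   // z and w were lost from this listing; the vertex shader reads w as the signed COC
    data.ColorEncoded = EncodeFloat4ToInt2(float4(colorcoc.rgb, 1.0f));

    BokehDataUAV[count] = data;
}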
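As a quick check of the position math, here is a small C++ sketch of the mapping float2(2, -2) * (loc * RcpFullViewportWH - 0.5f). The names Float2 and PixelToClip are hypothetical and exist only for this illustration. The top-left pixel lands at (-1, 1) and the screen center at (0, 0), which is the Y-flipped clip space the vertex shader expects.

```cpp
#include <cassert>

struct Float2 { float x, y; };

// Mirrors float2(2, -2) * (loc * RcpFullViewportWH - 0.5f) from the shader.
inline Float2 PixelToClip(unsigned px, unsigned py, unsigned width, unsigned height) {
    assert(width > 0 && height > 0);
    const float rcpW = 1.0f / static_cast<float>(width);   // RcpFullViewportWH.x
    const float rcpH = 1.0f / static_cast<float>(height);  // RcpFullViewportWH.y
    return {  2.0f * (static_cast<float>(px) * rcpW - 0.5f),
             -2.0f * (static_cast<float>(py) * rcpH - 0.5f) };
}
```

For a 1920 * 1080 viewport (an assumed size), PixelToClip(0, 0, 1920, 1080) gives the top-left corner and PixelToClip(960, 540, 1920, 1080) gives approximately the origin.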


2. Compute shader stage 2: This is a very simple compute shader that copies the counter into the args buffer:


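[numthreads(1, 1, 1)]
void cs_copyCounter()
{
    // BokehCountUAV is the args buffer (listed as DrawArgsUAV above).
    // The four values match the DrawInstancedIndirect arguments.
    BokehCountUAV[0] = 4;                                // VertexCountPerInstance: one quad
    BokehCountUAV[1] = BokehDataUAV.IncrementCounter();  // InstanceCount: the pre-increment value is the number of emitted points
    BokehCountUAV[2] = 0;                                // StartVertexLocation
    BokehCountUAV[3] = 0;                                // StartInstanceLocation
}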
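The two compute stages can be mimicked on the CPU to see why no read-back is needed. This is only a sketch in C++ under stated assumptions: BokehPoint, BokehAppendBuffer, DrawInstancedArgs and MakeDrawArgs are hypothetical names, and std::atomic::fetch_add stands in for IncrementCounter(), since both return the value before the increment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BokehPoint { float x, y, coc; };  // hypothetical stand-in for BokehData

class BokehAppendBuffer {
public:
    explicit BokehAppendBuffer(std::size_t capacity) : data_(capacity), counter_(0) {}

    // Like IncrementCounter(): returns the value before the increment,
    // which is the slot the caller now owns.
    std::uint32_t IncrementCounter() { return counter_.fetch_add(1u); }

    void Emit(const BokehPoint& p) {
        const std::uint32_t slot = IncrementCounter();
        assert(slot < data_.size());  // the buffer must be sized for the worst case
        data_[slot] = p;
    }

    std::uint32_t Count() const { return counter_.load(); }

private:
    std::vector<BokehPoint> data_;
    std::atomic<std::uint32_t> counter_;
};

// Same member order as the arguments of DrawInstancedIndirect.
struct DrawInstancedArgs {
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};

// Stage 2: turn the counter into draw arguments, one 4-vertex quad per point.
inline DrawInstancedArgs MakeDrawArgs(const BokehAppendBuffer& buffer) {
    return { 4u, buffer.Count(), 0u, 0u };
}
```

The sketch reads the count with a separate Count() call; the shader above gets the same number from one extra IncrementCounter(), whose return value is the count before that increment.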


3. Draw the quads onto AccumulationRT with DrawInstancedIndirect(DrawArgsBuffer, 0), one instance per quad. A simple 4-vertex quad VB is the stream input, and the shader resource view of the BokehData buffer is bound as a shader resource, indexed by SV_InstanceID. The vertex shader expands the vertices by the COC size and places the quad in the proper viewport area. Since SV_ViewportArrayIndex is not writable from the vertex shader, we just use a MAD to offset the quad along the Y axis. (See image below: top for the front layer, bottom for the back layer.)

void vs_bokeh(
    float4 vPos : POSITION,
    uint instanceId : SV_InstanceID,
    out float4 vPosOut : SV_Position,
    out float4 vColorOut : TEXCOORD0,
    out float3 vTCOut : TEXCOORD1)
{
    // One instance per bokeh point: fetch and decode its packed data.
    BokehData dataEncoded = BokehDataSRV[instanceId];
    float4 dataPosition = DecodeFloat4FromInt2(dataEncoded.PositionEncoded);
    float4 dataColor = DecodeFloat4FromInt2(dataEncoded.ColorEncoded);

    // Expand the unit quad by the COC size around the bokeh center.
    vPosOut.xy = vPos.xy * abs(dataPosition.w) * RcpFullViewportWH / 2 + dataPosition.xy;

    // The sign of the COC selects the layer; shift the quad into that layer's area.
    if (dataPosition.w < 0)
    {
        vPosOut.y = vPosOut.y * DualViewportOffset.x + DualViewportOffset.y;
    }
    else
    {
        vPosOut.y = vPosOut.y * DualViewportOffset.x - DualViewportOffset.y;
    }

    vPosOut.z = 0.5f;
    vPosOut.w = 1.0f;
    vColorOut = dataColor;
    vTCOut = float3((vPos.xy + 1.0f) * 0.5f, dataPosition.z);
}


4. Resolve the texture into the final RT.


The method removes the dependency on the geometry shader and reduces redundant vertex shader work. The compute shader stage is rather fast, and the method is fill-rate bound. With a maximum bokeh size of 64 * 64 across the entire screen, it runs at 90 fps at 1080p (0.3 ms for the CS, 8 ms for drawing the quads) and 700 fps at 480p on my Radeon 6870. The structured buffer’s internal counter (IncrementCounter) seems to be much faster than atomic operations directly on the unordered access view (InterlockedAdd): I first tried InterlockedAdd directly on the args buffer, but the compute shader then took almost 30 ms to execute.

Future work:

The composition of the 3 layers seems problematic to me: it results in artifacts such as back-layer leaking and ugly hard edges on the front layer. The naive blending method needs to be improved. I also want to implement the geometry shader version and compare the performance. Since this is not a scientific article, I’m just posting the results so far to share the idea, and I would like to discuss it with anyone interested in the topic.

PS: SIGGRAPH 2012 is near! I look forward to seeing talented ideas about DOF and all the other interesting things in real-time and non-real-time rendering soon. \o/

Inside AMD 7900 ‘Leo’ demo: A bit more about their tile based forward lighting

February 6, 2012

After unzipping their demo and debugging it with GPUPerfStudio, I found that my previous understanding of their method was quite close to the truth :). But their implementation looks much more efficient in terms of light culling (<0.1 ms in the demo), and the key differences are listed here. Note that the demo team left in many experimental code paths, but most of them are not used in the end, so I will only talk about what actually runs in the final demo.

  • Two compute shaders are actually invoked per frame: one computes the frustum for each tile, the other does the culling. Intel’s demo, in contrast, does Z reduction, frustum computation, and lighting in a single shader.
  • Differences in thread dispatching:
    – 32 * 32 pixels per tile (vs 16 * 16);
    – 64 threads in the thread group for each tile (vs 16 * 16);
    – Groups are dispatched in 1D ( TileCount, 1, 1 ) (vs 2D ( TileCol, TileRow, 1 )).
  • They use a 1D buffer for storing light indices.
  • They use two buffers for light properties, one for position and range and one for color, so there is less data to bind during the culling pass.
  • They do not cull with Z reduction. (A Z pre-pass does run in the demo, and there is a branch that does the Z reduction, but it is not used, at least in this release. Maybe they will provide an option to toggle it in a later release.)
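To put numbers on the two dispatch layouts, here is a small C++ sketch. The 1280 * 720 resolution in the usage note and the names DispatchDims, DivRoundUp, Dispatch2D and Dispatch1D are assumptions made for illustration; neither demo's host code is reproduced here.

```cpp
#include <cassert>
#include <cstdint>

struct DispatchDims { std::uint32_t x, y, z; };

// Number of tiles needed to cover `value` pixels, rounding up for partial tiles.
inline std::uint32_t DivRoundUp(std::uint32_t value, std::uint32_t divisor) {
    assert(divisor > 0);
    return (value + divisor - 1u) / divisor;
}

// 2D layout: one thread group per tile, addressed as (TileCol, TileRow, 1).
inline DispatchDims Dispatch2D(std::uint32_t width, std::uint32_t height,
                               std::uint32_t tileSize) {
    return { DivRoundUp(width, tileSize), DivRoundUp(height, tileSize), 1u };
}

// 1D layout: the same tiles flattened to (TileCount, 1, 1).
inline DispatchDims Dispatch1D(std::uint32_t width, std::uint32_t height,
                               std::uint32_t tileSize) {
    return { DivRoundUp(width, tileSize) * DivRoundUp(height, tileSize), 1u, 1u };
}
```

At 1280 * 720, 16 * 16 tiles dispatched in 2D give (80, 45, 1) groups, while 32 * 32 tiles dispatched in 1D give (920, 1, 1), because 720 / 32 rounds up to 23 rows of 40 tiles.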

I tried these hints in my test program and the result clearly improved. I still kept the depth reduction in the culling shader. Since there are now 64 threads for 32 * 32 pixels, each thread can calculate a minZ/maxZ over its own 16 pixels first and then do a single interlocked min/max to write to group-shared memory. I assume this helps performance. The same could apply when writing light indices to group-shared memory.
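The two-level reduction can be sketched sequentially in C++. This is not the shader itself: the loop over `thread` stands in for the 64 GPU threads, the merge into `shared` stands in for the interlocked min/max on group-shared memory, and the names DepthRange and ReduceTileDepth are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct DepthRange { float minZ, maxZ; };

constexpr std::size_t kTilePixels = 32 * 32;
constexpr std::size_t kThreadsPerTile = 64;
constexpr std::size_t kPixelsPerThread = kTilePixels / kThreadsPerTile;  // 16

// tileDepth holds the 1024 depth values of one tile.
inline DepthRange ReduceTileDepth(const std::vector<float>& tileDepth) {
    assert(tileDepth.size() == kTilePixels);
    const DepthRange empty = { std::numeric_limits<float>::max(),
                               std::numeric_limits<float>::lowest() };
    DepthRange shared = empty;  // plays the role of group-shared memory
    for (std::size_t thread = 0; thread < kThreadsPerTile; ++thread) {
        // Step 1: each thread reduces its own 16 pixels privately.
        DepthRange local = empty;
        for (std::size_t i = 0; i < kPixelsPerThread; ++i) {
            const float z = tileDepth[thread * kPixelsPerThread + i];
            local.minZ = std::min(local.minZ, z);
            local.maxZ = std::max(local.maxZ, z);
        }
        // Step 2: one merge per thread; on the GPU this is the interlocked min/max.
        shared.minZ = std::min(shared.minZ, local.minZ);
        shared.maxZ = std::max(shared.maxZ, local.maxZ);
    }
    return shared;
}
```

The point of the split is the number of contended operations: 64 merges per tile instead of one per pixel (1024).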

Here is a scene with 1024 big lights and fair depth complexity. It runs at around 200 fps, and light culling takes 0.9 ms according to GPUPerfStudio:

Thanks for the rock model exported from UDK used in the test; copyright belongs to the original author. And thanks to AMD’s demo team for this great implementation. As mentioned in the last post, I believe the true power of this approach is the ability to use much more complicated shading models, and I hope this could be “the way DX11’s meant to be played”…

A new era of forward shading is coming?

January 30, 2012

A new era of forward shading is coming?! A glance at forward shading for massive lights with compute shader light culling, from the HD 7970 demo ‘Leo’.

The past few years have been dominated by deferred shading. It seems everyone likes lights, massive numbers of lights, from STALKER to Killzone and StarCraft II. Crytek said they are “a bit more deferred” in CE3, and even Unreal is doing deferred shading in their ‘Samaritan’ demo. DICE, the explorers, did a good job with ‘tile-based’ deferred shading using compute shaders in their Frostbite 2 engine. However, they didn’t realize that this could bring about the end of the deferred shading era.

In the recently released demo for their new HD 7970 card, AMD managed to do forward shading with massive lights using a compute shader. After downloading their demo and experimenting with an existing piece of code from Intel, I think I get the point, so let’s share the idea. For those who are not familiar with compute shader tile-based deferred shading, I’d suggest looking at Intel’s article first. My test is based on their code as well.

My implementation is quite similar to DICE’s tile-based deferred shading. However, instead of writing a G-buffer in the geometry pass, I just disable color writes and do a Z-only pass. Then I run the tile-based compute shader, which performs these steps:

1. Sample the Z buffer.

2. Calculate the min-Z and max-Z for each 16*16 tile (a thread group) with interlocked operations and store them in group-shared memory. (However, you could skip the steps above and just use the near and far clip planes, for a pure ‘forward’ approach.)

3. For each light, test whether it intersects the frustum of the current tile. (This runs in parallel across the threads in the group, so with 1024 lights each of the 256 threads only does 4 intersection tests.)

(So far this is the same as the original tile-based deferred shading.)

4. Interlocked-increment the group-shared light counter and write the light index to LightIndexUAV (an unordered access view) at the location calculated from the counter. LightIndexUAV is the same size as the viewport (it probably needs an edge extension), with format R16_UINT, so there are 16*16 texels per tile for storing light indices.

Then unbind the UAV, bind it as an SRV, and run forward shading for each renderable object. This time, for the tile the shaded pixel lies in, we keep sampling LightIndexSRV until we reach the clear color (0xffff in my case). For each sampled light index, we fetch the actual light data from the light buffer and accumulate the lighting as in normal forward shading.
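The write in step 4 and the read-back in the forward pass can be sketched on the CPU in C++. This is an illustration under assumptions, not the shader code: TileLightLists and its members are hypothetical names, a flat array replaces the viewport-sized texture, and reserving the last slot of each tile as a terminator is my own choice for the sketch, since the post does not say what happens when a tile fills up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t kEndOfList = 0xFFFF;    // the "clear color" sentinel
constexpr std::size_t kSlotsPerTile = 16 * 16;  // one R16_UINT texel per tile pixel

class TileLightLists {
public:
    explicit TileLightLists(std::size_t tileCount)
        : slots_(tileCount * kSlotsPerTile, kEndOfList), counts_(tileCount, 0) {}

    // Culling pass: append a light that intersects the tile's frustum.
    // Returns false when the tile is full; the last slot stays a terminator.
    bool Append(std::size_t tile, std::uint16_t lightIndex) {
        assert(tile < counts_.size() && lightIndex != kEndOfList);
        if (counts_[tile] + 1 >= kSlotsPerTile) return false;
        slots_[tile * kSlotsPerTile + counts_[tile]] = lightIndex;
        ++counts_[tile];  // on the GPU this is the interlocked increment
        return true;
    }

    // Shading pass: walk the tile's slots until the sentinel is reached.
    std::vector<std::uint16_t> Lights(std::size_t tile) const {
        assert(tile < counts_.size());
        std::vector<std::uint16_t> result;
        for (std::size_t i = 0; i < kSlotsPerTile; ++i) {
            const std::uint16_t index = slots_[tile * kSlotsPerTile + i];
            if (index == kEndOfList) break;
            result.push_back(index);
        }
        return result;
    }

private:
    std::vector<std::uint16_t> slots_;
    std::vector<std::size_t> counts_;
};
```

Because the reader stops at the first sentinel, the forward pass never needs the per-tile counter, only the cleared texture.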

Finally, we get a shaded scene. (There’s no difference, right? But notice the ‘No cull forward’.)

Performance (in fps; Power Plant scene, camera at the initial position, 720p, 1024 lights, on a Radeon HD 6870):

MSAA        off     2x      4x      8x
Forward     161     139     122     91
Deferred    262     159     123     69

One thing that makes me unhappy is that the performance does not beat deferred shading with no or low MSAA, so the approach still needs optimization. However, it does give us the possibility of using native MSAA and more complex shading models, right?

The code is attached here if you want to try it or help me with optimization (please rename it to .zip). You need to download Intel’s demo and replace these files. The media files are just too big to upload here, sorry.

By the way, this is not meant to start a war between deferred and forward. I’m a fan of deferred shading myself, LOL.

At last, is the ‘Leo’ Demo making fun of somebody?

Compile Error C3918 for events in C++/CLI

October 13, 2010

In C#, we can check if an event variable is null before firing the event. For example:

public class MyClass
{
  public event EventHandler MyEvent;
  public void FireEvent()
  {
    // Check if there is any event handler registered.
    if (MyEvent != null) { MyEvent(this, new EventArgs()); }
  }
}

But if we do the same thing in C++/CLI, we get compile error C3918.

ref class MyClass
{
  event EventHandler^ MyEvent;
  void FireEvent()
  {
    if (MyEvent != nullptr) // C3918
      MyEvent(this, gcnew EventArgs());
  }
};

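Here is the solution: declare the event with explicit add, remove, and raise accessors backed by a delegate field, and do the null check on that field: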

public ref class MyClass
{
  EventHandler^ m_myEvent;

public:
  event EventHandler^ MyEvent
  {
    void add(EventHandler^ handler) { m_myEvent += handler; }
    void remove(EventHandler^ handler) { m_myEvent -= handler; }
    void raise(Object^ sender, EventArgs^ e)
    {
      // Check if there is any event handler registered.
      if (m_myEvent != nullptr)
        m_myEvent->Invoke(sender, e);
    }
  }
};

Categories: Uncategorized

Kim Jong-il, please leave your feces behind (repost)

May 11, 2010

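... studying is a small matter; what matters is not letting it compromise one’s character. Take those people who are always tattling to the teacher and selling out their own side: I never pay them any attention. I have my principles.

The last time our country did something like this was more than a hundred years ago, when Yuan Shikai was president.

... When these two satanic nations come together, before God destroys them, if you are a righteous person, flee Sodom and do not look back.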




Categories: Entertainment

Let’s get rid of DirectInput

April 22, 2010

"DirectInput is a set of API calls that abstracts input devices on the
system. Internally, DirectInput creates a second thread to read
WM_INPUT data, and using the DirectInput APIs will add more overhead
than simply reading WM_INPUT directly. DirectInput is only useful for
reading data from DirectInput joysticks; however, if you only need to
support the Xbox 360 controller for Windows, then use XInput
instead. Overall, using DirectInput offers no advantages when reading
data from mouse or keyboard devices, and the use of DirectInput in these
scenarios is discouraged."

I’m currently working on a wrapper class for keyboard and mouse input using WM_INPUT and raw input data. It will be posted here soon.

Categories: Uncategorized

Extensions to Luna to support userdata as return values from a lua-c call

April 21, 2010
As a helper for Lua/C interop, Luna is pretty small and elegant, with no need for Boost or Python.
But there seems to be no support for returning userdata (such as a pointer to a C++ class that will be used in Lua) from a C function, so we made a small extension.

With a public member function "push_userdata":

    static void push_userdata(lua_State *L, T* pT) {
        // Wrap the existing object in a new userdata and attach the class metatable.
        userdataType *ud =
            static_cast<userdataType*>(lua_newuserdata(L, sizeof(userdataType)));
        ud->pT = pT;
        luaL_getmetatable(L, T::className);
        lua_setmetatable(L, -2);
    }

and a few macros:

#define PUSH_USERDATA(ClassName, LuaState, UserDataPtr) Luna<ClassName>::push_userdata(LuaState, UserDataPtr);
#define DECL_SCRIPT_METHOD(FunctionName) int FunctionName(lua_State* L)
#define IMPL_SCRIPT_METHOD(ClassName, FunctionName) int ClassName::FunctionName(lua_State* L)

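So we can use it like this:

//in .h
class B;
class A {
public:
    B* m_pB;
    B* GetB() { return m_pB; }
    DECL_SCRIPT_METHOD(LuaGetB);  // the method name was lost from this listing; LuaGetB is a placeholder
};

//in .cpp
IMPL_SCRIPT_METHOD(A, LuaGetB)
{
    // ClassName must match the type of the pointer being pushed (B here).
    PUSH_USERDATA(B, L, this->GetB())
    return 1;
}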

It seems to work correctly. A new userdata is created each time the method is called from Lua.
I wonder whether there are safety or efficiency problems with this implementation. Further tests are under way, and I welcome your advice 🙂

Categories: Uncategorized