Geometry-Shader-Free Bokeh Depth of Field Rendering

August 1, 2012

Depth of field (DOF) is a popular post-process effect in real-time rendering. A lot of research in recent years has gone into improving both the visual quality and the efficiency of DOF. The fastest and most widely used methods are still gather-based, due to the nature of graphics hardware. There are also scatter-based methods, such as heat-diffusion DOF (used in Metro 2033) and textured-quad Bokeh DOF (used in the Unreal Samaritan demo and Crysis 2). The latter gives very nice, artist-tweakable bokeh shapes as well as proper foreground leaking, both of which are difficult to achieve with other methods. However, it consumes a lot of fill rate, and it requires a geometry shader to expand the bokeh quads, which can be slow on some hardware. (The Crytek guys mentioned “Future: avoid geometry shader usage”, and I suppose people as smart as them have already done that.) In this post I’m going to present my own naive implementation of a geometry-shader-free Bokeh DOF, hoping to hear your better ideas on the topic.

The implementation is based on the UAV's internal atomic counter. We use a compute shader to collect bokeh data from the scene depth/color render targets and emit it into an unordered access buffer. This buffer is then read as input by the bokeh vertex shader, and the pixel shader accumulates the bokeh color, with the front and back layers rendered into different areas of the accumulation render target. Finally, the result is resolved and combined into the final scene.

Resource used:

1. RWStructuredBuffer<BokehData> BokehDataUAV;

This structured buffer stores the bokeh information, including pixel position and color. The UAV needs to be created with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.

BokehData needs to hold two float4 values, one for position and one for color, but we do not need 32-bit precision. So we encode each float4 into an int2 with the f32tof16() intrinsic and decode it later in the vertex shader with f16tof32(). Note that half4 doesn’t help here: it compiles, but the values read by the vertex shader are not correct. I wonder whether it is a bug that the compiler doesn’t report a problem, given that the SDK documentation says “this data type is only provided for language compatibility”.

2. RWBuffer<uint> DrawArgsUAV;

This buffer stores the counter value from the bokeh data buffer above, to be consumed by the later draw call (DrawInstancedIndirect). This way we avoid the stall of reading the counter back to the CPU.

3. Texture2D<float4> AccumulationRT;

Used to accumulate the quad colors. It covers two viewports, one for front layer, another for back layer, so its size should be ( HalfViewportWidth, HalfViewportHeight * 2 + MAX_COC_IN_PIXELS ).
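To make the half-precision packing of BokehData concrete, here is a minimal CPU-side sketch in C++ of what EncodeFloat4ToInt2 / DecodeFloat4FromInt2 might do. The bodies of those two helpers are not shown in this post, so the bit layout (x and z in the low 16 bits, y and w in the high 16 bits) and the Float4 / UInt2 structs are assumptions for illustration. The conversion truncates the mantissa and flushes half denormals to zero, which is simpler than what f32tof16() does on the GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Float4 { float x, y, z, w; };   // hypothetical stand-in for HLSL float4
struct UInt2  { uint32_t x, y; };      // hypothetical stand-in for HLSL int2/uint2

// Float -> IEEE 754 half bits. Truncates the mantissa, flushes values that are
// too small to signed zero and clamps values that are too large to infinity.
uint16_t FloatToHalf(float f) {
    assert(f == f && "NaN is not handled by this sketch");
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    const uint32_t sign     = (bits >> 16) & 0x8000u;
    const int32_t  exponent = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mantissa = (bits >> 13) & 0x3FFu;
    if (exponent <= 0)  return static_cast<uint16_t>(sign);
    if (exponent >= 31) return static_cast<uint16_t>(sign | 0x7C00u);
    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exponent) << 10) | mantissa);
}

// IEEE 754 half bits -> float (half denormals are treated as zero).
float HalfToFloat(uint16_t h) {
    const uint32_t sign     = (static_cast<uint32_t>(h) & 0x8000u) << 16;
    const uint32_t exponent = (h >> 10) & 0x1Fu;
    const uint32_t mantissa = h & 0x3FFu;
    uint32_t bits;
    if (exponent == 0)       bits = sign;
    else if (exponent == 31) bits = sign | 0x7F800000u | (mantissa << 13);
    else                     bits = sign | ((exponent - 15u + 127u) << 23) | (mantissa << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Pack four floats into two 32-bit words, two halves per word (assumed layout).
UInt2 EncodeFloat4ToInt2(const Float4& v) {
    UInt2 r;
    r.x = static_cast<uint32_t>(FloatToHalf(v.x)) | (static_cast<uint32_t>(FloatToHalf(v.y)) << 16);
    r.y = static_cast<uint32_t>(FloatToHalf(v.z)) | (static_cast<uint32_t>(FloatToHalf(v.w)) << 16);
    return r;
}

Float4 DecodeFloat4FromInt2(const UInt2& p) {
    return { HalfToFloat(static_cast<uint16_t>(p.x & 0xFFFFu)),
             HalfToFloat(static_cast<uint16_t>(p.x >> 16)),
             HalfToFloat(static_cast<uint16_t>(p.y & 0xFFFFu)),
             HalfToFloat(static_cast<uint16_t>(p.y >> 16)) };
}
```

With this layout one BokehData entry takes 16 bytes instead of 32, which halves the bandwidth of both the compute shader writes and the vertex shader reads.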

Rendering:

The rendering process is fairly straightforward:

1. Compute shader stage 1: For each pixel in the input depth/color texture, compute the COC from depth, then decide which layer the pixel belongs to. Pixels in the front/back layers (those with abs(COC) > COC_THRESHOLD) are emitted into BokehDataUAV, and the buffer’s counter is incremented:

void EmitBokehVertex( float4 colorcoc, float fDepth, uint nSamples, uint2 loc )
{
    uint count = BokehDataUAV.IncrementCounter();
    BokehData data;
    data.PositionEncoded = EncodeFloat4ToInt2(float4(
            float2(2, -2) * (loc * RcpFullViewportWH - 0.5f),
            fDepth,
            colorcoc.a));
    data.ColorEncoded = EncodeFloat4ToInt2(float4(colorcoc.rgb, 1.0f));
    BokehDataUAV[count] = data;
}

2. Compute shader stage 2: This is a very simple compute shader that copies the counter into the args buffer:

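[numthreads(1,1,1)]
void cs_copyCounter()
{
    // Argument layout expected by DrawInstancedIndirect:
    BokehCountUAV[0] = 4;                               // vertex count per instance (one quad)
    BokehCountUAV[1] = BokehDataUAV.IncrementCounter(); // instance count (returns the pre-increment value)
    BokehCountUAV[2] = 0;                               // start vertex location
    BokehCountUAV[3] = 0;                               // start instance location
}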

3. Draw the quads onto AccumulationRT with DrawInstancedIndirect(DrawArgsBuffer, 0), one instance per quad. A simple 4-vertex quad VB is used as the stream input, and the shader resource view of the BokehData buffer is bound as a shader resource, indexed by SV_InstanceID. The vertex shader expands the vertices by the COC size and puts the quad into the proper viewport area. Since SV_ViewportArrayIndex is not writable from the vertex shader, we just use a MAD to move the quad by the proper offset along the Y axis. (See the image below: top for the front layer, bottom for the back layer.)

void vs_bokeh(
    float4 vPos : POSITION,
    uint instanceId : SV_InstanceID,
    out float4 vPosOut : SV_Position,
    out float4 vColorOut : TEXCOORD0,
    out float3 vTCOut : TEXCOORD1)
{
    BokehData dataEncoded = BokehDataSRV[instanceId];
    float4 dataPosition = DecodeFloat4FromInt2(dataEncoded.PositionEncoded);
    float4 dataColor = DecodeFloat4FromInt2(dataEncoded.ColorEncoded);
    vPosOut.xy = vPos.xy * abs(dataPosition.w) * RcpFullViewportWH / 2 + dataPosition.xy;
    if(dataPosition.w < 0)
    {
        vPosOut.y = vPosOut.y * DualViewportOffset.x + DualViewportOffset.y;
    }
    else
    {
        vPosOut.y = vPosOut.y * DualViewportOffset.x - DualViewportOffset.y;
    }
    vPosOut.z = 0.5f;
    vPosOut.w = 1.0f;
    vColorOut = dataColor;
    vTCOut = float3((vPos.xy + 1.0f) * 0.5f, dataPosition.z);
}

[Image: accumulation buffer with dual viewports, front layer on top, back layer at the bottom]

4. Resolve the texture into final RT.
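The vertex shader in step 3 relies on a DualViewportOffset constant whose values are not given in this post. As a rough sketch, assuming the accumulation target stacks the two layers vertically with a padding band between them (as the AccumulationRT size suggests), the scale and offset could be derived like this in C++. MakeDualViewportOffset, RemapY and the struct are hypothetical names.

```cpp
#include <cassert>

// Hypothetical CPU-side mirror of the DualViewportOffset shader constant:
// scale corresponds to .x and offset to .y in the vertex shader.
struct DualViewportParams {
    float scale;
    float offset;
};

// layerHeight: height of one layer in pixels (HalfViewportHeight).
// padding:     gap between the two layers in pixels (MAX_COC_IN_PIXELS).
DualViewportParams MakeDualViewportOffset(float layerHeight, float padding) {
    assert(layerHeight > 0.0f && padding >= 0.0f);
    const float atlasHeight = 2.0f * layerHeight + padding;
    // Each layer covers layerHeight / atlasHeight of the NDC range, and its
    // centre sits (layerHeight + padding) / atlasHeight away from NDC y = 0.
    return { layerHeight / atlasHeight, (layerHeight + padding) / atlasHeight };
}

// Same MAD as in vs_bokeh: the front layer moves up, the back layer moves down.
float RemapY(float ndcY, bool frontLayer, const DualViewportParams& p) {
    return frontLayer ? ndcY * p.scale + p.offset
                      : ndcY * p.scale - p.offset;
}
```

With these values the front layer maps exactly onto the top part of the target and the back layer onto the bottom part, and quads near a layer edge spill into the padding band rather than into the other layer.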

Performance:

The method removes the dependency on the geometry shader, and it reduces redundant vertex shader work. The compute shader stage is rather fast; the method as a whole is fill-rate bound. With a maximum bokeh size of 64 * 64 across the entire screen, it runs at 90 fps at 1080p (0.3 ms for the CS, 8 ms for drawing the quads), and at 700 fps at 480p on my Radeon 6870. It seems that the RWStructuredBuffer’s internal counter (with “IncrementCounter”) is much faster than atomic operations performed directly on the unordered access view (with “InterlockedAdd”). I first tried doing the InterlockedAdd directly on the args buffer, but the compute shader ended up taking almost 30 ms to execute.
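As a CPU analogy for the counter-based append used above, the following C++ sketch reserves a unique slot per emitted point with an atomic fetch-add and then derives the four indirect draw arguments from the final count. It is only an illustration under assumptions: BokehAppendBuffer and BokehPoint are hypothetical names, and the overflow clamp is my own addition that the shaders in this post do not perform.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BokehPoint { float x, y, depth, coc; };  // hypothetical, unpacked form

// The four-uint layout that DrawInstancedIndirect reads from its args buffer.
struct DrawInstancedIndirectArgs {
    uint32_t vertexCountPerInstance;
    uint32_t instanceCount;
    uint32_t startVertexLocation;
    uint32_t startInstanceLocation;
};

class BokehAppendBuffer {
public:
    explicit BokehAppendBuffer(std::size_t capacity) : points_(capacity), counter_(0) {
        assert(capacity > 0);
    }

    // Reserve a unique slot (the role IncrementCounter plays on the GPU) and
    // write the point into it. Returns false if the buffer is already full.
    bool Emit(const BokehPoint& p) {
        const uint32_t slot = counter_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= points_.size()) return false;
        points_[slot] = p;
        return true;
    }

    // Equivalent of the cs_copyCounter stage: one 4-vertex quad per point.
    DrawInstancedIndirectArgs MakeDrawArgs() const {
        uint32_t count = counter_.load(std::memory_order_relaxed);
        const uint32_t capacity = static_cast<uint32_t>(points_.size());
        if (count > capacity) count = capacity;  // clamp after overflowing emits
        return { 4u, count, 0u, 0u };
    }

private:
    std::vector<BokehPoint> points_;
    std::atomic<uint32_t> counter_;
};
```

The key property is that every emitter only touches the slot it reserved, so the only shared write is the single counter increment.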

Future work:

The composition of the 3 layers seems problematic to me. It results in artifacts such as the back layer leaking or ugly hard edges on the front layer, so the naive blending method needs to be improved. I also want to implement the geometry shader version and compare the performance. Since this is not a scientific article, I’m just posting the results I have so far to share the idea, and I would like to discuss it with anyone interested in the topic.

PS: SIGGRAPH 2012 is near! I hope to see talented ideas about DOF, and all the other interesting things in real-time and non-real-time rendering, soon. \o/

Inside AMD 7900 ‘Leo’ demo: A bit more about their tile based forward lighting

February 6, 2012

After some unzipping of their demo and some debugging with GPUPerfStudio, I found that my previous understanding of their method was quite close to the truth :). But their implementation looks much more efficient in terms of light culling (<0.1 ms in the demo), and here are the key differences. Note that the demo team wrote a lot of experimental branch code paths, many of which are not used in the end, so I’m only going to talk about what is actually running in the final demo.

  • There are actually two compute shaders invoked per frame: one computes the frustum for each tile, the other does the culling. In Intel’s demo, Z reduction, frustum computation, and lighting are all done in one shader.
  • Differences in thread dispatching:
    – 32 * 32 pixels per tile (vs 16 * 16);
    – 64 threads per tile (vs 16 * 16);
    – Groups are dispatched in 1D (TileCount, 1, 1) (vs 2D (TileCol, TileRow, 1)).
  • They use a 1D buffer for storing light indices.
  • They use two buffers for light properties, one for position and range, one for color, so there is less data to bind for the culling pass.
  • They are not culling with Z reduction. (There is actually a Z pre-pass run in the demo, and they have a branch that does the Z reduction, but it is not used, at least in this release. Maybe they will provide an option to toggle it in a later release.)

I tried these hints in my test program and the result clearly improved. I still kept the depth reduction in the culling shader. Since there are now 64 threads for 32 * 32 pixels, each thread can first calculate a minZ/maxZ for its own 16 pixels and then do a single interlocked min/max to write the result to group-shared memory. I assume this helps performance. The same could apply when writing light indices to group-shared memory.
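The two-level depth reduction described above can be sketched on the CPU in C++ as follows. This is only an illustration, not the shader from my test program: the 64 "threads" of a tile are emulated by a sequential loop, std::atomic stands in for group-shared memory with InterlockedMin/InterlockedMax, and depth is assumed to be non-negative so that float bit patterns compare like unsigned integers. All names are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kTileSize        = 32;                                      // 32 * 32 pixels per tile
constexpr int kThreadsPerTile  = 64;
constexpr int kPixelsPerThread = kTileSize * kTileSize / kThreadsPerTile; // 16

// Non-negative floats keep their ordering when reinterpreted as unsigned ints.
uint32_t DepthToBits(float d) { uint32_t u; std::memcpy(&u, &d, sizeof(u)); return u; }
float    BitsToDepth(uint32_t u) { float d; std::memcpy(&d, &u, sizeof(d)); return d; }

void AtomicMin(std::atomic<uint32_t>& target, uint32_t v) {
    uint32_t cur = target.load();
    while (v < cur && !target.compare_exchange_weak(cur, v)) {}
}

void AtomicMax(std::atomic<uint32_t>& target, uint32_t v) {
    uint32_t cur = target.load();
    while (v > cur && !target.compare_exchange_weak(cur, v)) {}
}

struct TileDepthRange { float minZ, maxZ; };

TileDepthRange ReduceTileDepth(const std::vector<float>& tileDepth) {
    assert(tileDepth.size() == static_cast<std::size_t>(kTileSize * kTileSize));
    std::atomic<uint32_t> sharedMin(0xFFFFFFFFu);  // stands in for group-shared memory
    std::atomic<uint32_t> sharedMax(0u);
    for (int thread = 0; thread < kThreadsPerTile; ++thread) {
        // Level 1: each thread reduces its own 16 pixels without any atomics.
        uint32_t localMin = 0xFFFFFFFFu;
        uint32_t localMax = 0u;
        for (int i = 0; i < kPixelsPerThread; ++i) {
            const uint32_t z = DepthToBits(tileDepth[thread * kPixelsPerThread + i]);
            localMin = std::min(localMin, z);
            localMax = std::max(localMax, z);
        }
        // Level 2: one atomic min and one atomic max per thread, not per pixel.
        AtomicMin(sharedMin, localMin);
        AtomicMax(sharedMax, localMax);
    }
    return { BitsToDepth(sharedMin.load()), BitsToDepth(sharedMax.load()) };
}
```

Per tile this issues 128 atomic operations instead of 2048, which is where I would expect the saving to come from.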

Here is a scene with 1024 big lights and fair depth complexity. It runs at around 200 fps, and light culling takes 0.9 ms according to GPUPerfStudio:

Thanks for the Rock model used in the test, which was exported from UDK; copyright belongs to the original author. And thanks to AMD’s demo team for this great implementation. As mentioned in the last post, I believe the true power of this approach is the ability to use much more complicated shading models, and I hope this could be “the way DX11’s meant to be played”…

A new era of forward shading is coming?

January 30, 2012

A new era of forward shading is coming?! A glance at forward shading for massive lights with compute shader light culling, from the HD 7970 demo ‘Leo’.

The past years have been dominated by deferred shading. It seems everyone likes lights, massive lights, from STALKER to KillZone and StarCraft II. Crytek said they are “a bit more deferred” in CE3, and even Unreal is doing deferred shading in their ‘Samaritan’ demo. DICE, the explorers, did a good job with ‘tile-based’ deferred shading using compute shaders in their Frostbite 2 engine. However, they didn’t realize that this could bring the era of deferred shading to an end.

In the recently released demo for their new HD 7970 card, AMD managed to do forward shading with massive lights, using a compute shader. After downloading their demo and experimenting with an existing piece of code from Intel, I think I got the point, so let’s share the idea. For those who are not familiar with compute shader tile-based deferred shading, I’d suggest you look at Intel’s article first. My test is also based on their code.

My implementation is quite similar to DICE’s tile-based deferred shading. However, instead of writing a G-buffer in the geometry pass, I just disable color writes and do a Z-only pass. Then I run the tile-based compute shader, which does the following steps:

1. Sample the Z buffer.

2. Calculate the min-Z and max-Z for each 16*16 tile (one thread group) with interlocked operations, and store them in group-shared memory. (However, you could just use the near and far clip planes instead of the steps above, for a purely ‘forward’ approach.)

3. For each light, test whether the light intersects the frustum of the current tile. (This runs in parallel across the threads in the group, so if you have 1024 lights, each thread only does 4 intersection tests.)

(So far this is the same as the original tile-based deferred shading.)

4. Do an interlocked increment of the group-shared light counter and write the light index to LightIndexUAV (an unordered access view) at the location calculated from the light counter. The LightIndexUAV is the same size as the viewport (it probably needs an edge extension), with format R16_UINT, so there are 16*16 texels available to store the light indices for each tile.

Then unbind the UAV, bind it as an SRV, and run forward shading for each renderable object. This time we keep sampling from the LightIndexSRV for the tile that the current pixel lies in until we reach the clear value (0xffff in my case). For each sampled light index, we fetch the actual light data from the light buffer and accumulate the lighting as in normal forward shading.
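Step 4 and the shading loop above can be sketched for a single tile in C++ like this. It is a simplified CPU illustration under assumptions: the tile's 16*16 texels are modelled as a flat array, the frustum test is replaced by a precomputed flag per light, and "lighting" is reduced to summing one intensity per light. BuildTileLightList and ShadePixel are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t    kEndOfList    = 0xFFFF;   // clear value of the R16_UINT target
constexpr std::size_t kSlotsPerTile = 16 * 16;  // one texel per stored light index

// Culling side: append the index of every light that touches the tile.
// lightHitsTile[i] stands in for the light/frustum intersection test.
std::vector<uint16_t> BuildTileLightList(const std::vector<bool>& lightHitsTile) {
    assert(lightHitsTile.size() < kEndOfList);               // indices must fit below the sentinel
    std::vector<uint16_t> slots(kSlotsPerTile, kEndOfList);  // "cleared" tile
    std::size_t counter = 0;                                 // the group-shared light counter
    for (std::size_t i = 0; i < lightHitsTile.size(); ++i) {
        if (lightHitsTile[i] && counter < kSlotsPerTile) {
            slots[counter++] = static_cast<uint16_t>(i);
        }
    }
    return slots;
}

// Shading side: walk the tile's list until the clear value (or the end of the
// tile's storage) is reached, accumulating each referenced light.
float ShadePixel(const std::vector<uint16_t>& slots,
                 const std::vector<float>& lightIntensity) {
    float result = 0.0f;
    for (std::size_t i = 0; i < slots.size() && slots[i] != kEndOfList; ++i) {
        result += lightIntensity[slots[i]];
    }
    return result;
}
```

Note that with 16*16 texels per tile, at most 256 lights can be stored for one tile, and a completely full tile has no terminating 0xffff, which is why the loop is also bounded by the slot count.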

Finally we get a shaded scene (there’s no difference, right? But notice the ‘No cull forward’):

Performance (in fps; Power Plant scene, camera at initial position, 720p, 1024 lights, on a Radeon HD 6870):

MSAA        off    2x     4x     8x
Forward     161    139    122    91
Deferred    262    159    123    69

One thing that makes me unhappy is that the performance does not beat deferred shading with no or low MSAA, so the approach still needs to be optimized. However, it does give us the possibility of using native MSAA and more complex shading models, right?

The code is attached here if you want to try it or help me with optimization. (Please rename it to .zip.) You need to download Intel’s demo and replace these files. The media files are just too big to upload here, sorry.

By the way, this is not meant to start a war between deferred and forward. I’m a fan of deferred shading myself, LOL.

At last, is the ‘Leo’ Demo making fun of somebody?

Compile Error C3918 for events in C++/CLI

October 13, 2010

In C#, we can check if an event variable is null before firing the event. For example:

public class MyClass
{
  public event EventHandler MyEvent;
  public void FireEvent()
  {
    // Check if there is any event handler registered.
    if (MyEvent != null) { MyEvent(this, new EventArgs()); }
  }
}

But if we do the same thing in C++/CLI, we get compile error C3918.

ref class MyClass
{
public:
  event EventHandler^ MyEvent;
  void FireEvent()
  {
    if(MyEvent != nullptr) // C3918
    {
      MyEvent(this, gcnew EventArgs());
    }
  }
};

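Here is the solution: declare the event with explicit add, remove, and raise accessors backed by a private delegate field, and do the null check inside raise: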

public ref class MyClass
{
  EventHandler^ m_myEvent;
public:
  event EventHandler^ MyEvent
  {
    void add(EventHandler^ handler) { m_myEvent += handler; }
    void remove(EventHandler^ handler) { m_myEvent -= handler; }
    void raise(Object^ sender, EventArgs^ e)
    {
      // Check if there is any event handler registered.
      if (m_myEvent != nullptr)
      {
        m_myEvent->Invoke(sender, e);
      }
    }
  }
};

Categories: Uncategorized

Kim Jong-il, Please Leave Your Feces Behind (repost)

May 11, 2010
Author: Sun Yuchen

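When I was very small, my kindergarten teacher told me that you judge a person's character mainly by his friends. My parents, too, warned me from an early age not to make bad friends: harming my studies would be a small matter; what mattered was not letting them harm my character. So, for example, I never had anything to do with people who were always telling tales to the teacher or selling out their own side. I have my principles.

Later I grew up and studied history, and I felt more and more that a country is like a person: you judge a country's character mainly by the friends it keeps. Take our own country: North Korea, Cuba, and lately Venezuela as well. I believe everyone knows what sort these countries are and whether they are any good. You don't need to visit each of them and then make a trip to the United States for comparison to reach a conclusion.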

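Cuba: Castro was emperor for fifty years and only handed the seat to Raúl Castro in 2008. This is called fraternal succession; the last time China practised such a system was 3,610 years ago, back in the age of slave society. Cuba's constitution is really something. Unlike our country, which at least has eight democratic parties to do the clapping and serve as a flower vase, it states in black and white that "the only right that can never be conceded in this country is the right to let counter-revolutionaries reorganize against the motherland"; it won't even bother to put out a vase. After Venezuela's President Chávez was re-elected, he went so far as to amend the constitution and abolish term limits for elected officials so that he could serve yet another term. The last time our country did something like that was about a hundred years ago, when Yuan Shikai was president.

But next to North Korea, these countries are mere fools in the presence of a true master.

North Korea's throne is about to pass to its third generation. It is father-to-son succession, which at least counts as a thousand years more advanced than Cuba's brother-to-brother version. Fatty Kim, the world's biggest single customer for French Hennessy cognac, has made his main contribution to North Korea by letting most people starve to death and a minority starve without dying. Saddled with such a "fifty-cent emperor", the North Korean people must have offended Muhammad, Jesus Christ, or Shakyamuni in a previous life; their luck rivals that of our parents' generation.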

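Making friends with such a country is shocking enough, but our country evidently wants to go further down this road.
When two remarkable countries meet, even more remarkable things happen. On his recent visit to China, Kim Jong-il's party rented all 360 presidential suites in the west wing of the Furama Hotel in Dalian. Kim Jong-il's own presidential suite covers 750 square metres, comes with a jacuzzi and steam bath, and costs 300,000 yuan a day. One really wonders whether they keep elephants in there. The car he rode in was a Mercedes Maybach, worth 5 million yuan, and the large accompanying North Korean delegation had more than 40 cars and buses. The high spirits and imperial air reminded me of Emperor Yang of Sui touring the south; let's hope he came to entrust his heir and not to pick concubines. Since this is an imperial tour of the south, the bill is of course footed by the servants doing the hosting: all accommodation and transport costs for Kim Jong-il's party were borne by the Chinese government.

Of the Hu-Kim meeting, a verse comments: "In the throne hall, two of a kind; by the Yalu River, two hoodlums." To summarize the Hu-Kim meeting, the long-winded five-point proposal can be understood as: 1. Talk things over with me; 2. I'll cover for you; 3. If you're short of money, just ask; 4. I'll train your people and supply the technology; 5. It's just us two sworn buddies now (
http://news.xinhuanet.com/2010-05/07/c_1278775.htm ).
Sure enough, as soon as the talks were over, Kim Jong-il took 100 million US dollars away from China. Now I know another Chinese characteristic of how our country distributes wealth: taxpayers' money can be taken not only by corrupt officials and embezzled by delegates; it turns out Fatty Kim gets a share too.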

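Something even more earth-shaking is still to come: why is China so good to North Korea? There is one answer: we are doing it for Kim Jong-il's feces.

When Kim Jong-il visited China in January 2006, our side, wanting to learn about his state of health, collected urine from the toilet he had used. Unfortunately, this year Fatty Kim found out about the idea. South Korea's Dong-A Ilbo reported: "To prevent information about Kim Jong-il's health from leaking, North Korea shipped all of Kim Jong-il's urine and feces back to North Korea during his stay in China." I hear it was even packed into containers and shipped out of Shanghai. If that is true, the Shanghai World Expo really has become the Expo "with the heaviest load of dung in all of history". This reminds me of the blackboard protection at Shaanxi Normal University; a case of the pot calling the kettle black, I suppose.

Still, with this much money already spent, as a taxpayer I sincerely hope Kim Jong-il will leave his feces behind. If he takes even the feces away, it will seriously hurt the feelings of the Chinese people and cast a disharmonious shadow over the long-standing relationship between China and North Korea.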

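But on second thought, whether the feces stay or not hardly matters; it is all Sodom anyway. God once created Sodom, but Sodom was so shocking that God took it for a creation of Satan. Now that these two satanic countries have come together, before God destroys it, if you are a righteous person, please flee Sodom and do not look back.

Because if you look back, you too will turn into a pillar of stone.

May 10, 2010

At Peking University

Categories: Entertainment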

Let’s get rid of DirectInput

April 22, 2010
http://msdn.microsoft.com/en-us/library/ee418864.aspx

"DirectInput is a set of API calls that abstracts input devices on the
system. Internally, DirectInput creates a second thread to read
WM_INPUT data, and using the DirectInput APIs will add more overhead
than simply reading WM_INPUT directly. DirectInput is only useful for
reading data from DirectInput joysticks; however, if you only need to
support the Xbox 360 controller for Windows, then use XInput
instead. Overall, using DirectInput offers no advantages when reading
data from mouse or keyboard devices, and the use of DirectInput in these
scenarios is discouraged."

I’m currently working on a wrapper class for keyboard and mouse input based on WM_INPUT and raw input data. It will be posted here soon.

Categories: Uncategorized

Extensions to Luna to support userdata as return values from a lua-c call

April 21, 2010
As a helper for Lua/C interop, Luna is pretty small and elegant, without dragging in Boost or Python.
But there seems to be no support for returning userdata (such as a pointer to a C++ class, which will then be used in Lua) from a C function. So we made a small extension.

With a public member function "push_userdata":

    static void push_userdata(lua_State *L, T* pT) {
        userdataType *ud =
            static_cast<userdataType*>(lua_newuserdata(L, sizeof(userdataType)));
        ud->pT = pT;
        luaL_getmetatable(L, T::className);
        lua_setmetatable(L, -2);
    }

and a macro

#define PUSH_USERDATA(ClassName, LuaState, UserDataPtr) Luna<ClassName>::push_userdata(LuaState, UserDataPtr);
#define DECL_SCRIPT_METHOD(FunctionName) int FunctionName(lua_State* L)
#define IMPL_SCRIPT_METHOD(ClassName, FunctionName) int ClassName::FunctionName(lua_State* L)

So we can use that:

//in .h
class B;
class A {
    B* m_pB;
public:
    B* GetB() { return m_pB; }

    DECL_SCRIPT_METHOD( GetB );
};

//in .cpp

IMPL_SCRIPT_METHOD(A, GetB)
{
    // The returned pointer is a B*, so it must be pushed with B's metatable
    // (Luna<B>), not A's. B must be registered with Luna as well.
    PUSH_USERDATA(B, L, this->GetB())
    return 1;
}

It seems to work correctly. A new userdata is created each time the method is called from Lua.
I’m not sure whether this implementation has safety or efficiency problems. Further tests are under way, and I’d welcome your advice 🙂

Categories: Uncategorized