Learned from HPG09 stuff! ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

The HPG09 site has the program,
with slides and links to the ACM papers.

HPG (High Performance Graphics) is a merger of the Graphics Hardware and Interactive Ray Tracing conferences.

The first three are used by OptiX (the NVIDIA people):
1.Spatial Splits in Bounding Volume Hierarchies

A new high-quality, GPU-friendly acceleration structure option in OptiX, recommended for dynamic data.
Faster than kd-tree ray tracing, and faster to construct on the GPU than a kd-tree?

Remember that OptiX also has this very fast GPU BVH builder:
Fast BVH Construction on GPUs

For more OptiX stuff:
A. See the doc folder in OptiX.
B. Overview: http://www.nvidia.com/docs/IO/67191/NVIRT-Overview.pdf
C. Search for the slides and sessions on the HPG program page.
D. Search Google for the NVIRT PDFs.

2.Image Space Gathering
..

3.Understanding the Efficiency of Ray Traversal on GPUs

By Timo Aila.
Fastest ray tracing CUDA kernels to date.
Memory bandwidth and the lack of a cache are not the issue behind ray tracing performance (or the lack thereof).
Adds persistent threads and improves on current GPU implementations.

Two new warp-wide instructions would help:
ENUM (prefix sum): enumerates the threads (inside a warp) for which a condition is true and returns a unique index in [0, M-1] to those threads.
POPC (population count): returns the number of threads for which a condition is true, i.e. M above.
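To make the two definitions concrete, here is a minimal CPU sketch in Python that emulates them over one warp of 32 predicates. The function names `enum` and `popc` and the list-of-booleans representation are my own assumptions for illustration, not the proposed hardware instructions themselves.

```python
WARP_SIZE = 32  # assumed warp width (NVIDIA); an AMD wavefront would be 64


def popc(preds):
    """POPC: number of lanes whose condition is true (the M of the text)."""
    return sum(1 for p in preds if p)


def enum(preds):
    """ENUM: give each lane with a true condition a unique index in [0, M-1].

    Lanes with a false condition get None in this emulation.
    """
    result = []
    next_index = 0
    for p in preds:
        if p:
            result.append(next_index)
            next_index += 1
        else:
            result.append(None)
    return result


# Example: every third lane satisfies the condition.
preds = [lane % 3 == 0 for lane in range(WARP_SIZE)]
m = popc(preds)
indices = enum(preds)
```

With these two values each active lane knows where to write its item in a compacted list (`indices[lane]`) and the warp knows how many items were written (`m`).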

Improvements for ray tracing:
With ENUM + POPC, in the Fairy scene:
Ambient occlusion +40%
Diffuse +80%
Iff not limited by memory speed

Stream Compaction for Deferred Shading

Deferred shading with ubershaders suffers from code divergence.
The paper schedules shading work by condition or by shader type.
The best option uses radix sort, etc.
As future architectures add additional register store
and better switch handling, we expect the uber-kernel approach of
implicit serialization to scale better
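The idea of scheduling pixels by shader type can be sketched on the CPU as follows. This is only an illustration under my own assumptions: it uses a stable counting sort on a small shader id (effectively a single radix pass) as a stand-in for the radix sort mentioned above, and the names `group_by_shader`, `shader_ids` and `num_shaders` are hypothetical.

```python
def group_by_shader(shader_ids, num_shaders):
    """Return (order, starts): pixel indices grouped by shader id.

    order  - pixel indices, all pixels of shader 0 first, then shader 1, ...
    starts - starts[s] is the position in `order` where shader s begins
    """
    # 1. Histogram: how many pixels use each shader.
    counts = [0] * num_shaders
    for s in shader_ids:
        counts[s] += 1

    # 2. Exclusive prefix sum over the (small) histogram.
    starts = []
    total = 0
    for c in counts:
        starts.append(total)
        total += c

    # 3. Scatter each pixel index into its shader's range (stable).
    order = [0] * len(shader_ids)
    cursor = list(starts)
    for pixel, s in enumerate(shader_ids):
        order[cursor[s]] = pixel
        cursor[s] += 1
    return order, starts


# Seven pixels, three shader types.
order, starts = group_by_shader([2, 0, 1, 0, 2, 1, 0], 3)
```

After grouping, each contiguous range of `order` can be shaded by one specialized kernel, so neighbouring threads run the same code path instead of diverging inside an ubershader.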

A Parallel Algorithm for Construction of Uniform Grids

Efficient Stream Compaction on Wide SIMD Many-Core Architectures
Code: http://www.cse.chalmers.se/~billeter/pub/pp/index.html
Presented as a C++-oriented library, similar to CUDPP.

*Avoids explicit construction of a prefix sum with size = input data
*3x speedup over previous approaches
*Presents algorithms for general SIMD widths (CUDA, CAL, Larrabee)
*Presents both prefix sum and pop count based variants
*Presents a CUDA-optimized version that avoids scattered writes by buffering the writes
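A rough CPU sketch of the first bullet, as I understand it: instead of a prefix sum over every input element, count the valid elements per chunk, run the prefix sum only over those per-chunk counts, and let each chunk write its survivors from its own offset. This is not the authors' code; the name `compact`, the `keep` predicate and the `chunk_size` parameter are assumptions for illustration.

```python
from itertools import accumulate


def compact(data, keep, chunk_size=32):
    """Stream compaction with a prefix sum over chunk counts only."""
    # Split the input into chunks (think: one chunk per processor/warp).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Pass 1: pop count of valid elements per chunk.
    counts = [sum(1 for x in chunk if keep(x)) for chunk in chunks]

    # Exclusive prefix sum over len(chunks) values, not len(data) values.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Pass 2: each chunk writes its valid elements starting at its offset.
    out = [None] * sum(counts)
    for chunk, start in zip(chunks, offsets):
        pos = start
        for x in chunk:
            if keep(x):
                out[pos] = x
                pos += 1
    return out


evens = compact(list(range(100)), lambda x: x % 2 == 0, chunk_size=8)
```

The per-chunk loops are independent, which is what makes the scheme parallel; the only global step is the small prefix sum over `counts`.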

Also, I seem to remember that all these things were found in parallel by an Indian group and worked out in
Scalable Split and Gather Primitives for the GPU
which in turn is used for:
Fast Minimum Spanning Tree for Large Graphs on the GPU
i.e. new techniques for avoiding storing the full scan (prefix sum) and for avoiding scattering in the final pass via buffering
they report more or less

CUDA:
The paper says a warp-wide popcount instruction is not present (for a condition evaluated for every element in a warp). It asks for one, as does Understanding the Efficiency of Ray Traversal on GPUs.

That's true: you can't get a 32-bit integer in which each bit is the condition evaluated for one element of a warp (32 bits for a warp, but 64 bits for an AMD wavefront).
But if you could get such an integer, you would have the pop count:
__popc(x) returns the number of bits that are set to 1 in the binary representation
of 32-bit integer parameter x

With compute capability 1.2 and higher you get the vote functions, which are all-or-nothing functions for a condition.
Also with compute capability 1.2, via shared memory atomics, you can calculate the condition for every threadid
and then do an atomic OR into shared memory of condition(threadid) lsh laneid,
where laneid is the thread's index within its warp and lsh means left shift.
In fact, CUDA 3.0 reveals a ballot function, which is perhaps exactly that function: it returns an integer which, used with __popc, gives us the pop count.
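The ballot plus pop count idea can be checked with a small CPU emulation in Python. This is a sketch under assumptions: `ballot`, `popc` and `lane_rank` are my own emulation helpers, not the CUDA intrinsics; they only mimic the idea of a 32-bit mask with one bit per thread of the warp.

```python
WARP_SIZE = 32


def ballot(preds):
    """Pack one condition bit per lane into a 32-bit integer (bit i = lane i)."""
    mask = 0
    for lane, p in enumerate(preds):
        if p:
            mask |= 1 << lane  # the "condition lsh laneid" OR from the text
    return mask


def popc(x):
    """Number of bits set to 1 in the 32-bit integer x."""
    return bin(x & 0xFFFFFFFF).count("1")


def lane_rank(mask, lane):
    """ENUM for one lane: set bits strictly below this lane's bit."""
    return popc(mask & ((1 << lane) - 1))


# Lanes 1, 5, 9, ..., 29 satisfy the condition.
preds = [lane % 4 == 1 for lane in range(WARP_SIZE)]
mask = ballot(preds)
active = popc(mask)          # POPC: how many lanes are true
rank5 = lane_rank(mask, 5)   # ENUM: index of lane 5 among the true lanes
```

So a single mask gives both instructions from the ray traversal paper: the pop count of the whole mask is POPC, and the pop count of the bits below a lane is that lane's ENUM index.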

Speedup vs. CUDPP:

Compaction: 2.9× (compacts 64-bit elements faster than 32-bit ones (2x data))
Radix sort: 15% faster for >500k elements
Prefix sum: 30% faster

radix sort record
Fast Minimum Spanning Tree for Large Graphs on the GPU
This group has interesting things:

Papers:
Fast and Scalable List Ranking on the GPU
Singular Value Decomposition on GPU using CUDA
High Performance Pattern Recognition on GPU
CUDA Cuts: Fast Graph Cuts on the GPU
Accelerating Large Graph Algorithms on the GPU using CUDA

Soft:
http://cvit.iiit.ac.in/index.php?page=resources
Has the CUDA Cuts source and example code for Shader Model 4.0:
Simple Geometry Shader
Simple Transform Feedback
Simple Layered Rendering
Motion Blur with Layered Rendering
Bicubic Patch Subdivision with Geometry Shader
Rendering Geometry Images with Geometry Shader
I have to test these on Catalyst 10.1 with OpenGL 3.2 and geometry shaders (the current geometry shader support has bugs with layers and integer texture fetches).
See related:
Scalable Split and Gather Primitives for the GPU
One more, a thesis:
Scalable Primitives for Data Mapping and Movement on the GPU:
http://cvit.iiit.ac.in/thesis/skpMS2009/

One last thing: the NVIDIA people now do photon mapping in image space, similar to the existing image-space shadows and caustics.

Hardware-Accelerated Global Illumination by Image Space Photon Mapping
Has code based on G3D 8.0.

Efficient Depth Peeling via Bucket Sort
Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu
There is also a CUDA-based short paper with another technique by the same authors at SIGGRAPH:
-“Single Pass Depth Peeling using CUDA Rasterizer” at SIGGRAPH 2009 talks

Data-Parallel Rasterization of Micropolygons With Defocus and Motion Blur
See the post on tessellation and micropolygons.

Scaling of 3D Game Engine Workloads on Modern Multi-GPU Systems
The title could not be clearer.


Sunday, 13 December 2009
