GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Sunday, 13 December 2009

Learned from HPG09 stuff!

Posted on 09:19 by Unknown
HPG09 site with program:
there are slides and links to the ACM papers..

HPG is a merger of the Graphics Hardware and Interactive Ray Tracing events..

The first three are used by OptiX (Nvidia people):
1.Spatial Splits in Bounding Volume Hierarchies

New high-quality, GPU-friendly acceleration structure option in OptiX, recommended for dynamic data.
Faster than kd-tree ray tracing, and maybe faster to construct on the GPU than a kd-tree?..

Remember OptiX also has this very fast GPU BVH builder:
Fast BVH Construction on GPUs

For more Optix stuff:
A. see the doc folder of OptiX..
B. Overview: http://www.nvidia.com/docs/IO/67191/NVIRT-Overview.pdf
C. search the slides and sessions in the HPG program page..
D. search Google for the NVIRT pdf

2.Image Space Gathering
..

3.Understanding the Efficiency of Ray Traversal on GPUs

Timo Aila
Fastest ray tracing CUDA kernels to date..
Memory bandwidth and the lack of caches are not the issue for ray tracing perf (or the lack thereof)..
Adds persistent threads and improves on current GPU implementations..

2 new warp-wide instructions would help:
ENUM (prefix sum): enumerates the threads (inside a warp) for which a condition
is true and returns a unique index in [0, M-1] to those threads
POPC (population count): returns the number of threads for which a condition
is true, i.e. M above
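
To make the two instructions concrete, here is a minimal CPU-side sketch in Python of what they would compute for one warp. It only mirrors the semantics described above and is not the hardware proposal itself; the function names popc and enum and the 8-lane warp are assumptions for illustration (real Nvidia warps have 32 lanes).

```python
def popc(predicates):
    """POPC: number of lanes in the warp whose condition is true (M)."""
    return sum(1 for p in predicates if p)


def enum(predicates):
    """ENUM: give each lane whose condition is true a unique index in [0, M-1].

    Lanes whose condition is false get None. The index is the exclusive
    prefix sum of the predicate bits.
    """
    indices = []
    running = 0
    for p in predicates:
        if p:
            indices.append(running)
            running += 1
        else:
            indices.append(None)
    return indices


# Hypothetical 8-lane warp, one condition value per lane.
warp_predicates = [True, False, True, True, False, False, True, False]
print(popc(warp_predicates))  # 4
print(enum(warp_predicates))  # [0, None, 1, 2, None, None, 3, None]
```

With both values, the lanes that pass the condition can write to a compact range of M slots without any serialization inside the warp.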

Improvements for raytracing:
With ENUM + POPC, in Fairy scene
Ambient occlusion +40%
Diffuse +80%
Iff not limited by memory speed

Stream Compaction for Deferred Shading

Deferred shading with ubershaders suffers from code divergence..
The paper schedules (groups) pixels by condition or shader type..
The best option uses radix sort, etc..
As future architectures add additional register store
and better switch handling, we expect the uber-kernel approach of
implicit serialization to scale better
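
As a rough illustration of grouping pixels by shader type, the following Python sketch does one stable counting-sort pass (the building block of a radix sort) over per-pixel shader ids. It is a CPU-side sketch under my own assumptions, not the paper's implementation; group_by_shader and the sample ids are made up for the example.

```python
def group_by_shader(shader_ids, num_shaders):
    """Return (order, starts): pixel indices grouped by shader id.

    order  : pixel indices, all pixels of shader 0 first, then shader 1, ...
    starts : starts[s] is where the run of shader s begins inside order.
    """
    # Histogram: how many pixels use each shader.
    counts = [0] * num_shaders
    for s in shader_ids:
        counts[s] += 1

    # Exclusive prefix sum of the histogram gives the start of each run.
    starts = []
    total = 0
    for c in counts:
        starts.append(total)
        total += c

    # Stable scatter: pixels keep their relative order inside a run.
    order = [0] * len(shader_ids)
    cursor = list(starts)
    for pixel, s in enumerate(shader_ids):
        order[cursor[s]] = pixel
        cursor[s] += 1
    return order, starts


# Hypothetical shader id per pixel (3 shader types, 8 pixels).
shader_ids = [2, 0, 1, 0, 2, 2, 1, 0]
order, starts = group_by_shader(shader_ids, 3)
print(order)   # [1, 3, 7, 2, 6, 0, 4, 5]
print(starts)  # [0, 3, 5]
```

Each contiguous run can then be shaded by one specialized kernel launch, so threads in a warp take the same code path.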

A Parallel Algorithm for Construction of Uniform Grids


Efficient Stream Compaction on Wide SIMD Many-Core Architectures

Code: http://www.cse.chalmers.se/~billeter/pub/pp/index.html
Presented as a C++-oriented library, like CUDPP..

* Avoids explicitly building a prefix sum as large as the input data
* 3x speedup over previous approaches
* Presents algorithms for general SIMD widths (CUDA, CAL, Larrabee)
* Presents both prefix sum based and pop count based variants..
* Presents a CUDA-optimized version that avoids scattered writes by buffering the writes
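
The idea of not building a prefix sum as large as the input can be sketched on the CPU as three phases: count per chunk, scan only the per-chunk counts, then compact each chunk to its offset. This is my reading of the points above written as plain Python, not the authors' code; compact_chunked, chunk_size and the sample data are assumptions for illustration.

```python
def compact_chunked(data, keep, chunk_size=4):
    """Stream compaction with a prefix sum over chunk counts only."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Phase 1: each chunk (think: one processor) counts its valid elements.
    counts = [sum(1 for x in chunk if keep(x)) for chunk in chunks]

    # Phase 2: exclusive prefix sum over the counts, one entry per chunk,
    # so the scan is much smaller than the input.
    offsets = []
    total = 0
    for c in counts:
        offsets.append(total)
        total += c

    # Phase 3: each chunk writes its valid elements starting at its offset.
    out = [None] * total
    for chunk, start in zip(chunks, offsets):
        pos = start
        for x in chunk:
            if keep(x):
                out[pos] = x
                pos += 1
    return out


values = [3, 0, 7, 0, 0, 5, 0, 1, 9, 0]
print(compact_chunked(values, lambda x: x != 0))  # [3, 7, 5, 1, 9]
```

On a GPU, phases 1 and 3 would run in parallel per chunk, and the writes of phase 3 are the ones the CUDA version buffers to avoid scattering.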

Also, I seem to remember that these same ideas were found in parallel by a group in India and worked out in
Scalable Split and Gather Primitives for the GPU
which in turn is used for:
Fast Minimum Spanning Tree for Large Graphs on the GPU
i.e. new techniques that avoid storing the full scan (prefix sum) and avoid scattering in the final pass via buffering;
that is more or less what they report..


CUDA:
The paper says a warp-wide popcount instruction (for a condition evaluated for every thread in a warp) is not present. It is the same need as in Understanding the Efficiency of Ray Traversal on GPUs.

That's true: you can't get a 32-bit integer in which every bit is the condition evaluated for one thread of the warp (32 bits for an Nvidia warp, but an AMD wavefront would need 64 bits)..
But if you could get that integer, you would have pop count:
__popc(x) returns the number of bits that are set to 1 in the binary representation
of 32-bit integer parameter x

With compute capability 1.2 and higher you get vote functions, i.e. all-or-nothing (all/any) functions for a condition..
Also with compute capability 1.2, via shared mem atomics, each thread can evaluate the condition
and then do an atomic OR in shared mem of condition(threadid) lsh laneid
(lsh means left shift, laneid is the thread's index inside its warp)..
Actually CUDA 3.0 reveals ballot, which perhaps is that function for returning the integer; used with popc we have pop count..
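
The bit trick above can be checked on the CPU with a small Python sketch: OR each lane's condition into a 32-bit mask, then use a population count on the whole mask (the count) or on the bits below a lane (that lane's index). The names ballot, popc and lane_index are only illustrative stand-ins for this sketch; the loop is serial here, while on a GPU the OR would be an atomic or a single ballot.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Build the warp mask: bit `lane` is set when that lane's condition is true."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane  # the OR of (condition lsh laneid) from the text
    return mask


def popc(x):
    """Number of bits set to 1 in x."""
    return bin(x).count("1")


def lane_index(mask, lane):
    """Index of `lane` among the lanes with the condition true (bits below it)."""
    return popc(mask & ((1 << lane) - 1))


# Hypothetical condition: true for every third lane (0, 3, ..., 30).
predicates = [lane % 3 == 0 for lane in range(WARP_SIZE)]
mask = ballot(predicates)
print(hex(mask))            # 0x49249249
print(popc(mask))           # 11
print(lane_index(mask, 6))  # 2
```

So one mask per warp is enough to get both the count and the per-thread index, which is what makes ballot plus popc interesting for compaction.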

speedup vs. CUDPP

compaction 2.9× (compacts 64bit elems faster than 32bit (2x data))
Radix Sort 15% faster for >500k elems
Prefix Sum ‐ 30% faster

radix sort record
Fast Minimum Spanning Tree for Large Graphs on the GPU
This group has interesting things:

Papers:
Fast and Scalable List Ranking on the GPU
Singular Value Decomposition on GPU using CUDA
High Performance Pattern Recognition on GPU
CUDA Cuts: Fast Graph Cuts on the GPU
Accelerating Large Graph Algorithms on the GPU using CUDA

Soft:
http://cvit.iiit.ac.in/index.php?page=resources
Has the CUDA Cuts source and example codes for Shader Model 4.0:
Simple Geometry Shader
Simple Transform Feedback
Simple Layered Rendering
Motion Blur with Layered Rendering
Bicubic Patch Subdivision with Geometry Shader
Rendering Geometry Images with Geometry Shader
I have to test these on Catalyst 10.1 with OpenGL 3.2 and geometry shaders (the current geometry shader support has bugs with layers and integer texture fetches..)
See related:
Scalable Split and Gather Primitives for the GPU
One more, a thesis:
Scalable Primitives for Data Mapping and Movement on the GPU:
http://cvit.iiit.ac.in/thesis/skpMS2009/

Last thing: Nvidia people now do photon mapping in image space, similar to the existing image space shadows and caustics..

Hardware-Accelerated Global Illumination by Image Space Photon Mapping
Has code based on G3D 8.0

Efficient Depth Peeling via Bucket Sort
Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu
CUDA based. There is also a short paper with another technique by the same authors at SIGGRAPH:
"Single Pass Depth Peeling using CUDA Rasterizer" at SIGGRAPH 2009 talks

Data-Parallel Rasterization of Micropolygons With Defocus and Motion Blur
See the post on tessellation and micropolygons..

Scaling of 3D Game Engine Workloads on Modern Multi-GPU Systems
The title could not be clearer..

