July 2010 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 10 July 2010

Some news!

Posted on 11:16 by Unknown

News:
*GPU Computing Gems 1 (or GPU Gems 4) source code is already available at gpucomputing.net:
The book is due in November.
Available right now:

Title	New	Date
A Programmable Graphics Pipeline in CUDA for Order Independent Transparency	1 new	07-10-2010
High Performance Iterated Function Systems	0 new	07-02-2010
CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm	0 new	07-01-2010
Connected Component Labeling in CUDA - demo+code	0 new	06-30-2010
A Practical Guide to Massively Parallel Monte Carlo Simulations: The Ising Model	0 new	06-30-2010
Parallel LDPC Decoding using CUDA	0 new	06-30-2010
Path Regeneration for Random Walks	0 new	06-30-2010
GPU Gems 4: Deformable Volumetric Registration using B-splines Source Code	0 new	06-30-2010
Monte Carlo Photon Transport on the GPU	0 new	06-30-2010
Lattice-Boltzmann Lighting Models - Source Code	0 new	06-30-2010
RNA folding GPU	0 new	06-30-2010
Haar Classifiers for Object Detection with CUDA: Pixel-parallel processing kernel	0 new	06-29-2010
Multiclass Support Vector Machine	0 new	06-29-2010
Parallelization of the x264 encoder using OpenCL	0 new	06-21-2010
Cone-Beam CT image reconstruction using the Katsevich Algorithm	0 new	06-21-2010
Line forward projection on CUDA	0 new	06-11-2010

It seems MareNostrum is getting a rack of Fermis, perhaps alongside IBM Power7.

Would Nvidia then have to publish a CUDA driver for the PowerPC architecture?

Or it could use PathScale, with a fully open-source compute stack,
available here as a branch of nouveau:

http://github.com/pathscale/pscnv/commits/master

It seems an Nvidia TCC driver supporting Fermi, version 197.81, is on the IBM web site.

A Catalyst 10.8 beta seems to be available; 10.7 is coming on 21/7.

PhysX 3.0 is coming with CPU improvements:
*auto threading
*SSE enabled by default
Mafia has new runtimes: NVIDIA PhysX driver 10.04.02_9.10.0522.
Mueller has posted the paper on the Fermi launch demo, which uses water height fields plus particles.
Two other interesting papers from Nvidia research are:

HLBVH: Hierarchical LBVH Construction for Real-Time Ray Tracing
PantaRay: Fast Ray-traced Occlusion Caching of Massive Scenes

A Hwu-based course from Stanford:
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Two interesting conference programs are available:

PACT
has an Intel GPU paper, demystifying ..
and also Revisiting Sorting for GPGPU Stream Architectures,
which achieves nearly 500 Mkeys/s on GT200.

There is also a workshop on GPUs:
http://informatik.technikum-wien.at/gpusca/
but the web site doesn't work.

The Nineteenth International Conference on
Parallel Architectures and Compilation Techniques (PACT)
Vienna, Austria, September 11-15, 2010

Interesting papers:
Scalable Thread Scheduling and Global Power Management for Heterogeneous Many-Core Architectures
Dynamically Managed Multithreaded Reconfigurable Architectures for Chip Multiprocessors
WAYPOINT: Scaling Coherence to Thousand-core Architectures
Scalable Hardware Support for Conditional Parallelization
Less is More: Trading off Work-Efficiency for Scalability in Irregular Programs
Revisiting Sorting for GPGPU Stream Architectures
D. Merrill, A. Grimshaw
An Integer Programming Framework for Optimizing Shared Memory Use on GPUs
W. Ma, G. Agrawal
DMATiler: Revisiting Loop Tiling for Direct Memory Access
A Software-SVM-based Transactional Memory for Multicore Accelerator Architectures with Local Memory
Automatic Vector Instruction Selection for Dynamic Compilation
An OpenCL Framework for Heterogeneous Multicores with Local Memory

SC10

I would like to review these papers:
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
Parallel Fast Gauss Transform
Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers
The Multi-Scale Heart Simulation on Massively Parallel Computers
Using 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
Exploiting 162-Nanosecond End-to-End Communication Latency on Anton
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Scalable Graph Exploration on Multicore Processors
The 48-core SCC processor: the programmer’s view
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture
Reducing Multicore Bandwidth Requirements for Combinatorial Multigrid
Diagnosis, Tuning and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Scaling Hierarchical N-Body Simulations on GPU Clusters
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches


Sunday, 4 July 2010

DirectCompute Double precision Mandelbrot demo and more..

Posted on 19:41 by Unknown

In addition to the first demo using double precision on GL 4.0, here it is now on DirectCompute:
THIS DEMO NEEDS THE DX JUNE 2010 RUNTIMES,
so update if needed.

On AMD this test exposes an ATI DirectCompute DPFP bug: the rendering is incorrect.

Also note that I learned DirectCompute doesn't accept division with doubles, so I had to replace /2 with *0.5.
Nvidia Fermi works OK!
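The /2 to *0.5 swap mentioned above should not change any pixel: 0.5 is a power of two, so in IEEE 754 arithmetic both forms round the same exact value. Here is a minimal Python sketch that checks this bit for bit (Python floats are doubles; the helper names and sample values are made up for illustration and are not taken from the demo source):

```python
import random
import struct

def bits(x: float) -> bytes:
    """Raw IEEE 754 double encoding of x, for bit-exact comparison."""
    return struct.pack("<d", x)

def halves_match(x: float) -> bool:
    """True when x / 2 and x * 0.5 produce identical doubles."""
    return bits(x / 2.0) == bits(x * 0.5)

random.seed(0)  # arbitrary seed, only for reproducibility
samples = [random.uniform(-4.0, 4.0) for _ in range(10000)]
samples += [0.0, 1e-308, 5e-324, 1.7e308]  # include subnormal and huge values
all_equal = all(halves_match(x) for x in samples)
print(all_equal)
```

The same reasoning does not carry over to divisors that are not powers of two, where multiplying by a reciprocal can differ in the last bit.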
DirectCompute double precision Mandelbrot (includes source, based almost 100% on the Voxilla demo):

Run test.bat: the app starts at a deep zoom so it shows DP in action. If you exit with Esc, it then shows the same rendering in SPFP. Note that you can zoom in and out with the mouse.

The .bat file calls mandel.exe 0 for SP or mandel.exe 1 for DP.
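To see why the deep zoom needs DP, here is a small Python sketch that counts how many distinct x coordinates survive across one row of pixels in single versus double precision. The view parameters (center, pixel spacing, width) are hypothetical, not the ones the demo uses, and single precision is emulated by rounding through `struct`:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def distinct_columns(center: float, pixel_step: float, width: int, single: bool) -> int:
    """Count how many distinct x coordinates survive across one row of pixels."""
    xs = set()
    for i in range(width):
        x = center + (i - width // 2) * pixel_step
        xs.add(to_f32(x) if single else x)
    return len(xs)

# Hypothetical view: 512 pixels wide, centered at x = -0.75, pixel spacing 1e-10.
center, step, width = -0.75, 1e-10, 512
sp = distinct_columns(center, step, width, single=True)
dp = distinct_columns(center, step, width, single=False)
print(sp, dp)
```

With these numbers the whole single-precision row collapses onto one coordinate, while double precision keeps all 512 columns apart, which is the kind of blockiness the SP pass shows at that zoom.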

Also note that I expected better performance from AMD than from Nvidia, but both run very slowly. Nvidia runs at its full speed (i.e. capped 8x vs Teslas), but AMD has performance issues, as it should run at least 3-4x faster than Nvidia Fermi.
On AMD the vector mode also runs somewhat faster than the scalar shader, but not by much. It could run up to 4x faster if the compiler extracted nothing from the scalar code, yet the gain is small compared with SP, where vector code outperforms scalar code by a larger margin. Fermi performance is unaffected by using vector code.

Correct behavior:

Double precision (on GTX 470)


Single precision

On an AMD 5850, DP renders as (I will post the image soon):

Relatedly, I also patched the Nvidia PhysX demo to work on AMD by changing the GLSL code that used non-standard Cg functions. It exhibits some OpenGL bugs.

Instructions:

Download the Nvidia PhysX demo here (select FLUIDS: TECHNOLOGY DEMO)

and use this executable to run it on AMD cards (extract it into the demo directory).
On an AMD card it shows artifacts, not in the rendering itself but on the desktop outside the program window.
On an AMD 5850, the bad DP rendering looks like this (I will post the image soon):


A lot of things you probably don't know.. and a worth it..

Posted on 12:06 by Unknown

*TCC support for GF100 products will be out next week. These drivers will also add support for running simultaneously with the normal graphics drivers (those that support OGL, DX, DXVA, etc.). I suspect the graphics and TCC drivers will have to be the same version, as both write DLLs into the Windows system directory.
I hope the inf trick still works so I can enable it on GeForce Fermi, and that it also works with Nsight. Anyway, it is not critical, as the 25x drivers seem to add support for CUDA cards (even GeForces) without extending the desktop onto them, so kernel execution time need not be limited by TDR. Before, this required two Nvidia cards, one of which could be left without the desktop extended. If you used, say, an ATI card plus an Nvidia card without the desktop extended onto it (for example to use Nsight, which requires no extended desktop), it would fail because CUDA would not find a CUDA card.
*There is support for Fermi on Mac OS right now in the Nvidia 19.5.8f03 drivers, released a month ago but without being reported anywhere; they include NVDAGF100HAL.kext.
Anyway, only OGL works, as neither CUDA nor OCL uses it.
I have to use the NVloader injector, which anyway doesn't work with Fermi in 64-bit kernel mode. Note that a GF 275 does work in 64-bit mode with this injector.
Note that I wanted to fix this, and all I found was a cuGetExportTable and something like MacCompatibiltyTID used by a checkcompatibility executable; perhaps patching that would make it work.
Someone in the Nvidia forums suggested that the broken OCL could be fixed by creating an OGL context before searching for OCL devices (oclgetdevice), but this trick didn't work.
*Storing ELF binaries instead of CUBIN rules out the use of decuda; hopefully one very interesting solution is..

*Judging by the MAGMA webinar, there seems to be a big release planned for SC2010 with some major features; check the MAGMA presentation for what to expect.
*PhysX 3.0 is nearing launch, as the PhysX Visual Debugger release notes say it includes support for it.

Note that this brings concurrent kernel support on Fermi for improved performance in physics simulations. Hopefully it also includes the wrinkle meshes feature studied by Mueller.

Note also that the GPU AI notes say it will use function pointers once CUDA supports them, so expect a new release at some point that is optimized even more for Fermi.

It will probably be announced at Siggraph, even if it launches later.

I also hope to see APEX shipping for more than big AAA games, i.e. downloadable by everyone.

Lastly, I expect OptiX 2.0 and Cg 3.0 final at Siggraph. Let's also see whether OpenRL arrives in time with OpenCL support for GPUs, which would be interesting for ATI. Note also that Luxrender GPU 1.6 brings Stochastic Photon Mapping and uses OCL on ATI GPUs too.

*Nsight is also moving fast: from beta in early June, it is now at RC state. Launching at Siggraph?

*ATI doubles on DirectCompute are broken, although the feature flag is reported as supported.

Now we can test it with the June DX compiler; before, it was broken for doubles inside control flow (loops, if, etc.).
Compiling mostly works, but the rendering shows issues compared with Fermi, which supports doubles nicely.
Download my code (coming soon).
*The ATI GLSL driver is somewhat broken, at least for geometry shaders it seems. I fixed the Nvidia PhysX fluid demo to use non-Cg code in its GLSL, plus another fix related to point rendering, and it now seems to work, but not without instabilities that appear as noise on screen, even outside the window it fills.
Download and test (coming soon).
Also, the GLSL driver doesn't implement fetching integer textures with integer coordinates (texel2Dfetch( itex)).
*CUDA 3.1 ships with three interesting examples. One is oclTridiagonal, a fast tridiagonal solver, interesting for a cinematic DoF renderer as in Metro, using OCL/OGL.
Another is oclCopyComputeOverlap, which shows two things. One is that concurrent copy and kernel execution is possible in OCL via command queues. The other is that there is an issue in the 25x drivers that prevents full scaling: I think a good result is 30% faster code, and I obtain 20% on the 25x drivers, whereas on the 197 drivers I obtain 30%.
Note that both the ATI and Apple platforms, the latter even with Nvidia GPUs, exhibit no scaling or even negative scaling (-15%).
The good news is that the issue is fixed in the 258.19 OCL 1.1 preview drivers, which report CUDA 3.2, so I get the 30% overlap back. Note that other 258 drivers don't work (as they report the older CUDA 3.1 and OCL 1.0).
One more interesting thing is that the dual DMA engine is supposed to work in OCL too, so the overlap would be 50%. It seems restricted to Tesla, but Nvidia has been less forthcoming about this than about the double-precision cap on GeForce.
Luckily I have a trick for you: the 197.44 driver seems to support the dual DMA engine on GeForce Fermi too!
This is the OGL 4.0 driver, so all you lose relative to the current 256 drivers is the CUDA 3.1 features. On Linux, also use the OGL 4.0 driver from developer.nvidia.com and you have it.
Note also that 197.75 etc. don't work; it only works with this one.
*So it seems the dual DMA engine is broken/disabled on GeForce Fermi for no reason other than an economic one.
*CUDA simpleStreams seems to show broken streams on Fermi, but that is due to not sending enough work: a simple fix.
*Matmul by Lschien is one of the fastest for CUDA, but it currently fails on Fermi because it uses cubins obtained by modifying Tesla asm via decuda/cudaasm. Thankfully, this seems related to the volatile keyword not working correctly before CUDA 3.0. The author suggests a fix, assuming this works, that uses CUDA variant 6. I have tested it and it works, so it's fixed: I obtain nearly 850 Gflops on a Fermi 470 at 1650 MHz.
*A lot of software has been updated to CUDA 3.x, even 3.1, right now: NPP 3.1, CULA 2.0, JACKET 1.4, OpenMM 2.0 on Zephyr SVN, Gromacs 4.5 beta, GMAC, etc.
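To put the copy/compute overlap percentages above in context, here is a back-of-the-envelope Python model of why a second DMA engine can raise the ceiling. It is only a steady-state bound under my own assumptions: the function name and the per-chunk timings are hypothetical, pipeline fill and drain are ignored, and it is not how oclCopyComputeOverlap measures its result.

```python
def steady_state_saving(h2d: float, kernel: float, d2h: float, copy_engines: int) -> float:
    """Fraction of time saved per chunk versus fully serial execution.

    Steady-state bound only: pipeline fill and drain are ignored.
    """
    serial = h2d + kernel + d2h
    if copy_engines == 0:
        overlapped = serial                  # nothing runs concurrently
    elif copy_engines == 1:
        overlapped = max(h2d + d2h, kernel)  # both copy directions share one engine
    else:
        overlapped = max(h2d, kernel, d2h)   # uploads and downloads run in parallel
    return 1.0 - overlapped / serial

# Hypothetical per-chunk timings in milliseconds, not measurements.
for engines in (0, 1, 2):
    print(engines, round(steady_state_saving(3.0, 4.0, 3.0, engines), 2))
```

In this toy case the single engine is limited by the two copies queuing behind each other, while the dual engine is limited only by the kernel, so the saving grows even though no individual operation got faster.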

More news:

Nvidia has also released a lot of drivers on the 256 branch; let's look at the rough differences/progression:
197.44 first OGL 4.0 driver, and the only one supporting the dual DMA engine on Fermi boards other than Tesla/Quadro; also has no issues with single DMA
256 adds CUDA 3.1; currently all of them have issues with concurrent copy and kernel execution on Fermi, at least in OCL
257.15 Blu-ray 3D
257.19 Nsight June beta driver
257.21 WHQL (supports Nsight)
257.29 ION support: accelerated DXVA Flash with PCIe x1 devices
258.18 OCL 1.1 beta (reports CUDA 3.2!); fixes the oclCopyCompute issues (but single DMA on Fermi)
258.48 first to support Quadro Fermis

258.69 ships with 3D Vision Surround (Nvidia nTersect says YouTube 3D support is coming soon; I also hope they add windows DX 3D Vision support soon)

Some other striking news :-) are:
*OpenCurrent 1.1 ships with CUDA 3.0 and multi-GPU code.
Well, I have been testing it with CUDA 3.1 because I have Ubuntu 9.10, and with CUDA 3.1 GCC 4.4 works OK (so Ubuntu 10.04 is fine too). It has some issue related to CUDA now supporting true functions: I think I must add static to a function, as the porting guide in the CUDA 3.1 release notes says. With CUDA 3.0, GCC 4.4 doesn't work, so I will have to check on Ubuntu 9.04 if I can't fix it.
*An OpenMP-to-CUDA compiler is available in Cetus 1.2.
*PGI 10.6 is available, with at least integer support in kernels and VS 2010 support.

I have tested GATLAS and it is good: at least 260 Gflops on a GTX 275. I tested on Mac, so it works at least on Linux and Mac without much effort. The author says that with a 5870 and Stream 2.1 some image kernels achieve 1.3 Tflops, so OpenCL gets similar to the cal++ matmul! I have to test, or modify the code (?), for double-precision testing.

Some tricks and work to do:

RAW DATA:
I know it's lame, but at least you can emulate 3D image writes in CUDA with surfaces using PTX 3D tricks (I'll post later).
I have to post a CUVID sample for Mac.
simpleStreams in CUDA looks bad on Fermi; the forums say to increase the work to 500.
Matmul: Chien says to add volatile and check (works!)
BSGP Fermi support: checking by mail with the author.
Test the ATI sparse matrix code on Fermi.
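For the matmul note, a quick sanity check of the 850 Gflops figure quoted earlier in this post, as a Python sketch. The 2*n^3 flop count is the usual convention for a dense multiply. The 448-core count for a GTX 470 and the choice of single precision are my assumptions; 1650 MHz is the clock from the post.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Achieved GFLOP/s for an n x n by n x n multiply (2*n^3 flops)."""
    return 2.0 * n**3 / seconds / 1e9

def peak_sp_gflops(cores: int, shader_mhz: float) -> float:
    """Peak single-precision GFLOP/s counting one FMA (2 flops) per core per clock."""
    return cores * 2.0 * shader_mhz / 1e3

# Assumption: 448 CUDA cores for a GTX 470; 1650 MHz is the clock quoted in the post.
peak = peak_sp_gflops(448, 1650.0)
efficiency = 850.0 / peak
print(round(peak, 1), round(efficiency, 2))
```

Under those assumptions the kernel would sit at a bit under 60% of the FMA peak; `matmul_gflops` is there to turn your own timings into the same unit.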

See Fermi benchmarks:
Nvidia benchmarks in blog
OpenVIDIA benchmarks
CULA blog
Jacket blog
Some papers from the HPG 2010 presentations: Billeter (scattering) and McGuire (AOV).
It also seems code for rasterization and colored stochastic shadow maps is coming soon.


Saturday, 3 July 2010

ATI Stream SDK roadmap

Posted on 06:40 by Unknown

I have found a roadmap for the ATI Stream SDK through the end of the year:

DISCLAIMER: It's on the Internet and I found it with some luck; no NDA was broken.

Let's talk about it.
Currently AMD OpenCL lacks:
*working OpenGL interop: image interop has issues (for example, copying a buffer to an image doesn't work when the image is an acquired OpenGL texture)
*exposing multiple-component images (other than RGBA)
*DX interop
*exposing all graphics memory (currently 128-256 MB)
*Catalyst integration

Stream SDK 2.2 adds:
*OCL 1.1 (3-component vectors are part of it, and OCL 1.1 image support means multiple-component images (r, rg, rgb))
*DX10 interop (only that, it seems: no DX9 or DX11 as Nvidia has)
*mem fences that don't generate unneeded barrier ISA instructions
*append buffers (what about the GDS extension as well?)
*it seems OCL 1.1 atomics are nothing new? Offline compilation goes from preview to final, and DPFP adds FMA, as the other operations are supported now (?)
DPFP FMA should allow peak-test kernels in benchmarks to show high numbers, near 400-500 Gflop/s.
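As a rough check on that 400-500 Gflop/s range, a Python sketch of the peak arithmetic. It assumes one DP FMA (2 flops) per 5-wide VLIW unit per clock, and the board specs are my own assumptions, not part of the roadmap: 1440 stream processors at 725 MHz for an HD 5850 and 1600 at 850 MHz for an HD 5870.

```python
def peak_dp_fma_gflops(stream_processors: int, engine_mhz: float) -> float:
    """Peak DP GFLOP/s assuming one DP FMA (2 flops) per 5-wide VLIW unit per clock."""
    vliw_units = stream_processors // 5
    return vliw_units * 2.0 * engine_mhz / 1e3

# Assumed board specs (not from the roadmap): HD 5850 = 1440 SPs at 725 MHz,
# HD 5870 = 1600 SPs at 850 MHz.
hd5850 = peak_dp_fma_gflops(1440, 725.0)
hd5870 = peak_dp_fma_gflops(1600, 850.0)
print(round(hd5850, 1), round(hd5870, 1))
```

Real kernels will land below these figures; the point is only that an FMA-based peak test on these boards falls in roughly the range mentioned.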

Much more interesting is 2.3:
*In-process compilation of OpenCL kernels means no shipping of LLVM compilers (llc, etc.), and hopefully means it will be integrated into atiocl.dll so OpenCL can ship built into Catalyst 10.12.
*Library models
*C++ template support in kernels (I hope this means you can at least specify kernel arguments depending on a template argument, for example to support double and float kernels with one source, similar to CUDA)
*Adds trig DPFP routines (but still no complete DPFP support, it seems, which is horrible given that Nvidia has been shipping it since October 2009 and AMD has said since the end of 2009 that support would come gradually)
The most interesting are the last three:
*FFT library: why not a BLAS library too? I suspect it is OCL based, as DirectCompute has its own FFT library.
Is it also going to be part of ACML? Currently matmul in ACML-GPU is CAL based.
I expect it to be at least a binary-only library for Win and Lin, so for Mac I hope we can somehow extract the OpenCL kernels, or create a wrapper around it and use Wine or something like it, to check that performance on AMD boards on Mac is correct.
*OpenPhysics: well, at least something to play with. I expect cloth, soft body and SPH particle support in OpenCL and/or DirectCompute. On the Bullet site there is a preliminary executable with a cloth demo, and an AMD employee talking about the state of soft body support (http://code.google.com/p/bullet/issues/detail?id=390#c3). It seems that since last week we also have DirectCompute and OpenCL code for both cloth and soft body in trunk.
Also, by September we will have DMM 2.0, as announced at GDC, which brings some OpenCL love to this rigid body + fracture simulator.
*OpenDecode UVD: well, a CUVID/VDPAU-style library for AMD boards. Nvidia has put a lot of love into GPU video decoding and its interop with CUDA/OpenGL, with CUVID for Win and Mac and VDPAU for Linux.
Since the 256 drivers, VDPAU has efficient OpenGL and CUDA interop. CUVID has efficient CUDA interop by definition and fast OpenGL/DX interop on Windows. CUVID for Mac only seems good for feeding data to CUDA, as OpenGL interop on Mac is slow right now (and always has been).
I expect this to bring fast interop with OpenCL on Win and Lin, adding to the DXVA DX interop on Win and AMD XvBA on Linux, whose VAAPI wrapper seems to provide fast OGL interop.
So Mac seems left out, but I hope the recent video acceleration API in 10.6.3 supports AMD 5xxx cards when they are released, and that VC1 support is added in addition to H.264. I think this provides a fast path to OpenGL textures, so, as OpenCL/OpenGL interop is fast on Apple, it also provides OpenCL interop on that platform.
Another question is whether dual-stream acceleration will be exposed and supported. On Nvidia I think DXVA, CUVID and VDPAU all expose it, with a GTX 470 at least.
Also related: Catalyst 10.7 has improved support for VLC 1.1.1 DXVA decoding on AMD cards, which I presume means the fast path for sending frames from GPU to CPU now works.
Remember also that last month Nvidia released an ION driver (257.29) improving DXVA performance on ION with PCIe x1, as Flash requires a GPU->CPU->GPU round trip.

What's left after OCL 1.1 and Stream SDK 2.3:
Well, I expect the Global Data Share and shared registers extensions, 3D image writes, true complete DPFP support (cl_khr_fp64), complete BLAS and FFT libraries (like CUBLAS and CUFFT in CUDA), working pinned memory, an extension for host memory accessible from the GPU, gather4 instructions for image support in OpenCL, and working concurrent kernel execution and memory transfers (i.e. concurrency in the oclCopyCompute CUDA 3.1 example >=20%).

