GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Sunday, 8 November 2009

More than 10 places where DX Compute 5.0 is better than OpenCL!

Posted on 05:30 by Unknown
Sorry, xbitlabs, but I don't think so..
Please understand that most of my thoughts are based on using one API versus the other today or in the near future, and
on some common sense about the companies involved..

The whole point is that, comparing bare OpenCL 1.0 with bare DX Compute, DX Compute wins over OpenCL in functionality and richness of features. Also, because of the potentially broad market and its use in next-gen blockbuster games, vendors will spend more time optimizing DX Compute drivers, at least in the early days, which is today. This is already true: ATI ships phenomenal DX Compute drivers for the 5xxx series.
Nvidia has made a great effort with CS 4.1, which in any case is still richer than OpenCL 1.0.
Also, as with OpenGL vs DX development today, OpenGL has the mad situation of needing multiple rendering paths depending on the available extensions if you want to use the latest features, while DX has almost no cap bits.
In DX Compute you can not only use everything that is available (which is more than OpenCL offers by default), it is also expected to be fast; see point 5.

Let's see why..

OpenCL, being targeted at HPC and scientific applications and being pushed forward by a lot of companies, does not enable by default a lot of functionality that ships in old CUDA hardware (I mean atomics, for example).
This is so that a lot of vendors can claim compliance: CPUs, GPUs (also S3), the PS3 Cell (embedded profile), PowerVR SGX chips (embedded profile)..
Also, as there are non-GPU implementations too, a lot of the graphics stuff is optional.


Anyway, with current extensions OpenCL regains a lot of the functionality found not only in DX Compute but also in CUDA.
As always (I mean, as with its OpenGL drivers), Nvidia has done good work: today the 195 drivers for both Linux and Windows provide OpenCL with good performance and also atomics, byte-addressable memory, graphics interop and double-precision support.

Anyway, don't get pessimistic: the situation is similar to OpenGL today. There are major vendors shipping mediocre drivers (hi, Intel), but if you stay with Nvidia (OK, AMD does quite well today too) you get OpenGL 3.2/3.1 drivers and a lot of other extensions, and that lets Nvidia ship functionality almost equivalent to DirectX 10.1 today. What about DirectX 11? Well, Nvidia has some functionality cooking in the 195 drivers, and I don't doubt that on the day Fermi is released they will have a lot of extensions bringing OpenGL to parity with Direct3D 11. That will reach Linux users too. In fact Nvidia has always done this, to the point that the majority of the GeForce 8800 launch demos were OpenGL ones using their proprietary DX10-level OpenGL extensions (which, sooner or later and with some makeup, get into ARB or the GL core functionality).

Details next:

1. Graphics interop
=============
I mean interop with graphics APIs: in DX Compute it is built in, with no extension needed (OpenCL needs cl_khr_gl_sharing), and both vendors ship it today.
In OpenCL only Nvidia ships it; ATI is working on it.
Also, while it is supposed to work without copies in both APIs, I believe that may not really hold in some early OpenCL drivers. DX Compute, I feel, is well implemented by both vendors today anyway.
The bad part for DX Compute is that it only has DX interop, while OpenCL gets OpenGL interop and will get DX interop (at least supposedly on ATI).


2. Image support builtin
================

Using textures from CS is not an extension, and both vendors ship it today.
In OpenCL only Nvidia ships it today; ATI is working on it.

3. CS (5?) can write to backbuffer directly
=========================
See Voxilla's Mandelbrot demo, for example.
The best OpenCL can do is render to a renderbuffer or texture attached as a color buffer of an FBO using OpenGL interop, which is an extension.
It then still needs a copy to the backbuffer with a framebuffer blit (glBlitFramebuffer).

4. DX Compute has local and global atomics and byte-addressable mem (RWByteAddressBuffer) by default (CS 5.0)
================================================================
Both vendors ship these today (well, Nvidia without atomics, because they aren't in CS 4.1).
In OpenCL only Nvidia has them today.
ATI is working on it.

5. Local mem is always fast hardware local mem (shared mem on Nvidia, LDS on ATI)
=================================
CS 5.0 exposes general read/write access to local memory, as DX11 hardware has it (Nvidia has had it since G80, but ATI does not on the 4xxx).
CS 4.1 exposes a limited write ability similar to what the 4xxx offers, so the 4xxx can be used at full speed in those specific cases.
In OpenCL on the 4xxx, local memory is emulated using global memory, so programs can end up, say, 10-20 times slower than on the 5xxx. It also fools the programmer, who expects local memory to make the code fast and gets slow code instead (because of the double memory copies when buffering into local memory).
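
To make the "double memory copies" point concrete, here is a toy counting model in Python. It is only a sketch: the function name, the three modes and the tile/reuse numbers are made up for illustration, and it ignores caches and coalescing completely, so it says nothing about real 4xxx or 5xxx timings.

```python
def global_accesses(tile_elems, reuse, mode):
    """Count global-memory accesses for one tile under a toy model.

    mode:
      "direct"   - no staging, every use reads global memory
      "hardware" - tile staged once into real on-chip local memory
      "emulated" - "local" memory actually lives in global memory
    Caches and coalescing are ignored on purpose.
    """
    if mode == "direct":
        return tile_elems * reuse
    if mode == "hardware":
        # One global read per element; all reuse happens on chip.
        return tile_elems
    if mode == "emulated":
        # Global read + write into the emulated buffer, then every
        # reuse reads that buffer back from global memory.
        return tile_elems * (2 + reuse)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    tile, reuse = 256, 8  # hypothetical tile size and reuse factor
    for mode in ("direct", "hardware", "emulated"):
        print(mode, global_accesses(tile, reuse, mode))
```

In this model, staging into emulated local memory always costs two extra accesses per element compared with not staging at all, which is why code tuned for real local memory can get slower instead of faster.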

6. 3D writable textures (CS 5.0)
======================
Although 2D writable textures without copies did not reach the CUDA world until May 2009 (well, in non-beta form), DX Compute comes with 3D writable textures, named RWTexture3D.
A guy named Voxilla has released a 3D wave equation solver using this functionality.
It reaches up to 250 fps on a 400^3 grid of floats. Impressive stuff: 16 Gpoints/s, which with an FDTD of 10 or 11 flops per point equals about 180 Gflops and, better, about 500 GB/s of bandwidth.
The bandwidth is so high because we are writing to memory that is cacheable and, better yet, tuned for spatial locality, at least in 2D (not linear memory locality); global memory is not, at least in these pre-Fermi, 5xxx days.
Well, there is a trick by a user (which I will explain one day) that allows writing to 3D textures in CUDA, but it is not very efficient and possibly not future-proof: the user reverse-engineered how 3D textures are stored in memory and wrote code to read and write a specific position.
ATI is shipping this today.
OpenCL has this functionality as an optional extension, and I expect it to require DX11 hardware on GPUs, so there is no support today (ATI is a bit late with OpenCL extensions and Nvidia has no DX11 hardware).
So we hope for it around Fermi time; ATI will not get there earlier.
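
The throughput figures quoted for the wave solver can be checked with a few lines of Python. The fps, grid size and flops per point come from the text above; the number of memory accesses per point is not given anywhere, so the value 8 below is purely an assumption picked to show how a figure of roughly 500 GB/s could come out.

```python
def wave_solver_rates(fps, n, flops_per_point, accesses_per_point,
                      bytes_per_access=4):
    """Return (Gpoints/s, Gflops, GB/s) for an n^3 grid of floats."""
    points_per_s = fps * n ** 3
    gpoints = points_per_s / 1e9
    gflops = points_per_s * flops_per_point / 1e9
    gbytes = points_per_s * accesses_per_point * bytes_per_access / 1e9
    return gpoints, gflops, gbytes


if __name__ == "__main__":
    # 250 fps and 400^3 floats are from the post; 11 flops per point is the
    # upper end of the "10 or 11" estimate.
    # ASSUMPTION: 8 float accesses (reads + writes) per point.
    gpoints, gflops, gbytes = wave_solver_rates(
        fps=250, n=400, flops_per_point=11, accesses_per_point=8)
    print(f"{gpoints:.0f} Gpoints/s, {gflops:.0f} Gflops, {gbytes:.0f} GB/s")
```

With 10 flops per point instead of 11 the same function gives 160 Gflops, so "about 180" sits at the upper end of the estimate.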



7. OpenCL has no builtin support for append/consume buffers
=================================

There is no extension for this in OpenCL right now. AMD is supposedly working on it?

8. Interop with shaders which can do scatter to textures
=====================================
This is an SM 5.0 killer feature: fragment shaders can write with random access to memory from within the shader, which can be good, for example, for a performant and memory-efficient Order Independent Translucency.
In OpenCL you need OpenGL with some DX11-equivalent extensions (anyway, both AMD and Nvidia are working on it, and AMD ships it in current drivers but without documentation).

9. ATI has DX11 optimized vs OpenCL
==========================
Believe it or not, ATI has very good CS drivers because games are coming, but its OpenCL drivers are rough.

10. Autovectorizing & MultiGPU
=====================

Shader compilers have autovectorization for pre-G80 cards and also for the 5-way SIMD in ATI cards. Right now this is not supported in OpenCL on AMD, where it would be needed for code to run at full speed.
It is also not there for AMD CPUs, to make use of SSE.
Anyway, if I make the effort to write vector code (as in the old Cg days), I want to write it only once, so I have to test whether, say, float8 or float16, which would suit AVX and Larrabee, are also efficient on CPUs with SSE on the AMD platform and on AMD GPUs, i.e. whether, for example, throughput with float8 is half that of float4 and with float16 a quarter. It also seems that there is a compiler hint (I don't know whether it is reported by the compiler or whether the user directs the compiler to the optimal vector size), so I have to see how to write efficient, variable-width floatn code using info from/to the compiler.
IBM says the Cell SDK autovectorizes code inside and across work-group threads, so that seems very good,
but note that by default Cell compiles to 1 item per work-group unless you request otherwise in code; perhaps in that case all is lost.
Intel's upcoming OpenCL also seems to be able to use autovectorization, as shader compilers do,
as well as floatn.
ATI has expressed interest in trying to get it, but it looks like at least 6-12 months of waiting, if not 1-2 years,
judging by their OpenGL driver.
Nvidia is the luckiest in this respect, as its hardware maps well to a scalar model in concept (learn about warps).
All this is said because I think that DX Compute on AMD is using autovectorization, since I think it is done by the same DirectX driver team, which is good at shader compilers. I have to test this with benchmarks of the Nvidia DX Ocean demo on ATI and with scalar and vector throughput kernels.
The last thing to note is about multi-GPU: AnandTech showed the Nvidia Ocean demo using multi-GPU, and I have to see whether it is using
2 devices for compute shaders or whether the drivers handle it, as they have support for that with shaders.
Obviously that cannot work in general, at least not if more than one kernel is launched and memory is changed, or if atomics (?) are used.

11. Guaranteed minimum values for shared mem size, workgroup size (?) and 3D grids
===========
I will have to check what I say here, but I think DX Compute requires support for running some minimum number of elements per group and also a minimum size of fast local memory.
OpenCL requires no local memory (?) and work-groups can be of size 1 (at least on CPU, as Apple uses 1).
DX Compute requires 32 KB, and some minimum sizes for groups and 3D grids.
This allows a single implementation that works on every DX11 device, while OpenCL implementations need fallbacks if they are meant to run on a CPU, for example.
Also, assuming for example that S3 uses 1 item per work-group and has no local memory, what does it offer beyond pixel shaders (hmm.. scatter to memory)?
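
As a rough idea of what such a fallback looks like on the OpenCL side, here is a small Python sketch of the decision logic only. The function and parameter names are hypothetical; in a real host program the three device values would come from clGetDeviceInfo (CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_LOCAL_MEM_TYPE), and the 256-item target is just an example value.

```python
def pick_tile_config(max_group_size, local_mem_bytes, local_mem_is_hardware,
                     wanted_group_size=256, bytes_per_item=4):
    """Choose a work-group size and whether to stage data in local memory.

    Hypothetical helper: one float of staging space per work-item is assumed.
    Returns (group_size, use_local_memory).
    """
    group = min(wanted_group_size, max_group_size)
    # Halve the group until its staging buffer fits in local memory.
    while group > 1 and group * bytes_per_item > local_mem_bytes:
        group //= 2
    # Staging only pays off with real on-chip memory and more than one item.
    use_local = (local_mem_is_hardware and group > 1
                 and group * bytes_per_item <= local_mem_bytes)
    return group, use_local


if __name__ == "__main__":
    print(pick_tile_config(1024, 32 * 1024, True))   # GPU with fast local mem
    print(pick_tile_config(1, 0, False))             # 1 item per group, no local mem
    print(pick_tile_config(256, 16 * 1024, False))   # local mem emulated
```

A DX11 compute shader can skip this kind of check because the minimums are guaranteed; the OpenCL kernel needs at least two code paths (staged and non-staged) to cover the last two cases.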

12. Images read and write simultaneous.. ?
===========================================
In OpenCL an image is either __read_only or __write_only. In DX? RWTexture?
At least SM 5.0 allows fragment shaders to do it, but in a way similar to nv_texture_barrier,
and there is also memory scatter and global atomics, but there you lose local groups, shared memory and __syncthreads.