More than 10 places where DX Compute 5.0 is better than OpenCL! ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Sorry, xbitlabs, but I don't think so..
Please understand that most of my thoughts are about using one API versus the other today or in the near future,
and rest on some common sense about the companies involved..

The whole point is that, comparing OpenCL 1.0 with bare-metal DX Compute, DX Compute wins on functionality and richness of features.. Also, because of the potentially broad market and its use in next-gen blockbuster games, vendors will spend more time optimizing their DX Compute drivers, at least in the early days, which is today.. This is already true: ATI ships phenomenal DX Compute drivers for the 5xxx series..
Nvidia has made a great effort with CS 4.1, which in any case is still richer than OpenCL 1.0.
Also, just as with OpenGL vs DX development today, there is the mad situation where OpenGL needs multiple rendering paths depending on which extensions are present if you want to use the latest features, while DX has almost no cap bits..
In DX Compute you can not only use everything that is available (which is more than OpenCL offers by default), it is also expected to be fast (see point 5).

Let's see why..

OpenCL, being targeted at HPC and scientific applications and being pushed forward by a lot of companies, does not enable by default a lot of functionality that already ships in old CUDA hardware (atomics, for example)..
This is so that a lot of vendors can claim compliance: CPUs, GPUs (S3 too), the PS3 Cell (embedded profile), PowerVR SGX chips (embedded profile)..
Also, as there are non-GPU implementations too, a lot of the graphics stuff is optional..

Anyway, with current extensions OpenCL regains a lot of the functionality found not only in DX Compute but also in CUDA..
As always (I mean as with its OpenGL drivers) Nvidia has done good work: today the 195 drivers for both Linux and Windows ship OpenCL with good performance plus atomics, byte-addressable mem, graphics interop and double support..
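
Since all of this is optional in OpenCL 1.0, a program has to check what the driver reports before choosing a code path. Here is a minimal sketch of that check in plain Python, assuming you already have the space-separated string a driver returns for CL_DEVICE_EXTENSIONS. The extension names are the real KHR ones from OpenCL 1.0; the sample device string and the function name `missing_features` are made up for illustration.

```python
# Map of features discussed in this post to the OpenCL 1.0 KHR extension
# strings that enable them. The feature labels are my own wording.
REQUIRED = {
    "atomics (global)": "cl_khr_global_int32_base_atomics",
    "atomics (local)": "cl_khr_local_int32_base_atomics",
    "byte-addressable stores": "cl_khr_byte_addressable_store",
    "OpenGL interop": "cl_khr_gl_sharing",
    "double precision": "cl_khr_fp64",
}


def missing_features(extensions_string):
    """Return the features whose extension is absent from the
    space-separated CL_DEVICE_EXTENSIONS string."""
    present = set(extensions_string.split())
    return [name for name, ext in REQUIRED.items() if ext not in present]


# Hypothetical device string, not copied from any real driver.
sample = ("cl_khr_byte_addressable_store cl_khr_gl_sharing "
          "cl_khr_global_int32_base_atomics")

print(missing_features(sample))
```

Every entry in the returned list is one more fallback path to write and test, which is exactly the cap-bits situation DX Compute avoids.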

Anyway, don't get pessimistic: the situation is similar to OpenGL today. There are major vendors shipping mediocre drivers (hi Intel), but if you stay with Nvidia (ok.. AMD does quite well today too) you get OpenGL 3.2/3.1 drivers and a lot of other extensions.. and that lets Nvidia ship functionality almost equivalent to DirectX 10.1 today.. What about DirectX 11? Well, Nvidia has some functionality cooking in the 195 drivers, and I don't doubt that on the day Fermi is released they will have a lot of extensions bringing OpenGL to parity with Direct3D 11.. And that will reach Linux users too.. In fact Nvidia has always done this, to the point that most of the GeForce 8800 launch demos were OpenGL ones using their proprietary DX10-level OpenGL extensions (which, sooner or later and with some makeup, get into ARB or GL core functionality)..

Details next:

1. Graphics interop
=============
I mean that in DX Compute interop with the graphics API is built in, not an extension (cl_khr_gl_sharing), and both vendors ship it today..
In OpenCL only Nvidia ships it.. ATI is working on it..
Also, while it is supposed to work without copies in both APIs, I suspect that is not always true on some early OpenCL drivers.. DX Compute I feel is well implemented by both vendors today anyway..
The bad side of DX Compute is that it only has DX interop, while OpenCL gets OpenGL interop and will get DX interop too (at least supposedly on ATI).

2. Image support builtin
================

Using textures from a compute shader is not an extension, and both vendors ship it today.
In OpenCL only Nvidia ships it today..
ATI is working on it..

3. CS (5?) can write to backbuffer directly
=========================
See Voxilla's Mandelbrot demo, for example.
The best OpenCL can do is render to a renderbuffer or texture attached as a color buffer of an FBO using OpenGL interop, which is an extension..
That still needs a copy to the backbuffer with a framebuffer blit..

4. DX Compute has local and global atomics and byte-addressable mem (RWByteAddressBuffer) by default (CS 5.0)
================================================================
These are shipping today from both vendors (well, Nvidia without atomics, because they aren't in CS 4.1).
In OpenCL only Nvidia has them today..
ATI is working on it..

5. Local mem is always fast hardware local mem (shared mem on Nvidia, LDS on ATI)
=================================
CS 5.0 exposes general read/write access to local mem because DX11 hardware has it (Nvidia has had it since G80, but ATI does not on 4xxx).
CS 4.1 exposes a limited write ability that matches what 4xxx can do, so 4xxx can run very fast in those specific cases..
In OpenCL on 4xxx, local mem is emulated using global mem, so programs can run, say, 10-20 times slower than on 5xxx. It also fools the programmer into thinking that using local mem will make the code fast when it actually makes it slow (because of the double memory copies if you are buffering into local mem).

6. 3D writable textures (CS 5.0)
======================
Although 2D writable textures without copies did not arrive in the CUDA world until May 2009 (well, in non-beta form), DX Compute comes with 3D writable textures.. named RWTexture3D..
A guy named Voxilla has released a 3D wave equation solver using this functionality..
It reaches up to 250 fps on a 400^3 grid of floats, impressive stuff.. That is 16 Gpoints/sec, which with a finite-difference (FDTD) scheme of 10 or 11 flops per point comes to roughly 160-180 Gflops and, better yet, about 500 Gbytes/s of bandwidth..
The bandwidth is so high because we are writing to memory that is cacheable and, better yet, tuned for spatial locality, at least in 2D (not linear memory locality); global mem is not, at least in these pre-Fermi or 5xxx days..
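
The arithmetic behind those numbers can be checked in a few lines. This is only a sketch using the figures quoted above (400^3 floats, 250 fps, 10 or 11 flops per point, about 500 Gbytes/s); the variable names and the derived bytes-per-point figure are mine, not something taken from the solver itself.

```python
# Figures quoted in the post for the 3D wave equation solver.
GRID_EDGE = 400                  # grid is 400 x 400 x 400 floats
FPS = 250                        # frames (full grid updates) per second
BANDWIDTH_BYTES_PER_S = 500e9    # "about 500 Gbytes/s"
BYTES_PER_FLOAT = 4

points_per_frame = GRID_EDGE ** 3        # 64,000,000 points
points_per_s = points_per_frame * FPS    # 16e9 points per second


def gflops(flops_per_point):
    """Sustained Gflops for a given arithmetic cost per grid point."""
    return points_per_s * flops_per_point / 1e9


# Derived figure: how much memory traffic per point the bandwidth implies.
bytes_per_point = BANDWIDTH_BYTES_PER_S / points_per_s
floats_per_point = bytes_per_point / BYTES_PER_FLOAT

print(points_per_s / 1e9, "Gpoints/s")
print(gflops(10), "to", gflops(11), "Gflops")
print(bytes_per_point, "bytes per point, about", floats_per_point, "floats")
```

So 500 Gbytes/s corresponds to roughly 8 float accesses per point, which would fit a stencil that reads several neighbours and writes one value, although that reading is my assumption.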
Well, there is a trick by a user (that I will explain one day) that allows writing to 3D textures in CUDA, but it's not very efficient and possibly not future proof (the user reverse engineered how 3D textures are stored in memory and wrote code to read and write a specific position)..
ATI is shipping this today..
OpenCL has this functionality as an optional extension, and since I expect it to require DX11 hardware on GPUs there is no support today (ATI is a bit late with OpenCL extensions and Nvidia has no DX11 hardware)..
So we can hope for it around Fermi time.. ATI will not get there earlier.

7. Append/consume buffers are built in (OpenCL has no equivalent)
=================================

There is no extension for this in OpenCL for now.. AMD is supposedly working on it?
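
For readers who have not used them: an append/consume buffer is basically a buffer with a hidden element counter that the hardware bumps atomically, so many threads can push or pop results without computing indices themselves. Below is a toy CPU model of that idea in Python, not the D3D11 API. The class `AppendConsumeBuffer` and the `worker` function are hypothetical names, and a lock stands in for the hardware atomic counter.

```python
import threading


class AppendConsumeBuffer:
    """Toy model: fixed storage plus a hidden counter bumped atomically."""

    def __init__(self, capacity):
        self._data = [None] * capacity
        self._count = 0
        self._lock = threading.Lock()   # stands in for the hardware atomic

    def append(self, value):
        with self._lock:                # atomically reserve one slot
            slot = self._count
            self._count += 1
        self._data[slot] = value        # each caller owns a unique slot

    def consume(self):
        with self._lock:                # atomically release the last slot
            self._count -= 1
            slot = self._count
        return self._data[slot]

    def __len__(self):
        return self._count


def worker(buf, start, n):
    """Stream compaction: keep only the multiples of 3 in [start, start + n)."""
    for i in range(start, start + n):
        if i % 3 == 0:
            buf.append(i)


# Phase 1: four threads append concurrently (like one dispatch appending).
buf = AppendConsumeBuffer(capacity=1000)
threads = [threading.Thread(target=worker, args=(buf, t * 250, 250))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Phase 2: a later pass consumes everything; the order is not guaranteed.
kept = sorted(buf.consume() for _ in range(len(buf)))
print(len(kept), "elements kept")
```

In OpenCL you can build the same thing by hand with a global atomic counter, but only where the atomics extensions are available.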

8. Interop with shaders which can scatter to textures
=====================================
This is a killer SM 5.0 feature: fragment shaders can write with random access to memory from within the shader, which is good, for example, for a performant and memory-efficient Order Independent Transparency..
In OpenCL you need OpenGL with extensions equivalent to DX11 (anyway both AMD and Nvidia are working on it, and AMD ships it in current drivers but with no documentation).

9. ATI has DX11 optimized vs OpenCL
==========================
Believe it or not, ATI has very good CS drivers, since games are coming, but its OpenCL drivers are crude..

10. Autovectorizing & MultiGPU
=====================

Shader compilers have autovectorization for pre-G80 cards and also for the 5-way SIMD units in ATI cards. Right now this is not supported in AMD's OpenCL, where it would be needed for code to run at full speed..
It is also not there for AMD CPUs, to make use of SSE, though.
Anyway, if I make the effort to write vector code (as in the old Cg days) I want to write it only once, so I have to test whether, say, float8 or float16, which should suit AVX and Larrabee well, are also efficient on CPUs with SSE on the AMD platform and on AMD GPUs, i.e. whether for example throughput with float8 is half that of float4 and with float16 a quarter.. It also seems that there is a compiler hint (I don't know whether it is produced by the compiler or whether the user directs the compiler towards the optimal vector size).. so I have to see how to perhaps write efficient, variable-width floatn code using info from/to the compiler..
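
The float8 "half" and float16 "quarter" guess comes from a very simple model: a floatn operation on hardware with fewer than n lanes gets split into several instructions. Here is a sketch of that idealised model. It ignores register pressure, memory and everything else a real compiler deals with, and the function names are hypothetical. For the hint itself, OpenCL 1.0 does define both a device query (CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT) and a kernel attribute (vec_type_hint), which may be what is meant above.

```python
import math


def instructions_per_op(vector_width, hw_lanes):
    """Hardware instructions needed for one floatN operation on a SIMD unit
    that is hw_lanes floats wide (idealised: no other overheads)."""
    return math.ceil(vector_width / hw_lanes)


def lane_utilisation(vector_width, hw_lanes):
    """Fraction of SIMD lanes doing useful work for one floatN operation."""
    issued = instructions_per_op(vector_width, hw_lanes) * hw_lanes
    return vector_width / issued


# Lane counts: SSE is 4 floats wide, AVX 8, and a 16-lane unit stands in
# for something Larrabee-like.
for name, lanes in [("SSE", 4), ("AVX", 8), ("16-lane", 16)]:
    for n in (1, 4, 8, 16):
        print(name, "float%d:" % n,
              instructions_per_op(n, lanes), "instr,",
              lane_utilisation(n, lanes), "utilisation")
```

On 4-lane SSE, float8 takes 2 instructions and float16 takes 4, so operations per second drop to a half and a quarter while the lanes stay full. On wider hardware it is the narrow types that waste lanes, which is why a single hand-picked floatn width does not fit every device.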
IBM says its Cell SDK autovectorizes code inside and across workgroup threads, so that seems very good..
But note that by default the Cell compiles to 1 item per workgroup unless you request otherwise in code.. perhaps in that case all is lost..
Also, Intel's upcoming OpenCL seems to be able to use autovectorization, as shader compilers do, as well as explicit floatn code.
ATI has expressed interest in trying to get it, but it looks like at least 6-12 months of waiting, if not 1-2 years, judging by their OpenGL driver..
Nvidia is the luckiest in this respect, as OpenCL maps well onto its conceptually scalar hardware (learn about warps)..
All this is said because I think DX Compute on AMD is using autovectorization, since I think it's done by the same DirectX driver team that is good at shader compilers.. I still have to benchmark the Nvidia DX Ocean demo on ATI and test kernels with scalar and vector throughput..
The last thing to note is MultiGPU: anandtech showed the Nvidia ocean demo running on multi-GPU, and I have to see whether it's using 2 devices for compute shaders or whether the drivers handle it, as they do for shaders..
Obviously that can't be general, at least if more than 1 kernel is launched and memory changes, or if atomics(?) are used.

11. Guaranteed minimum values for shared mem size, workgroup size (?) and 3D grids
===========
I will have to check what I say here, but I think DX Compute requires support for running some minimum number of elements per group and also a minimum size of fast local mem..
OpenCL requires no local mem(?) and workgroups can be of size 1 (at least on CPU, as Apple uses 1).
DX Compute requires 32 KB and some minimum size for groups and 3D grids..
This allows a single implementation that works on every DX11 device, while OpenCL implementations have to have fallbacks if you want them to run on a CPU, for example.
Also, assuming for example that S3 uses 1 item per workgroup and has no local mem, what does it offer over pixel shaders (mm.. scatter to mem)?
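
To show what such a fallback looks like in practice, here is a small sketch of host-side logic in Python. It assumes you have already queried the device limits (in OpenCL these would come from CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_DEVICE_LOCAL_MEM_TYPE and CL_DEVICE_LOCAL_MEM_SIZE). The function names, the preferred size of 256 and the "tiled"/"direct" variant labels are my own inventions for illustration.

```python
def pick_local_size(global_size, device_max, preferred=256):
    """Largest work-group size <= min(preferred, device_max) that evenly
    divides global_size (OpenCL 1.0 requires an exact division)."""
    size = max(1, min(preferred, device_max, global_size))
    while global_size % size != 0:
        size -= 1                      # size 1 always divides, so this ends
    return size


def choose_kernel_variant(local_size, local_mem_type, local_mem_bytes,
                          tile_bytes):
    """Use the tiled kernel only when local memory is real on-chip memory,
    big enough for one tile, and the group has more than one work-item."""
    if (local_size > 1 and local_mem_type == "CL_LOCAL"
            and local_mem_bytes >= tile_bytes):
        return "tiled"                 # stage data through fast local mem
    return "direct"                    # read straight from global mem


# A GPU-like device and a CPU-like device that only allows groups of 1.
print(pick_local_size(1024, 1024), pick_local_size(1024, 1))
print(choose_kernel_variant(256, "CL_LOCAL", 32768, 4096))
print(choose_kernel_variant(1, "CL_GLOBAL", 0, 4096))
```

With DX Compute's guaranteed minimums the second variant would not need to exist; in OpenCL both kernels have to be written and tested.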

12. Reading and writing images simultaneously.. ?
===========================================
In OpenCL images are __read_only or __write_only. In DX? RWTexture?
At least SM 5.0 allows fragment shaders to do it, but in a way similar to NV_texture_barrier.
Shaders also get mem scatter and global atomics, but there you lose local groups, shared mem and __syncthreads.
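
When an image can only be read or only be written inside one kernel, the usual workaround is ping-ponging between two buffers: each pass reads one and writes the other, then they swap roles. Here is a minimal CPU sketch of that pattern in Python, using a 1D 3-point average as a stand-in for a real image kernel. The function names `blur_step` and `run` are hypothetical.

```python
def blur_step(src, dst):
    """One 'kernel launch': reads only src, writes only dst.
    3-point average with edges clamped to the border value."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0


def run(initial, steps):
    """Ping-pong between two buffers so no pass reads what it writes."""
    a = list(initial)            # plays the read-only image
    b = [0.0] * len(a)           # plays the write-only image
    for _ in range(steps):
        blur_step(a, b)
        a, b = b, a              # swap roles for the next pass
    return a


print(run([0.0, 0.0, 3.0, 0.0, 0.0], 1))
```

The cost is a second copy of the image in memory, which a truly read/write texture would avoid.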

Sunday, 8 November 2009
