AMD news.. ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

0. AMD 56xx series released, along with a lot of benchmarks
1. Catalyst 10.1 final on MSI (Windows XP, Vista, 7):
It seems to be 8.69 instead of the 8.70 beta, but with the same CAL drivers (for OCL).
OGL is 92xx based instead of 93xx, so I don't know whether ext_histogram and gl_amd_stencil_write are exposed, but it seems so.

Does it work with OCL 2.0?

2. DirectX 11 threading support
Catalyst 9.12 supports neither driver command lists nor concurrent creates.

There are display lists and command lists; I think one is supported and the other is not,
or at least that is what the driver reports.

"Is there an ETA on full hardware acceleration for command lists and concurrent creates? Catalyst 9.12 offers just software emulation which is really disappointing ..."

That is correct. I hope to publish code for detecting these features very soon, here as a comment.

3. Found some OIT (order-independent transparency) info on ATI Mecha vs. the Direct3D 11 SDK, and a demo:

"Mecha/A-buffer implementation "
demo here!

AMD has also released GPU ShaderAnalyzer.

What's New in Version 1.53

* Support for Microsoft DirectX 11.
* Support for Catalyst™ driver 9.9-9.12.
* Support for ATI Radeon™ HD 5870 graphics cards.
* Support for ATI Radeon™ HD 5770 graphics cards.
* Fixed support for IL disassembly.
* Fixed issue with simple GLSL shaders.

I have posted this on the AMD OCL forums:
Questions #1: getting peak FLOPS with the AMD OpenCL SDK: getting ISA MAD instructions, and maximum kernel length

Hi, I have written some kernels to get close to the maximum theoretical performance on the 5xxx series (5850).

I have written code for single-precision FP, double-precision FP, integer and 24-bit integer.

The kernels mainly use the OCL native mad instructions where appropriate:

mad: for floating point and for doubles

mad24: uses 24-bit integer multiplies

for integers, as there is no OpenCL imad instruction, I write a*b+c
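
To make the integer forms concrete, here is a minimal host-side C sketch of what mad24 computes compared with a plain a*b+c on 32-bit unsigned integers. The names mad24_u32 and mad_u32 are hypothetical, and masking the operands to 24 bits is an assumption made for illustration: OpenCL leaves the result implementation-defined when an operand does not fit in 24 bits.

```c
#include <stdint.h>

/* Hypothetical model of unsigned mad24: multiply the low 24 bits of x and y,
   then add the 32-bit value z. Masking out-of-range operands is an assumption;
   OpenCL leaves that case implementation-defined. */
uint32_t mad24_u32(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t x24 = x & 0x00FFFFFFu;
    uint32_t y24 = y & 0x00FFFFFFu;
    return x24 * y24 + z;   /* unsigned arithmetic wraps modulo 2^32 */
}

/* Plain 32-bit form, as written in a kernel when no mad instruction exists. */
uint32_t mad_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return a * b + c;
}
```

For operands below 2^24 the two functions return the same value; they differ only once an operand needs more than 24 bits.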

The problem is that all the programs compile, but the hardware mad instructions are not used except for single precision, as the AMD IL v2 and 5xxx assembly reveal.

For double precision it actually crashes, so I have to use the a*b+c form.

Although double precision is experimental, I hope you can add mad and fma instructions as soon as you can. This would let an n-body example challenge the double-precision n-body performance shown for Fermi at GTC09 :-)

So briefly:

Integer mad: no OCL instruction exists; I get this ISA:

9 t: MULLO_INT ____, PV8.w, R0.x
10 y: ADD_INT T0.y, T0.w, PS9

Single FP: correct

MULADD_e x,w,z,y

Double precision: using native double mad or fma crashes, and using a*b+c I get (IL):

dmul r177.xy__, r178.xyxy, r177.xyxy
dadd r177.xy__, r177.xyxy, r178.xyxy

Integer mad24:

imul + iadd + ishl + ishr (in AMD IL, but the assembly is in the same horrible situation)

Note that the 5850 supports the native MULADD_UINT24 ISA instruction.

So I can't obtain better than half the theoretical ops/s in DPFP, integer and 24-bit integer.

In fact, the last case is 4x slower (assuming a similar time for each instruction).
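
The estimate can be written as simple arithmetic. This is a rough C sketch that assumes, as stated, that every instruction takes the same time; it ignores VLIW packing and other scheduling effects, and peak_fraction is a hypothetical helper name.

```c
/* Fraction of the peak multiply-add rate reached when one hardware mad
   is replaced by `instructions_per_mad` separate instructions of equal cost. */
double peak_fraction(int instructions_per_mad)
{
    return 1.0 / (double)instructions_per_mad;
}
```

With mul + add (2 instructions) this gives 0.5, and with imul + iadd + ishl + ishr (4 instructions) it gives 0.25.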

One problem I see for mad24 is that AMD IL 2.0 does not seem to expose a mul24 instruction. Since OpenCL seems to generate AMD IL first, how is this going to be solved? The ISA instruction MULADD_UINT24 does exist.

Also, I can't believe AMD is at such an early stage with this special instruction, since OpenCL and DirectCompute can use it to accelerate thread ID index calculations for blocks/grids of fewer than 16M elements. CUDA programs do this a lot; I think it is one reason CUDPP limits some functions to 16M elements.
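
A small C sketch of why such index calculations are tied to a 16M limit: a 24-bit multiply only sees the low 24 bits of each operand, so it matches the full 32-bit result only while both operands stay below 2^24 = 16,777,216. The function names are hypothetical, and masking the operands models one possible behaviour, not a documented one.

```c
#include <stdint.h>

#define LIMIT_24BIT (1u << 24)   /* 16,777,216 */

/* Index computed with a 24-bit multiply (operands masked, as an assumption). */
uint32_t global_index_24(uint32_t group_id, uint32_t group_size, uint32_t local_id)
{
    uint32_t g = group_id & (LIMIT_24BIT - 1u);
    uint32_t s = group_size & (LIMIT_24BIT - 1u);
    return g * s + local_id;
}

/* Reference index computed with a full 32-bit multiply. */
uint32_t global_index_32(uint32_t group_id, uint32_t group_size, uint32_t local_id)
{
    return group_id * group_size + local_id;
}
```

As long as group_id and group_size are both below LIMIT_24BIT the two functions agree; once an operand reaches 2^24 the 24-bit version silently drops the upper bits.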

The problem with integers, and with general code using a*b+c instead of a special mad instruction, could also be resolved if the AMD OpenCL compiler understood "-cl-mad-enable",

but it says:

Warning: invalid option: -cl-mad-enable

Note that I have tested the kernels on Nvidia OCL using a*b+c for all supported data types: they use two instructions (mul + add), but if I pass -cl-mad-enable it uses native hardware mad instructions.

One more note: I put a lot of mad instructions inside a loop and the AMD OpenCL compiler crashes; before crashing, it takes a very long time compiling. This happens even with a moderately long sequence of mad instructions; as I remember, the CUDA compiler handles this test without trouble.

Is there some argument to tell the compiler not to optimize at all? A block of mad instructions can't be optimized anyway.

Is this also a problem of the compiler unrolling the loop? How can I control loop unrolling? I think the Nvidia OpenCL compiler recognizes #pragma unroll.

If I publish this code as a benchmark, AMD cards will look bad.

More questions coming..

Thanks

Questions #2: from the SIGGRAPH Asia course

Hi, I have seen the SIGGRAPH Asia OCL course and I can't believe nobody at AMD has published a link to it here:

http://sa09.idav.ucdavis.edu/

Now I have some questions.

First, a good presentation from AMD is not online:

Generic OpenCL Optimizations (Jason Yang, AMD)
Can someone at AMD publish the presentation somewhere? It looks good for learning.

Also, from:

OpenCL C++ Bindings (Jayanth Gummaraju, AMD)
it seems AMD has a nice OpenCL Ocean demo using FFTs.

Is AMD going to publish the code in the SDK, so users can learn from a complete app using OGL interop? Or better: here and now, as a gift for forum readers :-)

Also, from:

OpenGL Interop Examples (Timo Stich, Nvidia)
although this is stuff from an Nvidia guy :-))

Regarding OGL interop: clGetGLContextInfoKHR seems very useful.

For example, you can use OCL/OGL interop, but I have both an Nvidia and an AMD card (Windows 7).

Say I set the Nvidia monitor as the default and the AMD OpenCL ICD is loaded. If I create an OGL context, it will use the Nvidia OGL driver by default. If I then create an OCL context with OGL interop, it will use the AMD OCL driver, and in fact AMD is fooling us, because it will work (so native interop in device memory is not working and the data is going through system memory). The other way round, an AMD OGL context with an Nvidia OCL context, returns an error in clCreateContext.

I can get information about these situations, and others, using:

clGetGLContextInfoKHR

The problem is that it is not in the .lib files and is not exported by the Khronos ICD DLL,

but it is in the cl_gl.h file shipped by AMD.

Although this is the AMD forum, the Nvidia situation is worse:

they don't ship a cl_gl.h with the clGetGLContextInfoKHR definition, and their older Khronos ICD doesn't expose it.

So when is an SDK with the clGetGLContextInfoKHR function going to be released?

The last one is from:

AMD IHV Talk - Hardware and Optimizations (Jason Yang, AMD)
Questions:

How many of the hardware integer instructions on slide 13 are currently exposed?

Is AMD working to enable them through extensions?

I'm interested in these (as at least some of them aren't supported by DirectCompute):

*Reverse bits
*Integer Add with carry

*1bit prefix sum on 64b mask. (useful for compaction)
*Shader Accessible 64 bit counter

In the ISA docs I can at least find information about the first two, but I'm interested in

1bit prefix sum on 64b mask. (useful for compaction)
How do I use it? Is there a CAL example? More info, please.
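
As an illustration only (the actual instruction and its exact semantics are not described here), a 1-bit prefix sum over a 64-bit mask can be emulated on the host in C. The sketch assumes one flag bit per work-item of a 64-wide wavefront and an exclusive sum; popcount64 and prefix_sum_1bit are hypothetical names.

```c
#include <stdint.h>

/* Count the set bits of v by clearing the lowest set bit until none remain. */
unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Exclusive prefix sum of the 1-bit flags in mask: the number of set bits
   strictly below position lane. Precondition: lane is in the range 0..63. */
unsigned prefix_sum_1bit(uint64_t mask, unsigned lane)
{
    uint64_t below = (lane == 0) ? 0 : (mask & (~0ULL >> (64 - lane)));
    return popcount64(below);
}
```

For compaction, a work-item whose flag is set would write its element to output position prefix_sum_1bit(mask, lane), so the surviving elements end up packed together in order.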

I would also like more info on the "Shader Accessible 64 bit counter".

Which ISA instruction is it?

Searching the ISA docs gives:
ALU_SRC_TIME_HI: Upper 32 bits of 64-bit clock counter.
228 ALU_SRC_TIME_LO: Lower 32 bits of 64-bit clock counter.
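
Assuming the two sources hold the halves of a single counter as described, the full 64-bit value would be assembled like this. A minimal C sketch; combine_counter is a hypothetical name, and how the two halves are read consistently on the hardware is not covered.

```c
#include <stdint.h>

/* Join the upper and lower 32-bit halves into one 64-bit counter value. */
uint64_t combine_counter(uint32_t time_hi, uint32_t time_lo)
{
    return ((uint64_t)time_hi << 32) | (uint64_t)time_lo;
}
```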

At least for integer add with carry there is a CUDA compiler that enables it:

http://www.mpi-inf.mpg.de/~emeliyan/cuda-compiler/
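
For reference, this is what an integer add with carry computes, emulated in portable C. It is a sketch of the semantics only, not of any particular ISA instruction or of the compiler linked above; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t sum;    /* low 32 bits of a + b + carry_in */
    uint32_t carry;  /* carry out: 0 or 1 */
} addc_result;

/* 32-bit add with carry in and carry out, computed in a 64-bit temporary. */
addc_result add_with_carry(uint32_t a, uint32_t b, uint32_t carry_in)
{
    uint64_t wide = (uint64_t)a + (uint64_t)b + (uint64_t)(carry_in & 1u);
    addc_result r;
    r.sum = (uint32_t)wide;
    r.carry = (uint32_t)(wide >> 32);
    return r;
}

/* 64-bit addition built from two 32-bit limbs chained through the carry. */
uint64_t add64_via_limbs(uint64_t x, uint64_t y)
{
    addc_result lo = add_with_carry((uint32_t)x, (uint32_t)y, 0u);
    addc_result hi = add_with_carry((uint32_t)(x >> 32), (uint32_t)(y >> 32), lo.carry);
    return ((uint64_t)hi.sum << 32) | (uint64_t)lo.sum;
}
```

Chaining limbs this way is the usual reason to want the carry exposed: wide integer arithmetic can be built from 32-bit adds without extra compare instructions.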

And what about the DX11-based ones: find first bit, etc.?

Also, as I said in an earlier post: 24-bit integer MUL, MULADD
Well, this isn't generated even when using OCL mad24,

so I don't know if:

– Heavy use for Integer thread group address calculation
on the slide is correct.


Thursday, 14 January 2010
