October 2009 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 31 October 2009

IBM OpenCL support!

Posted on 14:51 by Unknown

installing on PS3..
I may write about CL_DEVICE_ACCELERATOR
local mem type
get bin info
build from binaries
extensions
samples
perf

Posted in | No comments

Whises for OpenCL 1.1 and more!

Posted on 13:35 by Unknown

Make core DirectCompute 5.0 hardware features:
posted http://www.khronos.org/message_boards/viewtopic.php?f=41&t=2160
*Atomics to global and local mem. (int32 base and extended extensions)
-> now that is supported would be good to add:
*Append/consume buffers (see AMD stuff), a global queue/stack accesable with no hazards..
*Byte addressable support.
*Half support (cl_khr_fp16)
*Require that local mem is not of type global (as in 4xxx cards due to write to LDS restrictions..)
*Expanded DirectCompute 5.0 integer support (bit count,bit reverse,etc..)

As doubles (cl_khr_fp64) is an optional feature of compute shaders as 57xx proof that no luck..

Also if it's not currently required require:

*Image support for FULL profile.
*OpenGL interop for GPU devices.

Add extensions or promote to core depending if is AMD/Nvidia specific support or multivendor:

* Multivendor:
*Add support for accessing system mem from GPU kernels:
thats currently supported in both Nvidia and AMD devices an exposed in CUDA 2.2 and up and CAL.
so called pinned system mem (in CUDA 2.2 for GT 200 devices), host mem export (AMD CAL)
*Implement DirectX interop (AMD ships header)
*Getting info of integer support.. if there are native 24 int muls (CUDA devices before Fermi and AMD 5xxx (every ALU)) or int32 muls (Fermi, AMD 4xxx and 5xxx(only 5th ALU))..

AMD proposed ones (some are said hardware features 5xxx press kit some 4xxx hardware support):

*Global Data Share and Wave sync support (GDS,etc..)
*Native SAD hardware support.
*Expose registers shared per SIMD.. (shared registers avaiable in compute shader in CAL which allow doing reductions in fixed number of steps say 2 or 3 vs. logN)

Nvidia ones:

*Improve memory API for supporting CUDA 2.2 mem impovements: Expand support for creating "shared pinned buffers" (in cuda parlance) (buffers of host mem that are pinned and usable from multiple GPUs as pinned mem (using DMA)
and also shared pinned system mem.

*Expose partial simultaneous mem image objects to have read/write support with strict limitations: exposing current RWTexture Direct3D 11 abilities and also of NV_texture_barrier OpenGL extension of reading to an already bound FBO texture
of reading the same texel before writing to it..

*Expose interop with CUDA:
Code interop: support for interchanging PTX kernel code from CUDA functions or OpenCL functions with identical name and arguments (signature) and using at clBuildfromBinaries..
Mem interop: Ability to use mem buffers allocated from CUDA in OpenCL or viceversa..
This should allow directly suportig proposed "shared pinned buffers"

*Fermi support. Provide new extensions supporting this features:

*Expose function pointer and stack support which provides true function calls and recursivity..
*Expose Fermi support for executing host code inside kernels
*Expose Fermi support for allocating mem in kernels (malloc/free functions)
*Expose C++ language in Kernels (?)
*Expose expanded information of ECC support: say ECC protected registers, and mem(local/global), ECC protected path from mem GPU <-> GDDR chips.. also if possible ECC codes info: error detection capability (Fermi can detect 3 bits in and 1 bit recovery support for every xx bits..)
*Add perhaps some exception support (assuming not full C++ support as CUDA 3.0) for managing/getting acknowledged of irrecoverable errors (where (in mem chips or registers) in kernel code.. If not possible in kernel code at least finish kernel and return via some mechanism to the host this info..
*Add perhaps some info of where atomics are implemented for knowing if we can expect high performance or not (say if they are handled in L2/L3 caches (Fermi) or in memory controllers or compute units (ALUs) (preFermi))
Also NVIDIA implement some features that require no extension to OpenCL API as API model allow that.. and allow getting device info querying information of if it's avaiable and other device info support:

For example using multiple command_queues and events support for hardware that supports it:
*Concurrent mem/kernel exec.. CUDA 1.1 devices (G9x,GT200,Fermi) and AMD(?)
*Concurrent kernel execution.. Fermi (also AMD on 5xxx)
*Concurrent H2D and D2H.. using Fermi twin DMA engines.

*Predication support (I have doubts?) Equivalent to CMOV avoiding using branching hardware. Basically avoiding that conditional code gets executed executing both paths.(?)

Posted in | No comments

3D Vision and Direct3D 11

Posted on 13:34 by Unknown

Well it can take a year until until a lot of blockbuster games features Direct3D 11..
can be that 3D Vision supports last..
Also if you use compute shaders on direct3d 11 feature level 10 you can on GT200 make demos that don't go but 3d Vision i.e. on direct3d 10 hardware..
So I'm gonna to use direct3d interop to see say nvidia oncean on gt220 with 3d vision enabled..
Also I'm trying to see if its possible to run this demo demo that creates a direct3d 11 level 11 on ATI and then redirects output to a Nvidia create direct3d 9 context with 3d vision support..
When Fermi supports direct3d 11 I want to check with Direct3D 11 demos (which I have source) as say OIT or another.. OIT would be cool as Avatar has translucency and 3D was spectacular..
Also to have ogl demo stereo que he pasado a RTFSSsreensaver

The idea is to have this:
*Modify Nvidia Ocean demo running on OpenCL on Linux,MacOSX,Windows with OpenGL interop and my OpenGL QB driver on all OSes.. using my stereo QBF
*Modify Nvidia Ocean demo on DirectCompute to have Direct3D 11 context with direct3d interop and 3D Vision support..

Posted in | No comments

H264 harware decoding/ encoding GPUs

Posted on 12:02 by Unknown

Ok,
I am gonna talk about say H264 encoding and decoding:
You have software h264 CPU multithreaded implementations:
x264: multithread encode decode
intel Media sdk: multithread encode decode tough IPP on CPU. Supports GPU decode..
ffmpeg-mt: multithread decode.. ffmpeg already multithreadedly encodes trough x264..

For GPU hardware video:
Mac: Snow Leopard. Adds video hardware acceleration playing CoreVideo (only 9400m)
iphone has hardware decoder API (encoder API? since Omap 3530 has enconde up to
720p)
Iphone has AAC decoder/encoder hardware( note apple dev center)
Linux: VDPAU (nvidia,s3),VAAPI(intel poulsbo, 4500hd mpeg2 currently),XvBAU
Windows:
Multivendor
DXVA has GPU decoding: DXVA (xp),DXVA 2.0 (vista),DXVA-HD (7) (mpeg4,mpeg2,h264,vc1)
Player: media player home cinema
WMF MFT( vista): software decoding
WMF MFT(win 7): Windows 7 MFT has hardware GPU decoding and encoding see transcode sample (windows media player 11)
see dxvachecker 2.1 dxva and mft
Nvidia
nvcuvid (cuda sdk sample)
nvcuvenc (see mediacoder but private API) uses cuda
AMD:
some API?

video hardware interop:
cuvid: (ogl)yes (dx)yes
cuvenc (dont know)
vaapi: gl interop and nvidia and amd and itnel support backend vdpau backend xbvau:
vdpau: gl through ext_texture_from_pixmap (uses high CPU currently through host?)
xbvau: same as vdpau? also very efficient
dxva: dx interop d3d9 (use d3d9 interop to use direct3d 11 and dxva)
mac: corevideo (interop ?)

UPDATE:
xbvau avaiable now in gbauschenne only vaapi backend with OpenGL interop..

Posted in | No comments

Interop GPU computing graphics apis stuff

Posted on 11:51 by Unknown

we have
cuda
opencl
opengl
d3d

1.cuda to cuda
1.1.1 between context multiple devices
see shared pinned mem (host pinned for two GPUS), shared pinned host mem
(host pinned for two GPUS use as GPU mem)
optimized between contexts on same device: none existant (and memory is virtualized per context so shaders can access mem by diferents contexts in same virtual address)
We can try of CUDA 3D write textures, namely alloc 1 context set magic free alloc another context see it
1.1.2use opengl interop between devices and CUDA/OGL interop
1.2 cuda/ogl existant texture interop cuda 3.0 (no copies)
between devices optimized for Quadro/Tesla board interop (see wglaffinity)
1.3 cuda/d3d existant texture without copies. cuda 3.0
1.4 cuda/opencl mem objects from GL object by cugraphics API in cuda 3.0? of course not PTX interop (1.4 vs 1.5), not general mem see different cuda context 1.1.1, see my opencl reoprt

2.0 opencl.
1.1 ocl ocl I must see as cuda to cuda but yes pinned but not shared mem not pinned shared mem
1.2 ocl gl yes 195 cl_khr_sharing
1.3 ocl dx yes ati extension must see nvidia

3.ogl
3.1 OpenGL interop between devices
see affinity (commands send to one GPU) only windows, mac and linux yo can
see copy_image (efficient between 1context,contexts,devices)
amd similar extensions for transfering images and commands to 1 GPUs..
sync between gpus in QUadro but Radeon..
3.2 ogl dx wgl_dx_interop same device,winxp currently plans win7 vista, d3d9 currently and d3d10 plans (what d3d 11)

4.D3D 9 10 11 interop
4.1
http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=D3D9ExDXGISharedSurf&DownloadId=7944

TODO:
study AMD CAL,AMD OpenCL, AMD OpenGL?

Posted in | No comments

Thursday, 29 October 2009

3D vision good stuff

Posted on 20:53 by Unknown

There are some 3D HD videos:

Search mtbs3d:
fly me to the moon trailer

nvidia web site:
racing hd

Also good demo of 1M particles goes stereo:

Blunderbuss :
http://pouet.net/prod.php?which=53950
Making off:
http://directtovideo.wordpress.com/2009/10/06/a-thoroughly-modern-particle-system/

Posted in | No comments

Getting PTX, AMD_IL from languages:

Posted on 20:50 by Unknown

AMD:
====

CAL: Yes (with wrapper). API use calclCompile
OpenCL: Yes (with wrapper Linux &Win) For Mac use wine and llc etc..
Direct Compute: No but AMD employee able to get it don't know currently..
I need to investigate

Of course from sources AMD SKA (I suspect adding DirectCompute).

Nvidia
======

CUDA: You can get with nvcc -ptx source level. Binaries: made a wrapper or from exe from GPU ocelot see how to get cubin or ptx and for cubin then use decuda.
OpenCL: Yes from API or made that from source. if .cl not avaiable not extractable by exe then Intercept nvcompiler.dll.
DirectCompute: from sources compile and investiage.. see AMD..

Use Nexus..

Posted in | No comments

Updated CUBLAS before CUDA 3.0

Posted on 20:50 by Unknown

Did you know CULA tools
include cublas library with added funs:
Basically unify missing funcs in Z,C,D,F domains:
basically triangular solve,matrix vector,triangular matrix vector..
TRSMV,etc..

Need to see if CUDA 3.0 has some missing..

Posted in | No comments

ATI and Nvidia extensions for DX11 and 10.1 ARBs

Posted on 20:02 by Unknown

ATI
===
Seems they are hidden on catalyst 9.10
I have seen..
on ogl driver binary in catalyst 9.10 rc7 and linux ubuntu 9.10 drivers this new extensions:

dx 11
=====

1.GL_AMD_gpu_shader5
seems to have shader model 5.0

2.GL_AMD_patch_tessellator

seems to have domain and hull shaders

3.GL_AMDX_texture_compression_dxt6
4.GL_AMDX_texture_compression_dxt7
new bc6 bc7 hdr texture compression in dx11

of course 16kx16k textures should be supported now..

With these and OpenCL only multithread rendering seems to be lacking on OpenGL.. I'm sure Nvidia has also something to say
in GTC 2009.

Other extensions
5. GL_AMDX_random_access_target

semems as bindless Nvidia Extensions.. or scattering
to renderbuffers/textures in fbos..

OPTION AMD_random_access_target;
AMD_random_access_target

6. GL_AMDX_abuffer_oit

seem A-Buffer Order Independent Tranlucency support builtin
similar to DirectX11 sample

glFlushOITBuffersAMDX

dx 10.1
=======
new GL_AMD_texture_cube_map_array

NVIDIA
======
Please check that on a GT220-260 GPUS.
First 191.00 seems to be the first drivers to
have full support OpenGL 3.2 in a WHQL production driver and more important the first to implement full ARB dx10.1 extensions.

* GL_ARB_draw_buffers_blend
* GL_ARB_sample_shading
* GL_ARB_texture_cube_map_array
* GL_ARB_texture_gather
* GL_ARB_texture_query_lod

Also found on a linux glext.h from June 2009 may be reworked..

doubles
that should be supported on GT200
GL_NV_gpu_program_fp64

dx10.1
======
GL_NV_texture_cube_map_array
glsl:
GL_NV_gpu_shader4_1

assembly:
GL_NV_gpu_program4_1
GL_NV_vertex_program4_1
GL_NV_geometry_program4_1
GL_NV_fragment_program4_1

dx11
please check that on GT300 wink

all assmebly:
GL_NV_gpu_program5
GL_NV_vertex_program5
GL_NV_tessellation_program5
GL_NV_geometry_program5
GL_NV_fragment_program5

GL_NV_tessellation_shader (glsl?

GL_NV_draw_buffers3
GL_NV_anc_buffer_object
GL_NV_sample_shading_control
GL_NV_transform_feedback3
GL_NVX_volatile_texture
GL_NVX_dx_interop
GL_NVX_gpu_sync_buffer

Posted in | No comments

OpenGL 5870 extensions

Posted on 20:01 by Unknown

Posted on Opengl.org

Seems they are hidden on catalyst 9.10
I have seen..
on ogl driver binary in catalyst 9.10 rc7 and linux ubuntu 9.10 drivers this new extensions:

dx 11
=====

1.GL_AMD_gpu_shader5
seems to have shader model 5.0

2.GL_AMD_patch_tessellator

seems to have domain and hull shaders

3.GL_AMDX_texture_compression_dxt6
4.GL_AMDX_texture_compression_dxt7
new bc6 bc7 hdr texture compression in dx11

of course 16kx16k textures should be supported now..

With these and OpenCL only multithread rendering seems to be lacking on OpenGL.. I'm sure Nvidia has also something to say
in GTC 2009.

Other extensions
5. GL_AMDX_random_access_target

semems as bindless Nvidia Extensions.. or scattering
to renderbuffers/textures in fbos..

OPTION AMD_random_access_target;
AMD_random_access_target

6. GL_AMDX_abuffer_oit

seem A-Buffer Order Independent Tranlucency support builtin
similar to DirectX11 sample

glFlushOITBuffersAMDX

dx 10.1
=======
new GL_AMD_texture_cube_map_array
with these and existing

GL_AMD_texture_texture4
GL_ARB_draw_buffers_blend

Similar is denoted by =.
dx 10.1 seems to be mostly supported but with AMD existing extensions not ARB ones:

* GL_ARB_draw_buffers_blend=GL_AMD_draw_buffers_blend
* GL_ARB_sample_shading
* GL_ARB_texture_cube_map_array=GL_AMD_texture_cube_map_array
* GL_ARB_texture_gather=GL_AMD_texture_texture4
* GL_ARB_texture_query_lod

I think ati lacks 3.2 support mainly by:
glsl 1.50
GL_ARB_draw_elements_base_vertex
GL_ARB_fragment_coord_conventions
GL_ARB_seamless_cube_map
GL_ARB_texture_multisample =nv_explicit_multisample
GL_ARB_sync
GL_ARB_depth_clamp

others..
GL_AMD_video_instruction

Posted in | No comments

Nvidia 195

Posted on 19:55 by Unknown

Well all regarding Nvidia 195.39!

Has three things:
1. OpenCL:

ICD Model
=========
Seems production quality with OpenCL ICD included from Khronos.
Seems that implementations are added to the Windows Registry:
{HKCU|HKCM} SOFTWARE\Khronos\OpenCL\Vendors
Seems to search for:
VendorSuffix
OpenCLDriverName
But I can't find Nvidia one added after installing the ICD.
Also has hardcoded:
NV
nvcuda.dll
So to add ATI can be as easy as adding:
VendorSuffix=AMD
OpenCLDriverName=opencl.dll
(search ati opencl dll) perhaps rename to avoid name clashing openclamd.dll
also or copy to windows\system or add to PATH or add full path to OpenCLDriverName
Seems that dll has to add :
clGetExtensionFunctionAddress
clIcdDispatchGetPlatformIDsKHR

2. Driver
OpenCL seems to be added to nvcuda.dll
Adds:
clGetExtensionFunctionAddress
clIcdDispatchGetPlatformIDsKHR

from binaries:

New extensions:
cl_khr_fp64
cl_khr_gl_sharing

Still missing:
3d image write (fermi)
atomics 64 bits
half
fp_rounding

2.CUDA 3.0
==========
Adds CUDA 3.0. Dll reports CUDA 3.0.1.
All we can now is Driver API stuff:
Needs? to add writable 3D Arrays
Initial direct3d 11 interop:
cuD3D11CtxCreate
cuD3D11GetDevice

New generic CUDA/graphics interop:
cuGraphicsD3D10RegisterResource
cuGraphicsD3D11RegisterResource
cuGraphicsD3D9RegisterResource
cuGraphicsGLRegisterBuffer
cuGraphicsGLRegisterImage
cuGraphicsMapResources
cuGraphicsResourceGetMappedPointer
cuGraphicsResourceSetMapFlags
cuGraphicsSubResourceGetMappedArray
cuGraphicsUnmapResources
cuGraphicsUnregisterResource
(seems that finally OpenGL texture interop:cuGraphicsGLRegisterImage)
New driver apis:
cuMemcpyDtoDAsync
cuModuleGetSurfRef
cuParamSetSurfRef
Seems surface support (programmable ROPS?):
cuSurfRefCreate
cuSurfRefDestroy
cuSurfRefGetAddress
cuSurfRefGetArray
cuSurfRefGetFormat
cuSurfRefSetAddress
cuSurfRefSetArray
cuSurfRefSetFormat
See:
.surf, via surface instructions, Yes via driver, R/W, Context
.tex, via texture instructions, Yes via driver, RO, Context
My Opinion:
are writable textures (actually random access ones)
equivalent to D3D 11 RWTexture (1D,2D,3D)
as are random access say UAV..
Form Timothy Farrar:
So if one reads between the lines, .surf is effectively a high latency coherent read and writable cache, probably with format conversion, and perhaps blending. Effectively a programmable ROP. Could be how NVidia plans to take on Larrabee's programmibility, opening up efficiency for all sorts of problem solving which requires coherent scatter of small scaler values (say like a z buffer, or binning algorithms). This type of thing simply is too bandwidth inefficient to be useful currently in CUDA. Unfortunately since DX11 doesn't have programmable blending or anything resembling this functionality, my guess is that .surf doesn't see hardware support for a while, perhaps until NVidia sees if it is needed to go against Larrabee. However when CUDA gets .surf, my GL/DX days are over.

Has fermi,sm_2_0,compute_2_0
-DCUDA_NO_SM_20_INTRINSICS
(it's new ?) -DCUDA_DOUBLE_MATH_FUNCTIONS

OpenGL

Well 3.2 but
includes Cg Compiler 3.0.0.1
NV_hull_program generated by NVIDIA Cg compiler
NV_tessellation_program generated by NVIDIA Cg compiler

New extensions
==============

Are all for Fermi? I suspect fp64 ones should work with GTX 200 cards but are not reported on a GTX 200.

GL_NV_transform_feedback3-> multiple buffer streams each frequency
GL_NV_texture_buffer_object_rgb32 -> what?
GL_NV_shader_image_load_store
GL_NV_gpu_shader5
GL_NV_gpu_program_fp64
GL_NV_draw_indirect
GL_EXT_texture_compression_bptc
GL_EXT_tessellation_shader
GL_EXT_gpu_shader_fp64
GL_EXT_gpu_shader5
GL_NV_shader_subroutine ->
Not inlinig subroutines allows true calls to subroutines
(possible recursion support without tricks as Humus..)

dx11 class

GL_NV_shader_subroutine dinamic shder linkage
GL_NV_shader_image_load_store <-> Read and write to textures shader possible with scatter UAV <-> euqivalent to AMD_random_access_target
GL_NV_gpu_shader5 <-> nvidia Fermi assembly
GL_NV_gpu_program_fp64<-> nvidia double assembly
GL_NV_draw_indirect<-> d3d 11 drawIndirect
GL_EXT_texture_compression_bptc <-> new compression format (hrd one?) <-> similar to AMD one
GL_EXT_tessellation_shader <-> tesselation shaders
GL_EXT_gpu_shader_fp64 <-> Double support for GLSL shaders
GL_EXT_gpu_shader5 <-> GLSL equicalent to d3d shader model 5.0

Posted in | No comments

News from the web!

Posted on 19:04 by Unknown

Some things learned in AMD forums:

1.Why 3xxx no OpenCL:

Compute shader mode is a hardware feature that did not exist in the HD38XX line of cards. It was introduced with the HD4870. There are also quite a few differences between the SIMD units, one being data sharing between threads as added by HD48XX and improved greatly in the HD5XXX line. Currently our OpenCL implementation does not use hardware local memory on the supported HD4XXX cards because it is owners write and not a fully generate model like HD5XXX.

2. About req_work_grop_size
reqd_work_group_size lets the compiler make assumptions and optimizations about the program that cannot be made when using a dynamic local work size as specified in clNDEnqueueRangeKernel.

3.CAL Peaks flops demo:
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=119351&STARTPAGE=2&FTVAR_FORUMVIEWTMP=Linear

I will add to my OpenCL and CUDA peak flops code.
for r8xx
Change #define NR_GROUPS 10 to #define NR_GROUPS 20 ( as 5870 has 20 simd cores )
also use cal info to use dinamically

for 3xx hardware:

In order to run that code on the HD3850, you need to change a few things.
il_cs <-- this must be il_ps
; must be equal to THREADS_PER_GRP
dcl_num_thread_per_group 512 <-- this must be removed
vaTid.x <-- this must be vObjIndex.x

Also, you need to declare your vObjIndex in the kernel like some of the cal samplers.
You also need to change from calctxRunProgramGrid to calctxRunProgram
This won't work unmodified on that card because it does not have hardware compute shader.
about time
Thanks! Got the value of 2660Gflops but time elapsed remained the same(about 6.5secs). I tried with number of threads per SIMD in range 64-256 (wavefrontsize = 64) and i got much better time with fewer threads but fewer gflops as well. The best combination was with 256 threads where time was getting close to the half (3.4secs) and gflops a bit lower than maximum (about 2600Gflops). So, is it all about overhead or something;s wrong with the app / CAL intialiazations?

4.About Warp in AMD?
Wavefront size
Actually the wavefront size is only 64 for the highend cards(48XX, 58XX, 57XX), but 32 for the middleend cards and 16 for the lowend cards. You can query via CAL for information on your specific card.
Don't possible query in OpenCL but yes for nvidia opencl extensions warp size.

The best you can do is query CL_KERNEL_WORK_GROUP_SIZE with the clGetKernelWorkGroupInfo API call and that will tell you what the largest size you can execute and the wavefront/warp size most likely will be an integer multiple of that value. In some cases that value will equal to the wavefront/warp size, but that is usually because of resource constraints.

5.number CPU cores
try set enviroment variable CPU_MAX_COMPUTE_UNITS to number of cores you want use.
6.In OpenCL kernels:
error: invalid type conversion

uint* s = (uint*)(structData + ndx * structSz);

The problem is you are converting between address spaces, which is illegal in OpenCl. The correct way to do this is as follows:
__kernel void test(__global char *structData, uint structSz)
{
uint ndx = get_local_id(0);
global uint* s = (global uint*)(structData + ndx *structSz);
*s = ndx;
}
Thanks, that fixes it. Interestingly, the NVidia compiler is not insistent on the memory specifier, but fails later.

7. Don't create binaries:
Presently, clCreateProgramWithSource is only supported. you can do what you are expecting from clCreateProgramWithBinary. This will be available in upcoming releases.

8 porting radix sort nvidia demo:

they have an error as me
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=120879&highlight_key=y

*Beyond3d Forums:

1. OpenCL kernels not need to specify local work group size in CS
This is something I think CS should support. In OpenCL, work group size can be decided by the implementation. For example, if you want to perform computation on one million numbers, you can just tell it to create one million work items, and the implementation should automatically decide how many work items a work group should have. In a sense this is very similar to a pixel shader.

2. Computer shaders can write to backbuffer directly.

RWTexture3D issues:
I have another version running twice as fast by storing the volumes as 4 channel floats instead of single channel.
It get's a little messy then to update the sources as you can not write to single channels of an unordered access texture resource, sigh ...
Also the rendering then can not make use of the hardware texture filtering.
There seems no way to type cast a 4 channel view in a 1 channel view....
Reading from a RWTexture3D is also impossible if it is rgba, can be only single channel.

Posted in | No comments

Sunday, 25 October 2009

Mem export in OpenCL

Posted on 17:36 by Unknown

One particular thing that keeps intrigued is why accesing system mem in device kernels isn't supported right now in OpenCL as this is turn seems to be supported both by AMD and Nvidia in both their implementations (since CUDA 2.2 and CAL). I also remember seing one presentation dated back to late 2007 citing support in Brook+ as future plans so it must be there by now..
I don't know how if any this has on AMD hardware but at least for CUDA is praised as a very potentially important feature..
I'm refering as an extension and perhaps is going to be added soon.. and perhaps
it is because is requires some work..

For more info in CUDA where is called pinned system mapped mem.. more or less.. see ddj article..

Posted in | No comments

Double precision support in GPU computing APIs and GPUs and emulating it..

Posted on 11:45 by Unknown

Tody I'm going to explore double precision support in all GPU computing APIs and hardware:

CUDA
====

CUDA has been the first to support it for Nvidia cards since version 2.0 (June 2008) for GTX 200 cards and his Tesla based cards.
We have to note some things:

1. Double precision support corresponds to SM 1.3 capabilities (see cudaDeviceQuery sample for how programmatically know if the GPU has support for it). Note that this also precludes all the GT 220,240,250 chips based on GT 200 architecture which are currenly shipping in this fall season of Windows 7 renewed notebooks.. In fact this are all this present only 1.2 support.. Also note that due to lame naming some notebook GPUs (as GTX 260M) by which the name would suggest GTX 200 core are in fact based in G9x chips and as such not support double precision..
In fact any currently Nvidia mobile chip is shipping with double precision support, (excluding the possibility of some crazy desktop replacament notebook shipping with desktop parts)..
Also note that DPFP is lacking also in ION (based on G9x core) and presumibly also ION 2 GPUs (based on GT 220-250 core).. Moving lower end it's also lacking in Tegra (Geforce 6 based and in fact doesn't support CUDA) and also presumably Tegra 2 chips (G9x based which should support CUDA and OpenCL expected to be announced for Mobile World Congress 2010 in February next year).
Resuming support:
G80,G9x ->No
GTX 200 ->Yes (GTX 260, GTX 275, GTX 280, GTX 285, GTX 295)
GT 200 ->No
GTX 200 Mobile ->No
Ion,Ion 2 ->No
Tegra 1/2 ->No

2. Currently Nvidia cards have low performance in double precision computations (say 1/8 of single precision performances). That amounts to 80-90 Gflops which in turn is nearly is double of a high end quad core Nehaleam/Penry/Core Quad at 2.66-3GHz..
Thats going to change with Fermi (Nvidia next GPU) wich will have half performance vs its single precision performance wich in turn amounts to 8x increase in performance vs GTX 200 cards (which amounts to around 750 Gflops assuming core clock of 1.5Ghz)

3. You have to pass a compiler option to nvcc to enable double precision support (-arch=sm_13), it's not just enough to declare your variables as double as it will get promoted to floats.

CAL and Brook+
==============

CAL supports double precision if I remember correctly since 2007 year end.. i.e.
since first CAL was released and at the time it was also a feature only shipping in 3870 cards..
Support for a non assembly level language came later I think at the time of 4xxx series..

Anyway double precision support is only on high end series 38xx, 48xx and 58xx..
Also note that double precision performance has been better than on Nvidia i.e. at 1/4 or 1/5 approx. depend on how you count.. That is currently 544 Gflops on ATI 5870 cards..

Also note that at least at the AMD IL level (for CAL) double precision is generally supported by appending d to a lot of single precision functions (so write dmad,dmul,dadd instead of mad,mul,add) and that the vector functions instead of operating on 4 elements on the same time (on .x,.y.,.z,.w components) they operated on 2 elements. In that case .xy and .zw store the double precision values using two 32-bit registers.

DirectCompute
=============

Double precision support is only one of the optional features of the Direct3D 11.0 API. This is only supported in compute shaders (mm, I think..) and also in compute shaders 5.0. Note this precludes Nvidia shipping Direct3D 11 shaders with Direct3D level 10 and with compute shaders with double precision support (which in fact some of their hardware support it (read GTX 200))..
Of course it's expected than Nvidia Fermi which is in turn a GPU designed for Direct3D 11 among other things supports it..

Note I will release soon a tool for checking support for this and all the other optional bits of a Direct3D 11 driver.. .
Currently AMD supports it in 58xx series but due to buggy Microsoft Shader compiler it's not currently feasiable to work with..
Also by the same reason as before don't expect DirectCompute 4.0 enabled drivers to expose double support on 48xx cards..
(see pot beyond3d..)
It's expected that this get fixed in the next release (say December 2009) or the next (say March 2010 for GDC 2010)..

Note that double precision support in DirectCompute is expected to work simply declaring variables as doubles..

OpenCL
======

Currently in OpenCL 1.0 double precision support it's also an optional feature exposed as and extension (cl_khr_fp64).

To enable support for it you need to declare support for it in you kernel code:
#pragma extension Opencl: cl_khr_fp64

Currently there is only one CPU implementation and one GPU device supporting it and that is CPU:Intel CPUs with SSSE3 (Core2 and higher,Phenom..) on Apple implementation in Snow Leopard.
GPU: Nvidia 195.39 driver

Target avaiability on other platforms is more fuzzy an all that can say AMD engineers is that it will be avaiable through 2010..
All this is related to broad math library which is exposed also as part of this extension and which carries a lot of functions with very strict precision requerimients. In fact don't expect all of this functions to be supported directly by any hardware implementation soall of it has to be coded, tested, and validated which in fact takes time.. Hopefully AMD engineers have said that expect support to come gradually.. and perhaps we can expect to be able to add, substract, multiply double precision values by the end of the year.. but it seems it will be hacky since you couldn't expect the extension to be reported as supported..

Note that also in previous betas of AMD SDK at least for the CPU it was possible to denote variables by float but it will be promoted to floats, now it currently fails
saying you have to use:
#pragma extension Opencl: cl_khr_fp64
and if you use then it's said it's not supported.

I will expect Fermi to come first with support with this extension (at least full support) as Nvidia is claiming so much strong support in Fermi and also because of Nvidia advantage on exposing OpenCL extensions at the moment.. Perhaps having luck we can have towards very late 2009..

UPDATED: Nvidia supports in 195.39

OpenGL GLSL
===========

Yeah, even today, in GLSL, one way of the venerable ways of doing GPGPU computations, it possible to doing computations in double precision support.

First remember that in GLSL, and by the way also in OpenGL ES 2.0, there is a precision qualifier for variables. Also note that double is a reserved keywork of the language at least in latest incarnations, so the road to support it is well prepared. In fact for the then capped Longs Peak or anyway the full redesign of OpenGL, double precision support was supposedly coming (say between mid 2007-mid 2008), but then was finally get out of the plans.
Anyway AMD supports double precision in GLSL shaders in her cards supporting it (see CAL section for more info), altough by any concievable reason, it has not been fully publicited. In fact I found it when reading some russian forums which in turn date the support back to March or April 2009 (so in 9.3 or 9.4)..
Ok way back then I added support for a Mandelbrot program I have been given by my brother wich supported very efficiently single precision support..
In fact the miraculous words wery by then to use this precision qualifier:
__doublepAMDX
so instead of a float variable temp:
float temp;
use:
__doublepAMDX float temp;

similar for vecs:
vec2 pd;
to:
__doublepAMDX vec2 pd;

Note that AMDX denotes an AMD experimental extension, when I checked support for it in 58xx (and 9.10 or 9.11) the shader failed to compile,
now the correct way is: __doublepAMD. So seems to have left the experimental stage..

Note that is support is more broadly than I have told you in fact there are functions to pass at least double precision values to uniforms also..
In fact this is necessary in my Mandelbrot shader to pass the windows coordinates of the zoomed area or a similar value since if not the when you zoom pass single precision allows you to, then as quality is not afected by computing in double precision the precision of zooming it is..

We think also that Nvidia is going to add double precision in shaders (GLSL) soon as some header from Nvidia Linux drivers come with:
ext_gpu_shader_fp64

Another topic is when that support will get standarized (as EXT or ARB extension or promoted in the core specification..) Hopefully for Direct3D 11 like OpenGL say OpenGL 4.0 in 1 to 2 years time..

UPDATED: Nvidia confirmed for Fermi (GTC videos) seens in 195.39
Emulating double support:
=========================

The last change you have when a feature is not supported is trying to emulate.
Fortunately this is possible to add by using two single precision values (GLSL functions):

vec2 dblsgl_add (vec2 x, vec2 y)
{
vec2 z;
float t1, t2, e;

t1 = x.y + y.y;
e = t1 - x.y;
t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
z.y = e = t1 + t2;
z.x = t2 - (e - t1);
return z;
}

vec2 dblsgl_mul (vec2 x, vec2 y)
{
vec2 z;
float up, vp, u1, u2, v1, v2, mh, ml;

up = x.y * 4097.0;
u1 = (x.y - up) + up;
u2 = x.y - u1;
vp = y.y * 4097.0;
v1 = (y.y - vp) + vp;
v2 = y.y - v1;
//mh = __fmul_rn(x.y,y.y);
mh = x.y*y.y;
ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
//ml = (fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;

ml = (x.y*y.x + x.x*y.y) + ml;

mh=mh;
z.y = up = mh + ml;
z.x = (mh - up) + ml;
return z;
}

This two functions add and multiply two values (one using .xy and other .zw)..
The good part is that double precision can be initialized from single precision values copying it to the first component (.x and .z) and other to zero (.y and .w).

Examples:

1. Initializing:
posd.y=position.x;
posd.x=0.0;

2. Adding and multipling 2 DP values:

dblsgl_add(posd.xy,vec2(0.0,offsetX));
dblsgl_mul(zoomv,posd.zw);

3. Substracting one from other is easy using -y:
dblsgl_add(x,-y);

Link cuda forums:

Using similar ideas albeit more complicated you can also get quad precision support from 1 vector of 4 single precision values.
Search:

The performance in Radeon 5850:

Emulated double: 15fps
Double: 35fps
Float: 127fps

In GTX 275: (TBD)

So at least in Mandelbrot you can expect a 2.33 slowdown by emulating it.
Note that also double performance vs float comes as 3.6 slowdown.
Finally note that in Nvidia GTX 275 where float to double slowdown is going to go up to 8x then perhaps emulating it through single precision is near in performance to native precision..

Also one thing to note regarding using Nvidia vs AMD GLSL compilers is that Nvidia one makes arithmetic simplifications wich are required to be not done to achive the emulation.. currently to overcome that we use Cg runtime which supports disabling math optimizations to shaders via passing flags to the shader compiler and also seems to be eating well GLSL kernels compiling it with cgCreateProgramFromFile and selecting GLSL CG_PROFILE_GLSLV, CG_PROFILE_GLSLF profiles and using CG_OBJECT and
using as parameters {"-oglsl","-bestprecision"}. Here -bestprecision is the flag to disable math optimizations and with it double precision works like a charm..

Note that this way doesn't work correctly on AMD cards so we must use one way for each card..

This last observation makes us see that Nvidia GLSL compiler is more clever at optimizing than AMD one albeit that's not good this specific example..

Using doubles with textures
===========================

At least with CUDA it possible to pass double arrays with textures specifiyng as int2 textures without filtering..

Posted in | No comments

Saturday, 24 October 2009

Support 3d image write on CUDA and with OpenCL wrapper

Posted on 17:54 by Unknown

One of the exciting things coming this year fall is that new and updated GPU Computing APIs and hardware will provide native support for executing 3d simulations..

By that I mean simulation running on 3D grids such as 3D wave equations, Kirchoff reverse time migration, 3d fluid simulations, etc..

Until now this has been as tricky as running GPGPU programs with GLSL i.e. it wasn't naturally supported by hardware. Let me expain in more detail why I think so:

First notice that things as 2D simulations have been always naturaly placed for hardware as 2D textures, renderbuffers or framebuffers are naturrally 2D arrays of data supported by hardware.., also 2D textures also present some great features which can be nice for 2D similations:

* First is that they provide in hardware (named texture samplers) clamping and filtering modes which can avoid doing special code for treating boundaries in simulation code and also use the additional ALU power exposed by texture filtering hardware (i.e doing limited linear combination on neighboring values..)

* Second is that they have special cache resources devoted to improve accesses with 2D spacial coherence and as such they can provide more bandwitch than expected in this situation.. (say getting the same value and the 4 boundary values)

* Third is that they support addressing by (x,y) coordinates obviating to use ALU power to transform from linear mem to 2D coords: p[x+y*width].

Textures in principle were thought as read only resources (by GPU) and were only written to in CPU code, but with tecniques as render to texture and FBOs (in OpenGL world) they were allow to be written by shaders.. Also with MRT (multiple render targets) i.e. binding multiple textures for renderers to write to it were possible..

Now as fragment shaders are naturally executed as elements of a 2D grid we can naturally (i.e. with high performance) expose a 2D wave simulation using a 2nd order FTD in time for example with three texture (2D Arrays) of the size wished for the simulation (max 8Kx8K for Direct3D 10 and 10.1 hardware and 16Kx16K for D3D 11 hardware)..
Then every simulation step we setup a FBO and atach 1 texture as framebuffer for rendering (write to texture) while the other two are passed to the shader as readonly texture and then we interchange textures..
The shader has only to run the FTD code..
Note in that case we don't need to use MRT at all..
Better yet we don't need three textures for that example as we can simply use 1 texture also with three channels (components) i.e. a RGB texture.

Please also notices how big are getting the texture limits.. In fact a 16K^2 texture with RGBA byte colors is of size 1Gbyte.. or for a simulation a R (1 channel) float texture is of the same size.. So just it would fit in a current top end card as ATI 5870 with 1Gbyte RAM. Note that a 8K^2 of a wave simulator would just fit (assuming texture with 3 float components don't size 16 bytes per pixel)
Also as I remainder I often think that if I would do a realistic flight simulator and I would just use as elevation data (DEM) currently the highest resolution avaiable to public (SRTM) is of 90m between samples so a 16K^2 texture would cover and area of 1500Kmx1500Km greater than my country (Spain which wieghts 1000kmx1000km approx.) and would weight assuming 1 component which 16 bit floats 512MBytes and which would provide also up to 4m. error as Everest has 8000m approx. and it has 11 bits of precision.. but in my country up to 2 meter resulution as it has mounts less than 4000m.

Also note that also exist 1D textures (3D textures stuff later..) and have limits
more higher in fact I think are up to 2^24 pixels which in fact can be up to 256Mbytes (using 4 components floats)..

Notice also that this kind of simulations is well mapped also because it can be written in a form that every simulation point can be calculated from others.
Anyway I mean is a kernel doing on gather and not scatter.

In fact all this stuff (2D simulations exposed as gather things) go well in this arcane GPGPU world..

Then come CUDA that among other things exposed the avaiabilty to do scatter..
i.e. writting to arbritrary position in kernels.. In fact also memory was also exposed a byte addressable one at least in this model. This one (general access to memory i.e. scatter and being byte addressable) is one of the three key innovations of CUDA among the avaiability of grouping threads in groups with the avaiability to sinchronize between them (second one) and use fast shared mem between them (third one).

All in all we can say this doesn't brought dramatic changes (in concept) to a 2D wave simulation (as threads in a CUDA grid also are grouped as a 2D grid(see more later..)) solver perhaps other than grouping threads for cooperatively reading blocks of memory and reading from that which would surely lessen memory bandwith..
In fact, altough not would be very efficient, it's possible now with CUDA and global mem atomics (in CUDA 1.1 December 2007) to express the wave kernel as an scatter one where a position updates attomically all other ones it affects (which equals to incrementing attomically a memory reference).. Better yet (with CUDA with compute support 1.2 in CUDA 2.0 (june 2008)) we can update atomically from shared mem..

Up to now CUDA as 2D array we are using "normal" global linear memory and forgetting about textures.. and we lessen memory bandwidth usage by using shared memory but we must use ALU power to transform from (x,y) addressing to linear position in memory and viceversa and also lost the use texture cache optimized for 2D usage.. We must say at this point this is the classic CUDA tradeoff of using "computing mode" vs "graphcis mode".. In fact we also lose also the power of the ROPs and rasterized in CUDA mode.. Anyway since CUDA 1.1 textures had also been present but only as read only days as in OpenGL days before all the render to texture tecniques(via pbuffers (2002-2003) and FBOs(2005 and up).. this was up to CUDA 2.2 (may 2009) where texture from pitch linear memory functionality was added (see DDJ article CUDA for the masses)..

With this you can allocate from linear memory memory that with special (x,y) access (taking care of pitch) you can write from kernels and then bind directly to CUDA textures (without copies I think..) to use in kernels as read only but avoiding (x,y) address calculations and using texture caches.. Also presumably (I feel like a speaker with more intuition which in turn can be bad than experience some times..) you don't need to use shared mem for improving mem bancdwith and use that for as cache for other things in the kernel..

All in all this is all we can do up to now (in CUDA world ) and that is very good..

In a 2D world we can say perhaps the "biggest" limitation right now comes if we want to display the simulation results using graphics API i.e. interoperabilty with the graphics APIs.. You would tell me that CUDA supports interop with OpenGL and since CUDA 2.1 (june 2009) OpenGL interop is pretty efficient doing copies in GPU memory before then it was going through system memory..
and that's true but also the CUDA texture has to be passed through a PBO.. and then from that copied to OpenGL texture..
Note that this is going to be improved I think soon (since was as an error said it was supported in CUDA 2.2 texture buffer object interop)..
I'm also assuming that there are no issues using OpenGL interop with textures created from pitch linear memory (I strongly believe that but can also be wrong and/or things could be worse when direct texture interop is present..)
Also be aware that to use interop from a running kernel from linear mem would have to create a texture object eitherway a CUDA tex object and use interop as explained or sending data to CPU and pass data to gl(Sub)TexImage2D.. eitherway CPU transfers are involved..

Also remember that DirectX interop with textures seems better and to not need device copies (i.e. same mem is used in CUDA and Direct3D)..

Also remember that another important issue can be also GPU switching between kernels (in this example to switch between CUDA mode and OpenGL mode)..
I recall of reading in Nvidia forums and employee saying to not switch if possible more than few times every frame and that just makes sense now than Fermi is said to improve this switch time by 10 times.. but still in the order of 10s of usec TODO..

Also remember that graphics interop is one of the key points of the very latest GPU Computing APIs (DirectCompute and OpenCL) that enable using the same resources within graphics shaders and within compute shaders avoiding copies..

Also another issue can come in MultiGPU scenarios which one GPU is used for simulating and onther one is used for displaying..

Special care has been taken very recently by both Nvidia and AMD for MultiGPU scenarios..
First Nvidia added in CUDA 2.3 "improved efficiency using Tesla cards for computing and Quadro for displaying" pretty vague.. bah, .. and that's for rich people you would say.. ok, go lower level and read new extensions..

In subsequent Forceware 190 series two extensions have recently added:

GL_NV_copy_image (july 2009)
GL_NV_texture_barrier (august 2009)

First is THE ONE for supporting efficient texture transfers between devices:

"This extension enables efficient image data transfer between image
objects (i.e. textures and renderbuffers) without the need to bind
the objects or otherwise configure the rendering pipeline. The
WGL and GLX versions allow copying between images in different
contexts, even if those contexts are in different sharelists or
even on different physical devices."

Note this is multiOS (linux and windows) and that you can expose also rectangles subregions to copy via CopyImageSubDataNV..

I think this feature is what is used for Tesla+Quadro efficient interop or at least in the new Quadro Digital Video pipeline..

But also remember that this exposes OpenGL multiGPU interop not CUDA multiGPU interop and this would require to interop the OpenGL texture with the CUDA
texture.. which is coming as said above..

Now return to our OpenGL only GPGPU world:

Let's talk about the other new extension GL_NV_texture_barrier:
"

This extension relaxes the restrictions on rendering to a currently
bound texture and provides a mechanism to avoid read-after-write
hazards

"

So this extension provides also the ability to read simultaneously from textures
being written.. but it's rather limited.. but for example is good with this extension to read from the texture the current value and then writing to it based on this value and possibly another of other textures..
So this is really good in fact for our 2D wave equation solver if we were using three textures we could really use two.. updating t-2 texture to t texture and t-1 becomes t-2 texture.

Also good for OIT with shaders supporting scatter (see below for shader scatter and a upcoming OIT article)..

Also this example use of the extension seems good:

Another application is to render-to-texture algorithms that ping-pong
between two textures, using the result of one rendering pass as the input
to the next. Existing mechanisms require expensive FBO Binds, DrawBuffer
changes, or FBO attachment changes to safely swap the render target and
texture. With texture barriers, layered geometry shader rendering, and
texture arrays, an application can very cheaply ping-pong between two
layers of a single texture. i.e.

X = 0;
// Bind the array texture to a texture unit
// Attach the array texture to an FBO using FramebufferTexture3D
while (!done) {
// Stuff X in a constant, vertex attrib, etc.
Draw -
Texturing from layer X;
Writing gl_Layer = 1 - X in the geometry shader;

TextureBarrierNV();
X = 1 - X;
}

However, be warned that this requires geometry shaders and hence adds
the overhead that all geometry must pass through an additional program
stage, so an application using large amounts of geometry could become
geometry-limited or more shader-limited.

In fact this extensions explains how 2D simulations running needing more than 1 texture can optimize performance by being attached as a texture array and using a geometry shader for selecting wich layer to write to (reducing texture ping pong time), and using TextureBarrierNV(); for assuring texture updated state..

All of this is related to OpenGL and in first case MultiGPU.. but please recall that in CUDA for MultiGPU scenarios perhaps you can do two another things:

Assuming you don't need graphics visualization (interop) the best thing (at least for the 2D wave simulation) theoretically it is to split load and to work with pinned mapped system mem (which is a linear mem) shared between devices (in turn pinned system mem feature also is a feature of CUDA 2.2 (avaiable on CUDA 1.2 compute devices and upper) which allows very important things:

* That the GPU operates directly on system memory directly (read and write) and transfers are done without CPU intervention via GPU DMA buffers at very high speed
(up to 80% PCIExpress theoretical bandwith)
* Avoids explicit mem transfers. Transfering when you need and using all the memory latency hiding techniques avaiable in the GPU arch. (i.e. using execution resuorces while waiting for mem to be avaiable)
* Said to improve performance in Vista Windows 7 with WDDM drivers on which Windows manages GPU mem ops (said by Nvidia engineer)..

Note 1: pinned memory refers to transfering data with DMAs without intervention of the CPU achieving very high speed transfers and was since CUDA 1.0.

Note 2: CUDA 1.1 added executing kernels and doing mem transfers independently without CPU usage.

If you think about it system pinned is more fine grained than "simultaneous kernel execution and H2D or D2H transfers" as it's can execute kernel as mem requests are satisfied contrary to waiting for full transfer and also that it allow this benefist in the model of:
1.trasfer mem data to device
2.execute kernel using this data
3.transfer mem data to host
while for using "simultaneous kernel execution and H2D or D2H transfers" you have to create two or more streams (or command_queues in OpenCL parlance) of this model.

Note 3: Fermi and CUDA 3.0 will add bidirectional simultaneous transfers to the mix which would in theory be usable by both system pinned mem and using streams.
Note: Fermi and CUDA 3.0 will also simultaneous execution of the kernels
which if you think is only usable with multiple streams and allows arbitrary simultaneous execution of any two steps from two streams one of each stream:
K-M (since 1.1)
M(H2D)-M(D2H) (CUDA 3.0)
K-K (since CUDA 3.0)
What I don't know for sure is if two memory transfers of same direction (D2H or H2D) will be executed in parallel (anyway this will not improve things as PCIE bus has a bandwidth in each way that is very good used by a simple stream at least as certain minimum size apply.. see DeviceQuery in shmoo mode)

Anyway one important thing also present in CUDA 2.2 is the "shared" thing in shared pinned mem. Shared refers to pinned host memory being pinned for more than one device (being or not mapped)..
With this feature the same host mem (where 2D wave simulation date stays) can be used from DMA engines from multiple cards.. before that a host area was pinned only for one device but not the other so say if one device needed to transfer data from one GPU to another one way (to host) goes at 5.5Gbytes/s and to the another device perhaps at 3Gbytes/s (well with Nehalem brutal memory impormevents non pinned memory goes well at above 5Gbytes/s)..
Note this also not would be very bad for fluid simulation as each GPU would allocate a region as pinned mem and only transfer more slowly boundary data (say 1 row for example it depends on the order of the discretization in y axe for example..)
Adding shared system mem to the mix now each GPU transfers at full speed and without CPU intervention for coordinating the steps:

in parallel{
GPU1
{
while()
{
does one step and increments byte1 by 1 modulo 3 in shared system mem when finished
waits for byte2 set as byte1
}
}
GPU2{
while()
{
does one step and increments byte2 by 1 modulo 3 in shared system mem when
waits for byte1 set as byte2
}
}
}

Now return to OpenGL only world.. remember we said before that we can run in CUDA
mem scatter kernels thank to the avaiability to scatter operations and atomic operations (but only in linear mem).. this are not supported in OpenGL.. well not exactly one limitation is already going be removed soon.. in fact a OpenGL extension shows up in Catalyst 9.10 drivers and Radeon 5850:
AMDX_random_access_target
and have info from AMD that DX11 OGL extensions are planned to get released (documentation about extension) by early 2010..
And Nvidia is also not sleeping with its OGL extensions for Fermi as also
shader model 5 extensions have been leaked unintentionally I think, on a header of some Nvidia Linux driver in late May 2009..
(in fact there are more see post copied from my post on OpenGL forums..)

This will surely add adding renderbuffers and/or textures (possible also 1D Buffers) which are writable to random positions in addition to MRT wich allows only writing to the implicitely fragment position given to the shader.. This would allow writing to 3D textures also (more later).. This will in fact will also allow doing OIT in a OpenGL world using OpenCL for some other things and interop to use data from one API in the other.. (find more in another post)

But for running our scatter based kernel we need atomic operations.. well,
based on Direct 3D Shader Model 5.0 and haven seen that AMD is in the works of
exposing gl_AMD_gpu_shader5 seems that atomics to fragment shaders are coming also.. as said that at least is exposed already in Direct3D 11 Shader Model 5.0 which allows graphics shaders (at least fragment shaders) to have buffers/textures which in addition to being writable to arbitrary positions (in fact there a R/W Byte Buffers which are byte addressable as linear mem)..
In fact Direct3D Shader Model 5.0 allows more, in fact fragment shaders can use atomic operations (and possibly other types of fragments)..

So in fact in Shader Model 5.0 a scatter kernel of the 2D wave equation seems possibly to do.. Note that graphics shaders still lack the concept of groupings and consequently access to some "local mem" or synchronization between them..

Finally note that with DirectCompute 5.0 you have all the needed things of CUDA (at least for single GPU usage) so you have groups, shared memory, atomics to both global and local memory, and textures which in turn can be writable with random access if desired.. So all that would have to be investigated is in MultiGPU scenarios.. Also graphics interop is there..
To close one thing that keeps me intrigued is the Anandtech 5870 review were they run a Compute Shader 4.0 demo (ocenCS demo form nvidia sdk) and is able to run in MultiGPU and improve performance.. I have to see if it has multigpu support builtin or is in AMD drivers.. also see if is the code how is coded..

Similar in OpenCL has the same features altough you will need extensions so you have:
so you have groups (core), shared memory (core if local mem is reported>0), atomics to both global (a group of extensions global atomics) and local memory (extensions local atomics), and textures (check IMAGE_SUPPORT), byte addresable mem ( byte addresable ext.) which in turn can be writable with random access if desired..
Note that all features are presently in Nvidia OpenCL implemenation on CUDA SM 1.1 devices and on should AMD 58xx cards over time (currently none is present)..
(note atomics are not supported in 48xx and also byte addressable is not but IMAGE SUPPORT should be there over time.. local mem is tricky as is restricted to no arbitrary writing)
Graphics interop is not present in implementations see my last post..

Well before switching to 3D Wave simulation let's talk about ATI some more:

CAL has texture support
ATI CAL has similar graphics interop:
TODO

Having talked of the l

see CUDa forums
See wawes 3d
also optimize 3d access for warps es decir funcion que calcul
adresas con minima instruction

Posted in | No comments

Friday, 23 October 2009

About OpenCL OpenGL interop..

Posted on 14:44 by Unknown

Right now the only way of getting it, is in Apple implementation in Snow Leopard and works for Nvidia and AMD cards (GPUs)..
For other platforms (OSs) seems that is getting more time than hoped but finally the promising OpenGL interop is coming soon..
as in the last week the cl_gl_khr_sharing was published in Khronos registry and was also added to the specification document now at revision 48..

Right now seems that the support of OpenGL OpenCL in Snow is not source code stable as per the new release as the property for context creation changes from CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE to CL_CGL_SHAREGROUP_KHR and also requires to pass
in addition the current GL context CL_GL_CONTEXT_KHR..
Note that seems that OpenGL interop cannot created on contexts having CPU devices attached on this platform..

On all platforms this interop requires first creating an OpenGL context and then creating a OpenCL context passing info about the OpenGL context with is windowing system dependant (and also some info about the drawable associated to the OpenGL context in non Apple platforms).

On Apple platform a context with OpenGL interop is created by getting the CGL share group associated to the wished OpenGL context, and passing that to clCreateContext via the property (in first parameter) CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE.

See the following code snippet from Procedural Grass demo in compute_engine.cpp:

if(bUseOpenGLContext)
{
printf(SEPARATOR);
printf("Compute Engine: Using active OpenGL context...\n");

CGLContextObj kCGLContext = CGLGetCurrentContext();
CGLShareGroupObj kCGLShareGroup = CGLGetShareGroup(kCGLContext);

cl_context_properties akProperties[] = {
CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE,
(cl_context_properties)kCGLShareGroup, 0
};

// Create a context from a CGL share group
//
m_kContext = clCreateContext(akProperties, 0, 0, clLogMessagesToStdoutAPPLE, 0, 0);
}
else
{
uint uiAvailableDeviceCount = 0;
iError = clGetDeviceIDs(NULL, kRequestedDeviceType, uiCount, akAvailableDeviceIds, &uiAvailableDeviceCount);
if (iError != CL_SUCCESS)
{
printf("Error: Failed to locate compute device!\n");
return false;
}

m_kContext = clCreateContext(0, uiAvailableDeviceCount, akAvailableDeviceIds, NULL, NULL, &iError);
}

Also note that per new spec code below (defining cl_context_properties akProperties[]) should be changed to:
cl_context_properties akProperties[] = {
CL_GL_CONTEXT_KHR,
(cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR,
(cl_context_properties)wglGetCurrentDC(), 0
};
cl_context_properties akProperties[] = {
CL_GL_CONTEXT_KHR,
kCGLContext
CL_CGL_SHAREGROUP_KHR,
(cl_context_properties)kCGLShareGroup, 0
};

Some Apple demos use it as Nbody, Displacament mapping, Grass..
all you have to do to enable disable the support in it is via defining USE_GL_ATTACHMENT to 1 or 0.

If I remember correctly in Apple also there is a flag when creating a OpenGL context for specifying that we want that device where is created it supports OpenCL contexts also..
(for other OSes is similar functionality provided by the extension..)
I will try to update it with more info and perhaps provide samples for every OSs..

For more info see (perhaps you have to login in ADC): "Using OpenCL with OpenGL” section of "OpenCL Programming Guide for Mac OS X"
which is currently no avaiable.. (PDF)
Also seems that for AMD cards 10.6.2 update will make them finally robust and stable in OpenCL..

Also I have found bits in AMD SDK 2.0beta4 showing cl_gl_amd_sharing and also CAL GL interop..
The case of CAL GL interop is at least for smiling since it's reported as supported by the CAL API extensions reporting function since at least a year (see Everest builds in Windows which provide CAL information..) But noneless is in the Catalyst 9.11 (fglrx 8.67 in Linux )release
where finally we can find bits of calGLAssociate function and others as this.. Still not working as they aren't exported by CAL libraries..
Also the GL header (cl_gl.h) in AMD OpenCL SDK includes the bits of the extension..

The Nvidia OpenCL library shows no signs of OpenGL interop but you have to remind that the latest version released in late September 2009 was the same as released by 10 September to registered developers and the driver is dated 28 August 2009 so almost two months old now..
Also Nvidia is aware that this is a current limitation as it list with the release notes
This no precludes Nvidia SDK demos from having support for it/ making use of it..
You have to use #define GL_INTEROP in the source code or compile with -DGL_INTEROP..
Anyway this won't fully enable the demos as the context must be created with OpenGL interop,
basically:

(windows)
cl_context_properties akProperties[] = {
CL_GL_CONTEXT_KHR,
HGLRC handle,
CL_WGL_HDC_KHR,
(cl_context_properties)HDC handle, 0
};
(linux)
cl_context_properties akProperties[] = {
CL_GL_CONTEXT_KHR,
GLXContext handle,
CL_WGL_HDC_KHR,
(cl_context_properties)Display handle, 0
};
(windows&linux)
m_kContext = clCreateContext(akProperties, 0, 0, 0, 0, 0);

Interesting is how this will mix with say frameworks as GLUT, SDL, specific window creation code for OpenGL 3.0 contexts, etc..

If I had a cristall ball and made money predicting the future :-) I will say OpenGL interop is coming in Nvidia 195.xx series.. Yes that are supposed (by me) to have also CUDA 3.0 support and also Fermi support.. so is I think dependent on Fermi release but say before of the year
and perhaps in very late November/early December..
Note that I expect OpenGL interop improving in CUDA also with that release namely that CUDA supports texture interop (texture buffer object support) in addition to existing VBO, PBO to direct interop also (right now I think the interop is get copying buffers in GPU memory which is anyway fast (but reduces memory with buffer duplication))..
Note that this support was by error saying as supported by CUDA 2.2 beta release for registered develeopers back in March 2009 with first Nvidia 185.xx drivers..

For AMD I will be surprised if they have before end of the year (as they have much work left see my post on OpenCL exntesinos) and I will be happy if they have by late February or March 2009..

In the meantime I will try to provide OpenGL emulation in my OpenCL wrapper library for SDKs with image support..
I will update the post to show how to modify demos from AMD, Nvidia and Apple OpenCL SDKs to use
this extension..
Also the coming soon Windows demos of the Apple demos will have this enabled..

Posted in | No comments

Improved OpenCL-Z!

Posted on 14:15 by Unknown

One important OpenCL app for me is OpenCL-Z since it provides information of avaiable implementations in a nice GUI.. similar to CUDA-Z or GPU-Z.. CAL-z anybody?

As a consequence of some binary compatibility instability it's very hard for supplying one binary that is working future proof (see previous post)..

I have solved this by providing a OpenCL builtin wrapper that wraps the functions with the two calling conventions defining for every function two function types. This is possible as OpenCL-z uses dl_open HandleLibrary for getting to the pointers bypassing the static .lib library stuff..
Once the mess is solved I will use an array of pointers for every function..

Actually OpenCL-z finding of libraries is hardcoded to find only AMD and Nvidia implementations (but anyway Nvidia is searched in the standard place also where eventually is going to be put the OpenCL ICD for Windows) :

1) Nvidia implementation:
a)$WINDIR/system32/opencl.dll (wich works on Win x64 for both x32 and x64 with 190.89 Dll's)
b)/usr/lib/opencl.so (wich works for both Linux x64 and x32 with 190.89 .so)
2) AMD implementation:
a)$ATISTREAMSDKROOT/lib/{x86,x86_64}/opencl.dll
b)$ATISTREAMSDKROOT/lib/{x86,x86_64}/opencl.so

I will also add support for adding locations of implementations in a text file..

I have been able to achieve that before the full wrapper since OpenCL-Z only uses few functions 5 or so.. . Well for checking device binaries support I need, say 10 more.. but less than the 6x or more need by OpenCL..

Also I have had to fix support for more than one platform and also for platforms supporting more than 1 device.. and realtime changing of device information..
Currently I have one bug namely that platforms for which no device is returned crash the initialization (I will try to fix before I post..).. : this can be when for example in a Win7 machine with ATI and Nvidia cards both running I disable one card (for example disabling the scree attached to it)

Also I have included logos for AMD Stream platform, and Intel platform..
Well seems I have to add Apple and S3 implementations to the mix.. and well not also an IBM one for the Cell..

I have added few missing key feature checks (as they allow to currently differentiate the two currently implementations) : OpenGL interop, image support, and if OpenCL imp can get device binaries and build programs from that binaries..

I have got it working on Linux with minor tweaks, thanks WxWidgets library, to the code and Code::Blocks project file.

Future work is to move it to use my OpenCL wrapper, port to Snowleo and add as CUDA-Z two key performance metrics:

* Peak Gflops for integer, SP and DP floating point ( using MAD kernels (for int,int24,float and doubles))
* Device bandwith, Device to Host and Host to Device bandwith (with both direct and mapped access and paged and pinned memory)..

Some other key performance metrics worth investigating for adding to the program are:

* Atomics performance in both local &global mem (now that Fermi is coming) and with use in Append Consume buffers..
* Cache test. Minikernels for studying how potential it has (using for example SPMV kernels..)
* OpenGL/OpenCL interop
* Test HD video GPU decoding->OpenGL->OpenCL->OpenGL using say constant spread filters, etc..
* Simulatenous kernel and device transfers (is supported in CUDA and seems that also in CAL and Brook+ but in OpenCL, anybody?)
* Simultaneous device to host and host to device transfers (for checking using by the Fermi dual DMA engines and also 5xxx has it?)
* Simultaneous kernels execution (also for checking Fermi and 5xxx implementations using saying two kernels using both half of the compute units and with equal load (but anyway different kernels))..

and other metrics for OpenCL extensions as global data share, SAD.. info about maximum threads in flight
Also checking mem as memtest for device memory porting existing cuda programs or estimating bit error failure rate?
Also demos as GPU Caps Viewer, etc..

Both the three

Screenshoots: (linux and windows showing 3 devices..)

Posted in | No comments

About binary compatiblity on OpenCL..

Posted on 13:41 by Unknown

Hi,
if you have following OpenCL shipping implementations it's a mess to provide executables that are future proof running tomorrow.. Important part of that is that ABI of the libraries has not been stabilized (I mean calling conventions and names of exported functions)..
A brief resume is that (on Windows):
May 2009 Nvidia ships with cdecl calling convention and clean exported names (say clGetPlatformIDs exported name is clGetPlatformIDs)
August 2009 AMD ships with stdcall calling convention and dirty exported names (say _clGetPlatformIDs@12)
October 2009 AMD ships with stdcall calling convention and clean exported names

Seems that on Octber 2009 (found from an Nvidia employee on Nvidia forums) Khronos agreed
that on Windows functions must use stdcall calling convention following Windows APIs and OpenGL ICD..

So seems that currently AMD SDK is using finally with all the right bits..

Now it's time to fix Nvidia libraries (Nvidia employee on Nvidia forums agreed that next SDK will have it fixed)..

All this effort is a the key first step part of getting an OpenCL ICD on Windows..
Note that I don't understand why but seems that all of this is on Windows x32 as in Windows x64
all works fine.. perhaps is that headers (cl_platform) check _WIN32 and is not in effect in Win64 or that the two calling methods are equivalent on Win64..

Note that on Linux currently there are no issues..
So briefly the conclusion is that examples from one SDK are running on the another (nvidia,amd) either way (well excepting implementation differences as lack of existing extensions)

I say all of that because of two things:
* I want to have a future proof OpenCL-Z! (see next psot)
* I want to make an OpenCL wrapper supporting multiplatform support something I'm not fully sure it's going to be added in OpenCL ICD.. This would add
Also I want to add three things to the wrapper:
* OpenGL emulation on OpenCL backends with image support (implement gl_khr_sharing emulation support)
* Getting AMD IL code (join previous work)
* Building kernels AMD IL code or device assembly..

Posted in | No comments

AMD IL backend for LLVM and getting AMD IL in MacOS?

Posted on 12:00 by Unknown

If you remember my previous wrapper has as a limitation that doesn't work on MacOS as ATI doesn't ship CAL libraries in MacOS and AMD support in MacOs seems to do not depend on CAL libraries. Now that seems that finally OpenCL is going to be "working" on AMD 4xxx GPUs in Snow Leopard in 10.6.2 possible next week,
a proper way of getting AMD IL from kernels is lacking.. As a "posible" measure use this solution..
Well, of course I'm supposing it works with Darwine (name recently changed to Wine ).. checked with wine-1.1.31 and works..

Regarding OpenCL on MacOs:

* Wow things are moving fast and now are avaiable Drivers for Radeons 46×0/45×0/43×0 in hackintosh computers.. with that OpenCL support on AMD cards on MacOS is virtually identical to that of AMD Stream SDK 2.0 well expecting 5xxx series..
* You have a "full suite of benchmarks" here (which will be avaiable here in short time for Windows)

Another side effect of using this method (that is avaiable as part of AMD Stream 2.0 SDK is that you can use it as and AMD IL backend for LLVM ..
Specifically you can use:
./llc -regalloc=amdil -filetype=asm -march=amdil kernel.bc -o kernel.il
see this for details

Badly it has the incovenient that generated code seems of bad quality (I mean not optimized)
fact that has been confirmed by AMD employee..

Posted in | No comments

Building OpenCL kernels from AMD IL code or device assembly code!

Posted on 09:25 by Unknown

If you have read my previous post and downloaded the wrapper and read the readme.txt contained in you know I have provided a wrapper for getting AMD IL code from programs using the CAL API ( I'm specifically interested in OpenCL programs running on AMD GPU's)..

Well now the next step is to support building kernels form this code!
This is well supported at the OpenCL API level but unfortunately AMD doesn't support it currently (2.0beta4)..
For a complete overwiew of OpenCL on AMD GPUs stay tuned for a forthcoming post..
Fortunately I provide a solution based on an OpenCL wrapper plus .
This has at least two advantages at least:
1) Should eliminate OpenCL kernel compiling time, as we can inject AMD IL code or device assembly code.
2) We can change the code of OpenCL kernels.
Two cases come to my mind:

2.a) Making slight modifications to AMD IL code to support not yet supported functionality in
OpenCL implementation but supported at the CAL level, i.e. moving more fast than AMD :-) For example we can add double precision to OpenCL kernels but with extra work (post coming soon explaining how to convert a VectorAdd from SPFP to DPFP..), and perhaps using SAD, MAD24, MUL24, etc.. once we have a new AMD IL document for 5xxx hardware..
2.b) We can optimize/tune or whatever we want to do to the AMD IL code so we can hopefully reduce register pressure, reorder code, etc..
See how the power of getting at the assembly level can push Matrix Multiplication to near 1TFlop on AMD 4xxx boards..

Code: Get it! (comming soon..)

Posted in | No comments

Wednesday, 21 October 2009

A CAL wrapper for getting AMD IL from OpenCL AMD GPU backend!

Posted on 18:09 by Unknown

I have implemented a wrapper around calclCompile primarly for getting AMD IL code as is gets generated by the OpenCL AMD GPU backend.
It prints these AMD IL code onto the stderr stream as well as AMD device assembly (using calclLink and calclDisassembleImage).
This is a temporary measure to get the AMD IL generated by OpenCL AMD GPU backend as of now is not possible to use in AMD OpenCL backend the facilities for getting
built device binaries offered by OpenCL. Also at the moment is not possible to use this device binaries to skip compiling the kernels on AMD backend, but this limitation could possibly be removed by using this wrapper jointly by with a similar and special made OpenCL wrapper. I will code one is some people send requests in my blog.

Note that assuming in future using AMD OpenCL backend you can get AMD IL code, then you can get assembly code running AMD SKA, assuming SKA gets updated or using caclLink in your code, but if AMD OpenCL binaries return assembly code there will not be an easy way of obtainig AMD IL code without my tool or another CAL wrapper.

The wrapper is named calclwrapper.dll or calclwrapper.so and you have to insert into the executable you want to get the AMD IL code from.
This wrapper should work with CAL executables and OpenCL executables using an OpenCL AMD GPU device.

You have a different method for every platform for loading the wrapper library:

* On Windows use the executable withdll.exe from Microsoft Detours library to force "insert a .dll" into the process.

* On Linux you can use the LD_PRELOAD env variable to force the load of a library by an executable.

I provide the source code for Windows & Linux as well as the wrapper library an example scripts to use it.
Currenlty it supports only Windows 32 bits. Also Linux library is tested on 32 bit platforms altough building it should be ok on Linux 64 bits.
Note it's currently impossible to use in the AMD GPU backend of Snow Leopard as this it isn't currently using (and possibly never?) CAL independent libraries/frameworks.

I have tested on Windows 7 x64 and Ubuntu 9.04 x32 on ATI 5850 using ATI 9.11 beta drivers.

Get it!

Posted in | No comments

GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 31 October 2009

IBM OpenCL support!

Whises for OpenCL 1.1 and more!

3D Vision and Direct3D 11

H264 harware decoding/ encoding GPUs

Interop GPU computing graphics apis stuff

Thursday, 29 October 2009

3D vision good stuff

Getting PTX, AMD_IL from languages:

Updated CUBLAS before CUDA 3.0

ATI and Nvidia extensions for DX11 and 10.1 ARBs

OpenGL 5870 extensions

Nvidia 195

News from the web!

Sunday, 25 October 2009

Mem export in OpenCL

Double precision support in GPU computing APIs and GPUs and emulating it..

Saturday, 24 October 2009

Support 3d image write on CUDA and with OpenCL wrapper

Friday, 23 October 2009

About OpenCL OpenGL interop..

Improved OpenCL-Z!

About binary compatiblity on OpenCL..

AMD IL backend for LLVM and getting AMD IL in MacOS?

Building OpenCL kernels from AMD IL code or device assembly code!

Wednesday, 21 October 2009

A CAL wrapper for getting AMD IL from OpenCL AMD GPU backend!

Popular Posts

Blog Archive

About Me