GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Sunday, 21 March 2010

What's for CUDA 3.1 and OpenGL 3.3/4.1!

Posted on 12:37 by Unknown
Let's see what CUDA 3.0 final adds vs. the beta:

*adds full BLAS support
*OpenCL local atomics
*OpenCL and CUDA D3D9-11 interop..
*updated guides since the beta..
still no PTX 1.5/2.0 specs..
also nv-cl extensions published now: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/opencl_extensions/cl_nv_compiler_options.txt

Interesting notes.
*Float16 (half) textures are supported in the runtime
*CUBLAS is complete, and Fermi is IEEE 754 compliant
 *SGEMM performance on Fermi-based GPUs is 30% lower than expected.
    It will be fixed in 3.1.
*The stability of the large-prime FFT transform (signals with a length
    that is prime and >64k samples) is extremely variable, giving single-
    precision accuracy in the range 0.005->0.025. In general, smaller signals
    experience greater accuracy.
*This package will work on Mac OS X running 32/64-bit.
      *     CUDA applications built as 32/64-bit (CUDA Driver API) are supported.
       *    CUDA applications built as 32-bit (CUDA Runtime API) are supported.
           (10.5.x Leopard and 10.6 Snow Leopard)
Note: x86_64 is not currently working for Leopard or Snow Leopard
*CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.
 *  CUDA applications using CUDA Runtime APIs can only be built as 32-bit applications.
SDK Release 3.0 Final:
* Replaced 3dfd sample with FDTD3d (Finite Difference sample has been updated)
* Added support for Fermi Architecture (Compute 2.0 profile) to the SDK samples
* Updated Graphics/CUDA samples to use the new unified graphics interop
* Several samples with Device Emulation have been removed.  Device Emulation is 
  deprecated for CUDA 3.0, and will be removed with CUDA 3.1.
* Added new samples:
   concurrentKernels (Fermi Capability)
* Bug Fixes
They have also added simpleMPI..
I have to test it with Intel MPI 4.0.

Mac notes:
cuda.dylib is 64-bit and has the 195 API; the 195 and 185 dylibs are versioned as 195_96 or 185_55..
*has cuda-memcheck but no cuda-gdb
*the CUDA kext is a fat binary with 64-bit code, and so is cuda.dylib, so CUDA driver API applications are 64-bit compatible
and compilable..
Note you can also boot the 64-bit kernel thanks to the kext..
cudart is 32-bit only.
In theory we could then write a cudart wrapper over the CUDA driver API and compile it as 64-bit, more so
now that cudart is stateless and interoperates with CUDA driver memory allocations..

All that is needed then is for CUBLAS and CUFFT to be compiled as 64-bit..
We have source code for CUDPP, Thrust and CUSP, and in the meantime Volkov's matmul, FFT and LAPACK codes,

so all of these could be compiled as 64-bit if we had a 64-bit cudart, and then see what happens..
Well, I have compiled deviceQueryDrv and matrixMulDrv as 64-bit
(am I the first in the world to have 64-bit CUDA binaries on Apple, apart from NVIDIA?)
I got rid of cutil, though compiling it as 64-bit would be no problem. Some notes:
nvcc on Mac defaults to 32-bit while gcc defaults to 64-bit on Snow Leopard,
so for 64-bit you must pass -m64 to nvcc.
But for CUDA driver API projects nvcc is not needed for the host code: you can use g++ for the driver API code and compile the CUDA
files to PTX with nvcc -ptx.
If you use nvcc with -m64 you get 64-bit CPU code, but with -ptx do you also get PTX code
using 64-bit pointers for Fermi?
If you can use 32-bit pointers on Fermi, it is better to use 32-bit pointers..
So for matrixMulDrv, use nvcc -ptx for 32-bit pointers and g++ (-m64) for the host code and you get a 64-bit binary,
but in cuModuleLoadDataEx I get the error
CUDA_ERROR_POINTER_IS_64BIT     = 800,      ///< Attempted to retrieve 64-bit pointer via 32-bit API function
when loading the PTX, whether I build it with nvcc -m64 or plain nvcc (both with -ptx),
so PTX with 32- or 64-bit pointers doesn't change that..
I have to compare the PTX files with 32- and 64-bit pointers to see the differences, also with sm_20..
Also note that for nvcc -m64 to work, even when it is not actually needed, /usr/local/cuda/lib64 must exist,
so I copied lib to lib64 (a symlink also works),
and now you can run it..
I have to write a tutorial on using CUDA and nvcc to build Mac OS X fat binaries (i386 and x86_64).
*I see an nvcuvid library for Mac in the GPU Computing SDK, but only 32-bit:
/C/common/lib
and /C/common/inc/cuvid

Anyway, I have a 64-bit libcuvid (vs. libnvcuvid) in /usr/local/cuda (where did I get it from?)
*There is also a preference pane with auto-update that shows the GPU driver version and CUDA driver version..


Note: the OpenCL samples on Mac do not work until 10.6.3..


Good: NVIDIA documents the OpenCL behavior that the spec leaves undefined (implementation-specific):
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ImplementationNotes_3.0.txt

issues with mac..

A perfect OpenGL 4.1/3.3 release would have:
*ext_direct_state_access
*ext_separate_shader_objects
*RW textures (3D also): ext_image_load_store
*binary shaders (GL ES 2.0 API)
In theory you could use some IR from the 3Dlabs front-end compiler source,
or translate to HLSL via some translator (AMD's hlsl2glsl?) and then use a binary HLSL shader.
Another good translator:
http://code.google.com/p/angleproject/
It has a flex/bison GLSL parser and also a GLSL-to-HLSL translator (ES 2.0).
Going from binary to DX IL is done via:
fxc /dumpbin
But DX IL to binary? And how to go from DX IL to HLSL or GLSL directly?
I have also found that Wine more or less parses DXBC files:
/dlls/d3d10/effect.c
static HRESULT parse_shade

NV OGL extensions:
*Fermi function pointers and recursion for GLSL?
They would be a good addition to the bindless extensions and shader buffer load..

CUDA 3.1:
*cuda-gdb OpenCL HW debugging support..
*pinned GPU memory interop with MPI over InfiniBand.. (announced at SC09 for spring 2010)

*template for a DirectCompute project
Currently there is no template for a DirectCompute project, but NVIDIA will be
    providing one soon.
*Fix CUBLAS SGEMM performance on Fermi (30% faster)
*Fix the CUFFT performance regression: vs. 3.0 beta it went from 180-190 GFLOPS to 150 GFLOPS
*Provide an official cudaasm/decuda, or documentation of the cubin/ELF format for sm_20 devices? Also for sm_10?
*PTX 1.5, 2.0 docs?
*Updated OpenCL best practices guide for Fermi? The CUDA best practices guide is updated, but for Fermi?

*Surface functions: RW textures with x,w addressing etc., also 3D image writes. Headers and exported functions were in the beta but removed in the final..

Also a CUDA-to-CPU compiler, or is gpuocelot mature enough, with Mac and Windows ports available?
A direct PTX-to-CPU code converter would be good, using the gpuocelot lib as cudart and CUDA API..

Mac

*add cuda-gdb (with OpenCL too) and the OpenCL Visual Profiler

On Mac, 2 OpenCL examples do not work

CUDA/OpenGL interop is slow on Mac
ship
is going to work with fermi cuda.kext


*Related: the first 195-series (197) WHQL driver for Quadros, enabling OpenCL on these devices:
Adds support for CUDA 3.0 for improved performance in GPU Computing applications. See CUDA for more details. 
This driver resolves fan speed issues reported with version 196.75 drivers.
Adds support for the Open Computing Language (OpenCL) 1.0 in Quadro FX Series x700 and newer as well as the FX4600 and FX5600.
*NVIDIA mentions a compute cluster driver, but it is 196.28, not updated since early February. Anyway, the D3D interop
that was finally added is not needed there.
According to Pierre Boudier, we can expect OGL 4.0 drivers soon, and also an image write and random access extension like D3D11 RWTexture..
Ubuntu 10.04 fglrx 8.72:
fglrx-installer (2:8.721-0ubuntu1) lucid; urgency=low 

* New upstream release: 
- Restore compatibility with kernel 2.6.32 and xserver 1.7 (LP: #494699). 
- Add Passive Stereo support on workstation (FireGL/FirePro) hardware. 
- Add Eyefinity support (more than 2 monitors on Radeon HD 5xxx hardware). 
Officially WS-only but should work on consumer boards as well. 
GL_EXT_shader_subroutine GL_EXT_timer_query

Also, what about 3D stereo on Linux:
*3D Vision for OpenGL quad-buffered (QB) stereo on Quadros with the stereo connector is here..
*3DTV for Linux, so OpenGL QB stereo can be output over HDMI 1.4 on Linux? This could also work on low-profile Quadros, since the stereo connector is not needed (it is not needed for 3D Vision either; it is NVIDIA's way of artificially limiting it to very high-end Quadros, except perhaps for better sync..)
Also, if they add H.264 MVC to VDPAU and you decrypt Blu-ray 3D with AnyDVD HD, you would in theory be able to watch it on Linux with GPU-accelerated decoding, sending it to TVs via HDMI 1.4..
Let's also see how Windows is handled, since DXVA 2.0 does not support MVC? Neither does cuvid, so let's see if they add it to cuvid too..
So it seems CyberLink and the rest will get some library from NVIDIA, or what?
*ATI has hooks for D3D9, 10 (D3D11?) in 10.3; fglrx 8.72 also adds passive stereo for OGL QB (active stereo is already there, right? But for 120 Hz LCDs too?)
Let's also see how ATI manages output to HDMI 1.4 TVs, via the iZ3D partnership or otherwise. In fact I expect iZ3D only hooks D3D stereo and AMD will add HDMI 1.4 stereo output on top of these hooks, so an SDK or documentation for the hooks would be good..
It would also be good if NVIDIA published its stereo SDK (promised at GDC 2010); I hope its hooks (D3D9-11) will also work with 3DTV and output to HDMI 1.4 TVs.. In fact they should, as Avatar and 3D Vision presumably use these hooks..
Mac is out of scope here..

Also, NVIDIA may be late with Fermi, but not with the software supporting it..
D3D11 with CS 5.0 is here, and we now have D3D11 interop for CUDA in 3.0, D3D11 interop through an OpenCL extension, and OptiX D3D11 interop..
We have D3D11 interop with:
*CUDA 3.0
*OpenCL
*OptiX
HW debugging:
Nsight.
All that remains to be released is Nsight, which will also bring D3D11 support (HW debug and profile). It will be good to HW debug CUDA, D3D11 CS and CUDA with D3D11 interop, and to trace OpenCL and OpenGL (will 4.0 be traced?)..

Also, will Cg 3.0 have D3D11 support? And SM 5.0 / OpenGL 4.0 support, i.e. tessellation shaders with GLSL output?
Note cgc 3.0 is shipping in the Tegra SDK and also as part of the OpenGL compiler in the NVIDIA 195 drivers..
I have seen CgFX working with OptiX and CUDA on a blog, so I hope they ship an example soon..
http://lorachnroll.blogspot.com/2010/03/mixing-nvidia-technologies-thanks-to.html


GeForce GTX 480:
- GPU: GF100 @ 700MHz
- CUDA cores: 480 @ 1401MHz
- Memory: 1536MB GDDR5 @ 1848MHz 384-bit
- TDP: 250W
- Price: $499US
GeForce GTX 470:
- GPU: GF100 @ 607MHz
- CUDA cores: 448 @ 1215MHz
- Memory: 1280MB GDDR5 @ 1674MHz 320-bit
- TDP: 225W
- Price: $349US

- 3D APIs: OpenGL 4.0 and Direct3D 11
- GPU Computing: OpenCL, CUDA and DirectCompute
- 3-way SLI support



GeForce GTX 480 : 480 SP, 700/1401/1848MHz core/shader/mem, 384-bit, 1536MB, 250W TDP, US$499

GeForce GTX 470 : 448 SP, 607/1215/1674MHz core/shader/mem, 320-bit, 1280MB, 225W TDP, US$349
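
As a rough sanity check on the two spec lines above, the peak single-precision throughput and memory bandwidth can be derived from the listed clocks. This is only a sketch: it assumes 2 FLOPs (one fused multiply-add) per CUDA core per shader clock, and that the listed GDDR5 clock transfers 2 bits per pin per cycle. Neither assumption is stated in the notes, and the function names are made up for illustration.

```python
# Figures copied from the spec lines above. The two formulas are assumptions:
#   peak GFLOPS  = cores * shader clock (MHz) * 2 FLOPs / 1000
#   bandwidth    = memory clock (MHz) * 2 transfers * bus width in bytes / 1000
CARDS = {
    "GTX 480": {"cores": 480, "shader_mhz": 1401, "mem_mhz": 1848, "bus_bits": 384},
    "GTX 470": {"cores": 448, "shader_mhz": 1215, "mem_mhz": 1674, "bus_bits": 320},
}


def peak_sp_gflops(cores, shader_mhz):
    """Peak single-precision GFLOPS, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * shader_mhz * 2 / 1000.0


def mem_bandwidth_gbs(mem_mhz, bus_bits):
    """Memory bandwidth in GB/s, assuming two transfers per listed memory clock."""
    return mem_mhz * 2 * (bus_bits / 8) / 1000.0


if __name__ == "__main__":
    for name, c in CARDS.items():
        gflops = peak_sp_gflops(c["cores"], c["shader_mhz"])
        gbs = mem_bandwidth_gbs(c["mem_mhz"], c["bus_bits"])
        print(f"{name}: {gflops:.2f} GFLOPS, {gbs:.2f} GB/s")
```

Under those assumptions the GTX 480 comes out at roughly 1345 GFLOPS and 177 GB/s, and the GTX 470 at roughly 1089 GFLOPS and 134 GB/s.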

Note there are also C++ math libraries with GLSL- and OpenCL-like vec4 types:
*GLM has strict GLSL compliance..
and with the experimental GTX extensions we even get SIMD implementations..
*DX SDK feb 2010 has XNAMATH 2.02 SIMD math library
also read:
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

HDR good maps:

http://www.hdrlabs.com/sibl/archive.html


NVIDIA employees' blogs:


http://timothylottes.blogspot.com/
http://jamesdolan.blogspot.com/
http://industrialarithmetic.blogspot.com/
http://castano.ludicon.com/blog/

http://twitter.com/castano

http://twitter.com/tmurray_cmpxchg

showing max cuda mem:
http://forums.nvidia.com/index.php?showtopic=102682 cuda maxmem

caustics patents:
US patent applications: 20090096788, 20090096789, and especially 20090128562,


The LLVM 2.7 binaries are available for testing:
http://llvm.org/pre-releases/2.7/pre-release1/
http://amnoid.de/tmp/clangtut/tut.html
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2009-May/005167.html
http://synopsis.fresco.org/
Performance inconsistencies when testing various bit-counting methods 
ubuntu cheat cube:119834-cheat-cube-ub
ie9 VML to SVG Migration Guide
windows phone 7:
*XNA 4.0 CTP available; it works on PC but only with the Reach profile, not HiDef..
*unlocked image with all apps: instructions on a blog..
*Petzold samples and book excerpt available..
*also a SQLite port -> csharp-sqlite.wp

Windows 7 XP Mode now supports CPUs without hardware virtualization (Intel VT-x / AMD-V)..
Windows 7 SP1 virtualization news:
"With Microsoft RemoteFX, users will be able to work remotely in a Windows Aero desktop environment, watch full-motion video, enjoy Silverlight animations, and run 3D applications," Microsoft's Max Herrmann writes, "All with the fidelity of a local-like performance when connecting over the LAN."
Will CUDA work with it? I.e. no need for the compute cluster driver, plus OGL, DX and interop support..
Q: Will RemoteFx support also OpenGL hardware acceleration which is the 3D high level API used by professional applications like CAD systems or medical applications ?

A: RemoteFX will support certain OpenGL applications. However, as the development of RemoteFX is still ongoing, it is too early to provide any specifics at this point.
Q: Do you plan to introduce RemoteFX for Windows 7 as well, because there are many scenarios where the remote system is not a server but a high-end workstation?
A: RemoteFX has been designed as a Windows Server capability to support the growing demand for multi-user, media-rich centralized desktop environments. Windows 7 will be supported as a virtual guest OS under Hyper-V.

Dynamic Memory is an improvement to Hyper-V which allows users to pool all available physical host memory together, and dynamically allocate it to virtual machines. In other words, if the workload changes, VMs can get access to extra memory without having to shut them down.

XNA forums:
Updated list of D3D12 suggestions
Unable to perform a recursive call with DirectCompute? 
How to AttachBuffersAndPrecompute to ID3DX11FFT
RWStructuredBuffer counter
IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).
Gamefest 2010 presentations?
D3D11 / D2D Interoperativity
329M pairs/sec radix sort performance, 408M keys/sec - crushes CUDPP numbers
AppendStructuredBuffer driver bug?
How to debug DirectX 11 Compute Shaders?
Creating a Shared Surface with DXGI


atomic
I have some questions about RWStructuredBuffer:
1. How to copy the hidden counter to system memory? CopyStructureCount
2. How to reset the counter to zero? Via the last argument of OMSetRenderTargetsAndUnorderedAccessViews
3. Why is this counter so much faster than InterlockedAdd on a buffer element? (HD 5670)
IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).
How to AttachBuffersAndPrecompute to ID3DX11FFT?

http://gephi.org/

http://forums.xna.com/forums/t/49607.aspx
Thank you. I forgot about debug version of the D3DCSX. Debug message proved to be helpful. For the record: 1. The number of buffers attached must be exactly the same as in D3DX11_FFT_BUFFER_INFO. 2. The views MUST be created with the D3D11_BUFFER_UAV_FLAG_RAW flag (although it wasn't mentioned in documentation).



The Chrome dev channel release has support for an Open GL ES 2.0 interface 
for Native Client. This is something we said we would do sometime last year. 
When we consider it stable, documented etc. we will do more of an 
announcement.

Google are announcing that NaCl now also supports x86-64 and ARM.
http://www.osnews.com/story/23021/Native_Client_Portability_Almost_Native_Graphics_Layer_Engine
NaCl_SFI:Adapting Software Fault Isolation to Contemporary CPU
Architectures
pnacl: Portable Native Client Executables

from GDC:

These are also graphics API translation layers:
Cider & Cedega: Direct3D on OpenGL
GameTree.tv: Direct3D on OpenGL ES
SwiftShader: DX Software Rendering (also WARP)
ANGLE Project: WebGL (OGL ES 2.0) on Direct3D

Now we need the same for GPGPU APIs:
CUDA on OpenCL?
CUDA on CAL?
DirectCompute on OpenCL?
OpenCL on DirectCompute?

posted on opengl and cuda forums:
Questions to nvidia:
*Is NVIDIA going to expose ext_gpu_shader_fp64 on GT2xx hardware with double precision, or is it only for D3D11 hardware?
For example the GTX 275.
AMD seems to support double precision in GLSL via doublepAMD even on 4850 cards..
Also, with the initial GL 4.0 drivers, is NVIDIA finally going to publish documentation for wgl_nv_dx_interop and ship the texture writing and random access support shown at GTC?
Via ext_image_load_store?
Please post the PTX 1.5 and 2.0 documents..
I'm also listing here the things NVIDIA has promised soon, so let's see how long it takes before we get:
*cuda-gdb support for hardware debugging of OpenCL kernels
*cuda-gdb GPU debugger for Mac (with OpenCL support also)

Mac related:
Is mac 64 supported?
This package will work on Mac OS X running 32/64-bit.
           CUDA applications built as 32/64-bit (CUDA Driver API) are supported.
           CUDA applications built as 32-bit (CUDA Runtime API) are supported.
           (10.5.x Leopard and 10.6 Snow Leopard)
Note: x86_64 is not currently working for Leopard or Snow Leopard
CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.
    CUDA applications using CUDA Runtime APIs can only be built as 32-bit applications.




My mac notes:

nvcc matrixMul_kernel.cu matrixMulDrv.cpp  -I../../common/inc/  ../../lib/libcutil_i386.a matrixMul_gold.cpp -Xlinker /usr/local/cuda/lib/libcuda.dylib
nvcc matrixMul_kernel.cu -c -m64
g++ matrixMul_gold.cpp matrixMulDrv.cpp  -I../../common/inc/ -I$CUDA_INC_PATH -L$CUDA_LIB_PATH /usr/local/cuda/lib/libcuda.dylib ../../lib/libcutil_i386.a

for nvcc -m64, create lib64 as a copy of lib
nvcc -m64 deviceQueryDrv.cpp  -I../../common/inc/ -I../../../shared/inc -Xlinker /usr/local/cuda/lib/libcuda.dylib
remove cutil
nvcc defaults to 32 bits
gcc defaults to 64 bits
g++
g++  deviceQueryDrv.cpp  -I../../common/inc/ -I../../../shared/inc  /usr/local/cuda/lib/libcuda.dylib -I$CUDA_INC_PATH

//#include
#define CU_SAFE_CALL_NO_SYNC(a) a
//CUT_EXIT(argc, argv);

export CUDA_BIN_PATH=/usr/local/cuda/bin
export CUDA_LIB_PATH=/usr/local/cuda/lib
export CUDA_INC_PATH=/usr/local/cuda/include
export PATH=$PATH:/usr/local/cuda/bin




Thursday, 18 March 2010

raw data..

Posted on 11:32 by Unknown
games:
*Metro 2033 and Just Cause 2 demos available! (Fermi launch titles?)
*Assassin's Creed 2 and Bad Company 2 this month also..
*Command & Conquer 4: Tiberian Twilight
*3D Vision CD 1.23 has Direct3D 11 support! (so it lists support for the D3D11 Fermi supersled demo)


Internet Explorer 9 preview with Direct2D and DirectWrite support


*3D texture based separable convolution, extension of SDK example
code:
http://forums.nvidia.com/index.php?showtopic=163382

*The binary format for Fermi is similar to PTX: see Luebke's post on the gpgpu-sim mailing list.
One guy from PathScale says he has all the info on this and other low-level details, presumably the PTX 1.5/2.0 specs (binary format spec?), and also info for an open source CUDA driver for BSD etc.?
*GPGPU-3 papers available!
http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-FinalProgram.pdf
*CULA 1.2 available with some eigenvector/eigenvalue routines..

*"GPU Sample Sort" paper for the upcoming IPDPS 2010 conference?
It is possible to achieve much higher sorting rates for NV devices than with the Satish/CUDPP methods. You might be interested in our radix CUDA sorting results here at UVA. We demonstrate 480M pairs/sec, and 550M keys/sec on our GTX285 (with other devices evaluated as well). Interestingly enough, our keys-only results on the NV GT200 architecture are superior to the cycle-accurate sorting results from the (defunct) 32-core Larrabee.
Where is the source?

http://www.cs.virginia.edu/~dgm4d/papers/RadixSortTR.pdf
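
For reference, the per-digit passes that a key-value radix sort repeats can be sketched on the CPU as follows. This is a plain sequential LSD radix sort written for illustration only; it is not the UVA implementation or the CUDPP one, and the function name and the 4-bit digit size are arbitrary choices. On a GPU each of the three steps (histogram, prefix sum, scatter) would be a parallel kernel.

```python
def radix_sort_pairs(pairs, key_bits=32, digit_bits=4):
    """Stable LSD radix sort of (key, value) pairs with unsigned integer keys."""
    radix = 1 << digit_bits
    mask = radix - 1
    for shift in range(0, key_bits, digit_bits):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * radix
        for key, _ in pairs:
            counts[(key >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot of each bucket.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # 3. Stable scatter: pairs keep their relative order within a bucket.
        out = [None] * len(pairs)
        for key, value in pairs:
            digit = (key >> shift) & mask
            out[offsets[digit]] = (key, value)
            offsets[digit] += 1
        pairs = out
    return pairs


if __name__ == "__main__":
    print(radix_sort_pairs([(5, "a"), (1, "b"), (5, "c"), (0, "d"), (300, "e")]))
```

A "keys-only" sort, as quoted above, is the same loop without the value payload, which is why it reaches higher rates than the pairs variant.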

Other new sorting papers:
*Revisiting Sorting for GPGPU Stream Architectures
 "GPU Sample Sort" paper for the upcoming IPDPS 2010
N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In Proc. Int'l Parallel and Distributed Processing Symposium (IPDPS), to appear, 2010 (currently available at http://arxiv1.library.cornell.edu/abs/0909.5649).
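
The general idea of sample sort (pick splitters from a random sample, distribute elements into buckets, sort each bucket) can be sketched like this. It is a sequential toy version for illustration, not the algorithm from the paper: the bucket count, oversampling factor and the use of Python's built-in `sorted` for the buckets are all assumptions made here.

```python
import bisect
import random


def sample_sort(data, num_buckets=4, oversample=8, seed=0):
    """Toy sample sort: splitters from a random sample, then bucket and sort."""
    sample_size = num_buckets * oversample
    if len(data) <= sample_size:
        return sorted(data)  # small input: fall back to a direct sort
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, sample_size))
    # Take every `oversample`-th sample element as a splitter (num_buckets - 1 of them).
    splitters = [sample[i * oversample] for i in range(1, num_buckets)]
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        # bisect_right returns an index in 0..num_buckets-1 for this element's bucket.
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))  # on a GPU each bucket would be sorted in parallel
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    values = [rng.randrange(1000) for _ in range(200)]
    print(sample_sort(values)[:10])
```

Oversampling makes the splitters more representative, so the buckets end up closer to equal size, which is what matters for load balance on parallel hardware.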

*CUFFT does support streams, and seems to include the 3D FFT performance improvements from the SC08 paper, so
Apple's FFT code now seems to work on NVIDIA OpenCL but has a 2x-3x performance disadvantage vs. CUFFT..

2D to 3D video conversion:
we have RealD and other DirectShow plugins..
now:
ArcSoft Sim3D Plus HD coming Q2..
and PowerDVD 10..
*TrueTheater™ Stabilizer
*TrueTheater™ 3D
*TrueTheater Noise Reduction
PowerDVD 10 Mark II: Consumers who purchase PowerDVD 10 Ultra 3D will receive a FREE UPGRADE that enables support of the Blu-ray 3D format and 2D to 3D conversion of video files. Available this summer.
Blu-ray 3D playback requires FREE "Mark II" upgrade which will be available soon.
lot of betas coming:
qt 4.7
intel compiler 12
vmware workstation 7.1
other march:
openrl
heaven 2.0

http://www.cs.utk.edu/~dongarra/WEB-PAGES/cscads-libtune-09/
1st CUDA Developers' Conference
http://www.smithinst.ac.uk/Events/CUDA2009
see
"Looking after the 7 dwarfs: numerical libraries / frameworks for GPUs" Mike Giles
 also
"The Art of Performance Tuning for CUDA and Manycore Architectures"
David Tarjan (NVIDIA)
Kevin Skadron (U. Virginia)
Paulius Micikevicius (NVIDIA)


cudpp 1.1.1 svn has fermi support
cusp has amg geometric multigrid..
http://forums.nvidia.com/index.php?showtopic=163382&st=0&#entry1022104

See DirectX 9.0 on OpenGL ES 2.0 ->http://www.gametree.tv/ linux sdk

Coming in Spring 2010, the GameTree.tv Publishing SDK for Intel CE hardware will include the tools you need to optimize and debug your game for the GameTree.tv Gaming Platform, plus the ability to order Intel CE hardware.
Developer Tools & Documentation      available      available
OpenGL ES 1.1 and 2.0
 - Windows Game Development and Emulation
 - Linux Desktop Runtime SDK     available     available
Direct3D® support
 - Fixed-Function
 - Shader Model 1.0 and 2.0 API
 - Linux Desktop Emulation SDK     available     available
Debugging With Visual Studio     Coming March 2010     available
GameTree.tv Developer Forums     Coming Soon     available
Publish Games For Commercial Sale       
Detailed Hardware Setup Documentation   
Hardware Order Process       
Developer Relations Support   
fglrx 8.72.5 has Ubuntu 10.04 support and OpenGL 3.2.97xx (partial OpenGL 3.3/4.0 support?)


Nvidia theater GDC notes:
DMM2:
The free DMM2 version is limited to 1500 objects max (Star Wars: The Force Unleashed does not use more); it has interop with PhysX and Bullet, and also adds
DirectCompute and OpenCL simulation.

Beta shipping in September/October.
Plastic simulation and fracture mode are still not ready.. It calculates stress on the volume, so breaking is physically based..
It uses fp32 for GPU support, and SSE..

3D Vision on Unreal Engine 3 shipping in April..
3D Vision SDK soon: code samples etc., developer tricks for Surround.
Surround recommends gfx400 in SLI and the "release 256" driver.

The published Khronos GDC sessions have
info on physics: AMD OpenCL SPH and soft bodies, no rigid bodies; this is Bullet work..
The FEM simulation is DMM2 work..
No more interesting talk slides? There is FFT profiling for OpenCL by an NVIDIA employee.

PhysXLab with (precalculated) destruction is in beta now, with Unreal Engine 3 integration.

New Unigine 2.0 this month, on the 26th: does it have Linux support? And Windows OpenGL tessellation support with Fermi / 5xxx cards?
Nsight, GTX 480: March 8 release
Nexus 1.0 OpenGL and OpenCL analyzer: not a hardware debugger, but like gDEBugger GL+CL


Fermi games: Just Cause 2 (D3D10 only), Metro 2033 (D3D11 optional)
http://nvidia.fullviewmedia.com/gdc2010/agenda.html
OpenGL Extensions Viewer supports OpenGL 4.0, and GLEW has support in trunk!
Assassin's Creed 2, Bad Company 2
ati open 3d
nvidia 3dtv

cuda and visual studio:

QUOTE
- create empty cuda projects through "project.."

You can just create an ordinary console project and then add .cu files to this project (see next point).

QUOTE
- add new .cu files through "add new item" (renaming c++ or txt in .cu files causes build errors)

If you add the CUDA build rules (Cuda.rules, distributed with the SDK) then VS will automatically detect the .cu files and pass them to nvcc to compile these to standard .obj files, the standard linker (link.exe) will then link these with the rest of your application's .obj files.

QUOTE
- doesn't highlight code in .cu files

See the instructions in (SDK_INSTALL_DIR)\C\doc\syntax_highlighting\visual_studio_8

QUOTE
- must copy a thousand times cutil64.dll around till it releases the program ...

Cutil is used to minimise code replication between the SDK samples. I'd advise understanding what you actually need and implementing it yourself. For example, most people only want the cuda safe call macros and you would be better off handling the error in a manner suitable for your app rather than just calling exit().

QUOTE
- must add a "thousand" new libraries not to cause build errors

By "thousand" do you mean one (cudart.lib)?! Ok, so you're using cutil so you need cutil64.lib too. But by definition using any library (and the CUDA API is provided through a library) you have to link with libraries.

QUOTE
- and even then its not sure if it runs

Can't help with that one (without more info).

I would advise the following.

Preparation:
  • Set up syntax highlighting
  • Set up Intellisense

Development:
  • Create a new, empty, console project (or you can use an existing project if you have one)
  • Add your .c, .cpp and .cu files
  • Add the Cuda.rules
  • Modify C/C++ code generation to use /MT in release, /MTd in debug
  • Do the same for the Cuda code generation
  • Add cudart.lib to all configurations (i.e. release and debug)
  • Build, run, debug etc.

Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium

GPU papers:

Session 2: Scientific Computing with GPUs
Improving Numerical Reproducibility and Stability in Large-Scale Numerical Simulations on GPUs
Implementing the Himeno Benchmark with CUDA on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters
Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs

A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs

Sort
High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
GPU Sample Sort
Highly Scalable Parallel Sorting

Session 9: Software Support for Using GPUs
Object-Oriented Stream Programming using Aspects
Optimal Loop Unrolling For GPGPU Programs
Speculative Execution on Multi-GPU Systems
Dynamic Load Balancing on Single- and Multi-GPU Systems


Fisheye Lens Distortion Correction on Multicore and Hardware Accelerator Platforms
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

Dynamically Tuned Push-Relabel Algorithm for the Maximum Flow Problem on CPU-GPU-Hybrid Platforms

Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA

Inter-Block GPU Communication via Fast Barrier Synchronization

    What's left in OpenGL 4.0? and more raw info..

    Posted on 11:01 by Unknown
    A few days ago the OGL 3.3 and 4.0 specs were published, along with GLSL 3.3 and 4.0, and a set of equivalent ARB extensions was added to the registry. The OGL 4.0 compatibility spec is now 600+ pages long; the core spec is 420 pages..

    Other things:
    *OGL 4.0 quick reference card
    http://www.khronos.org/files/opengl4-quick-reference-card.pdf

    *new glext.h and updated gl3.h
    *glloader, GLEW in svn, and OpenGL Extensions Viewer already support 3.3/4.0;
    still waiting for SDL, SFML..
    *Waiting for Fermi drivers on launch day..

    Remember these are all ARB extensions, no vendor or EXT ones..

    ARB extensions not included in the 3.3/4.0 specs:

    GL_ARB_shading_language_include
    GL_ARB_texture_compression_bptc
    so the D3D11 HDR texture format is not required for OGL 4.0,
    and #include in shaders is lost as well.
    The 5xxx series includes OGL 4.0 support: is double emulated on the CPU? It would be better with double-float emulation..
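
To make "double-float emulation" concrete: a value is stored as an unevaluated sum of two float32 numbers (hi + lo), and additions use error-free transformations so the rounding error of the high part is carried in the low part. The sketch below emulates float32 rounding on the CPU with `struct`; it only illustrates the arithmetic and says nothing about what any driver actually does. All function names are made up here.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Represent a double as a (hi, lo) pair of float32 values with hi + lo close to x."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def two_sum(a, b):
    """Knuth's error-free addition, every operation rounded to float32."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def df_add(x, y):
    """Add two (hi, lo) pairs and renormalize the result."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


if __name__ == "__main__":
    hi, lo = df_add(split(0.1), split(0.2))
    print("double-float:", hi + lo)
    print("plain float32:", f32(f32(0.1) + f32(0.2)))
```

This gives roughly twice the mantissa bits of a single float32 but not the exponent range of a real double, so it is an approximation of fp64 rather than a replacement.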

    The recently spotted NVIDIA extensions:
    GL_EXT_shader_image_load_store
    GL_EXT_vertex_attrib_64bit
    and AMD's:
    GL_EXT_shader_atomic_counters
    are not included either..

    AMD 10.3 also includes the first of these extensions, blend_func_extended..
    GL_EXT_vertex_attrib_64bit adds fp64 vertex attribs,
    so for now fp64 is only for uniforms and passing between stages, not for vertex attribs.
    Remember there are no double render target or texture formats, similar to D3D11..
    GL_EXT_shader_image_load_store allows random-access writes to textures (RWTexture3D);
    AMD has amdx_random_access_target


    ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11..


    We have tessellation shaders, dynamic shader linkage and compute interop with OCL..
    Still lacking vs. D3D11:
    *multi-threaded rendering:
    remember current drivers only handle creation of resources; there is no parallel command list creation.
    Is that a driver or a hardware issue?

    *random access load/store/atomics to textures -> GL_EXT_shader_image_load_store, amdx_random_access_target + GL_EXT_shader_atomic_counters (RWTexture3D)
    *atomic access to textures and memory barriers in fragment shaders: DeviceMemoryBarrier in D3D11
    *GL_AMD_conservative_depth adds:

    Conservative oDepth - This algorithm allows a pixel shader to compare the per-pixel depth value of the pixel shader with that in the rasterizer. The result enables early depth culling operations while maintaining the ability to output oDepth from a pixel shader.

    So people on the OGL forums are criticizing the lack of:
    *multi-threaded rendering
    *shader binaries to avoid compilation, preferably cross-vendor and cross-platform like DX IL / DXBC (which is almost 100% compatible with ATI's IL)

    *direct state access
    * Epic fail for GL_ARB_sampler_objects, as there is no GLSL support..
    What I miss:
    *ext_separate_shader_objects
    The ability to separate program objects is only going to become increasingly more relevant.
    *nv_texture_barrier
    cross-process texture sharing?

    Support for programmable offsets in gather is there: see the 2x speedup in the Fermi whitepaper. A tessellation
    test on Fermi would be good too.

    Fermi:
    196.78 drivers support Fermi..
    Full support for OGL 4.0 at Fermi launch..
    The stochastic transparency paper (I3D 2010) has Fermi performance numbers for this algorithm via the OGL sample_shading (D3D 10.1) extension
    GLwgl_dx_interop
    GL_NVX_gpu_memory_info
    GL_NV_gpu_program4_1
    Published then?
    Try OpenRL with OpenCL on Fermi..
    Drivers at Fermi launch will have:
    1. CUDA 3.0 final:
    Fermi Direct3D 11 interoperability
    Fermi HW Profiler support in OpenCL Visual Profiler
    Complete BLAS lib, now with complex routines
    cuda-gdb support for JIT compiled kernels
    plus
    C++ Class Inheritance
    C++ Template Inheritance
    Unified interoperability API for Direct3D and OpenGL
    OpenGL texture interoperability

    2. A new OpenCL driver with support for:
    *pragma unroll
    *local atomics
    *ICD final
    *D3D9/10/11 support

    fxc has interface support, but how are the functions inside an interface called?

    See "CUDA_Developer_Guide_for_Optimus_Platforms"
    http://www.stumblingahead.com/blog/?p=66 will talk about tessellation soon..

    2010 conferences GPU papers:
    *PPoPP
    *GDC 2010
    *I3D 2010
    *GPGPU-3
    *ASPLOS
    MacroSS: Macro-SIMDization of Streaming Applications,
    COMPASS: A Programmable Data Prefetcher Using Idle GPU Shaders,  
    "Investigating the Impact of Code Generation on Performance Characteristics of Integer Programs."
    EUROGRAPHICS 2010
    SIGGRAPH 2010
    Interesting new/coming books:
    *Game Programming Gems 8
    *gpu computing gems 2010?
    *Game Engine Gems 1, Volume One
    *Programming Massively Parallel Processors: A Hands-
    *GPU Pro: Advanced Rendering Techniques 
    *Multigrid Methods on GPUs
    *Game Coding Complete, Third Edition
    *Video Game Optimization
    *Game Engine Architecture
    *Real-Time Cameras
    Programming Game AI by Example

    Comments:
    GL_ARB_shading_language_include -> GLSL accepts #include, and CompileShaderIncludeARB sets the <> search paths

    GL_ARB_texture_compression_bptc
    D3D11 textures -> a compressor is included, but compressing offline is better



    GL_ARB_blend_func_extended
    allows using two fragment shader outputs, one as the input color and one as the blend factors;
    see the example of a reflective colored window done in one pass using the ROPs
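
The one-pass tinted window can be sketched numerically. The shader writes two colors: output 0 is what the window adds (its reflection), output 1 is a per-channel tint that multiplies what is already in the framebuffer. The sketch assumes the blend state `glBlendFunc(GL_ONE, GL_SRC1_COLOR)`; the function name and the sample colors are illustrative only.

```python
def blend_dual_source(src0, src1, dst):
    """Per-channel result = src0 * ONE + dst * SRC1_COLOR, clamped to [0, 1].

    src0: first fragment shader output (added color, e.g. the reflection)
    src1: second fragment shader output (used only as the blend factor)
    dst:  color already in the framebuffer (the scene behind the window)
    """
    return tuple(
        min(1.0, max(0.0, s0 + d * s1))
        for s0, s1, d in zip(src0, src1, dst)
    )


if __name__ == "__main__":
    reflection = (0.1, 0.1, 0.1)   # what the glass reflects
    tint = (0.5, 1.0, 0.5)         # green glass lets green through, halves red and blue
    background = (0.8, 0.8, 0.8)   # scene behind the window
    print(blend_dual_source(reflection, tint, background))
```

Without dual-source blending the per-channel tint would need a second pass or a read of the framebuffer, since a single output color only offers its own RGB or alpha as blend factors.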

    GL_ARB_explicit_attrib_location->
    sets explicitly, in GLSL, how variables are passed between shaders

    GL_ARB_occlusion_query2
    allows a boolean result for whether anything passed or not

    GL_ARB_sampler_objects
    BindSampler( uint unit, uint sampler );
     When a sampler object is bound to a texture unit, its state supersedes that
        of the texture object bound to that texture unit. If the sampler name zero
        is bound to a texture unit, the currently bound texture's sampler state
        becomes active. A single sampler object may be bound to multiple texture
        units simultaneously.
    does not change GLSL to the HLSL style of tex.sampler
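    The binding rule quoted above can be sketched as a toy model. This is plain Python, not real OpenGL; the class TextureUnit and its methods are made up for the illustration, and None stands in for the sampler name zero.

```python
class TextureUnit:
    """Toy model of one texture unit (hypothetical, not an OpenGL binding)."""

    def __init__(self):
        self.texture_state = None  # sampler state stored in the bound texture object
        self.sampler_state = None  # state of the bound sampler object; None models name zero

    def bind_texture(self, state):
        self.texture_state = state

    def bind_sampler(self, state):
        self.sampler_state = state

    def effective_state(self):
        # A bound sampler object supersedes the texture's own sampler state.
        if self.sampler_state is not None:
            return self.sampler_state
        return self.texture_state


if __name__ == "__main__":
    shared_sampler = {"min_filter": "NEAREST"}
    units = [TextureUnit(), TextureUnit()]
    for unit in units:
        unit.bind_texture({"min_filter": "LINEAR"})
        unit.bind_sampler(shared_sampler)  # one sampler object on several units
    print(units[0].effective_state())
    units[0].bind_sampler(None)            # back to the texture's own state
    print(units[0].effective_state())
```

    The point of the design is that one sampler object can be shared across units while the textures keep their own state untouched.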

    GL_ARB_shader_bit_encoding
    with this I can use the fast float-to-int trick from Kun Zhou's SPAP paper, which takes the bits
    of a float and manipulates them to get abs, float-to-int of a value, etc.
    To obtain signed or unsigned integer values holding the encoding of a
        floating-point value, use:

          genIType floatBitsToInt(genType value);
          genUType floatBitsToUint(genType value);

        Conversions are done on a component-by-component basis.
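    To make the bit reinterpretation concrete, here is a CPU-side sketch in Python (not GLSL), assuming IEEE 754 single precision. The helper names are made up for this example, and the abs trick (clearing the sign bit) is just one plausible instance of the bit manipulation mentioned above, not necessarily the one from the paper.

```python
import struct


def float_bits_to_uint(value: float) -> int:
    """Encoding of value as a 32-bit float, returned as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def float_bits_to_int(value: float) -> int:
    """Same 32 bits, reinterpreted as a signed int."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def uint_bits_to_float(bits: int) -> float:
    """Inverse operation: reinterpret 32 bits as a float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def fast_abs(value: float) -> float:
    """abs() by clearing the sign bit (bit 31) of the encoding."""
    return uint_bits_to_float(float_bits_to_uint(value) & 0x7FFFFFFF)


if __name__ == "__main__":
    print(hex(float_bits_to_uint(1.0)))
    print(float_bits_to_int(-2.0))
    print(fast_abs(-3.5))
```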

    GL_ARB_texture_rgb10_a2ui
    GL_ARB_texture_swizzle
    GL_ARB_timer_query
    GL_ARB_vertex_type_2_10_10_10_rev


    GL_ARB_draw_indirect
    compute interop
    void DrawArraysIndirect(enum mode, const void *indirect);
    new buffer object binding point:
    DRAW_INDIRECT_BUFFER
    if a buffer is bound there, it is used as the source of the element count, etc.
    if not, is the indirect pointer used instead?
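    For reference, the record that DrawArraysIndirect reads is, as far as I recall from the ARB_draw_indirect spec, four tightly packed unsigned 32-bit integers: count, primCount, first, and a reserved field that must be zero. The Python sketch below only packs such a record on the CPU; the function name is made up, and uploading the bytes to a buffer bound to DRAW_INDIRECT_BUFFER is left out. Check the field order against the spec before relying on it.

```python
import struct


def pack_draw_arrays_indirect(count: int, prim_count: int, first: int) -> bytes:
    """Pack one indirect draw record as four little-endian uint32 values.

    Assumed layout: count, primCount, first, reservedMustBeZero.
    """
    return struct.pack("<4I", count, prim_count, first, 0)


if __name__ == "__main__":
    record = pack_draw_arrays_indirect(count=3, prim_count=1, first=0)
    print(len(record), struct.unpack("<4I", record))
```

    With a buffer bound, my understanding is that the indirect argument is treated as a byte offset into that buffer, which is what lets a compute kernel write the draw parameters without a CPU round trip.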

    GL_ARB_gpu_shader5
    GL_ARB_gpu_shader_fp64
    Should double-precision fragment shader outputs be supported?

          RESOLVED:  Not in this extension.  Note that we don't have
          double-precision framebuffer formats to accept such values.

    GL_ARB_shader_subroutine
    GL_ARB_tessellation_shader
    GL_ARB_texture_buffer_object_rgb32
    GL_ARB_transform_feedback2
    1.transform feedback objects 
    2.pause and resume transform feedback
    3.ability to draw primitives captured in transform feedback mode without querying the captured
        primitive count
    DrawTransformFeedback()
    GL_ARB_transform_feedback3

    unreal 3 news:
    *palm webos and iphone support (on mac?)
    *3d vision support
    http://www.chw.net/2010/02/29-incomodas-preguntas-para-nvidia-sobre-gf100/
    AMD Open Physics Initiative Expands Ecosystem with Free DMM for Game Production and Updated version of Bullet Physics 
    Apple adopts DirectX 11 GPUs, buys AMD Radeon HD 5750
    apple news:
    *99 dev program
    *valve games to mac next month and monkey island 2 se..
    *6core macpro next week (12 core?)Mac Pro 'hexacore' Xeon Core i7-980x coming Tuesday
    reviews on anandtech 980 gulftown with aes today..
    *amd 5750 imac in june? adds opengl 4.0 and ocl full support for mac..
    so 10.6.4 will support amd 5xxx
    *iphone 4.0 multitasking support
    *10.6.3 this month?
    CUDA:cuda-gdb gpu support and visual profilers,64 bit and efficient gl interop soon?
    http://pasco2010.imag.fr/images/poster_pasco2010.pdf
    http://unlimiteddetailtechnology.com/
    Roxio CinePlayer 3D
    CLyther = Python + OpenCL
    amd open physics (free dmm 2.0 with ocl) and open stereo(qbf stereo for radeon?)
    also eyefinity sdk coming soon..
    ticker tape available
    pgi insider feb 2010 volume
    http://www.pgroup.com/lit/articles/insider/v2n1a3.htm
    says new fermi support and data region things..
    XNA 4.0 winpho 7 tegra2 soon..
    Yellow Dog Enterprise Linux for CUDA
    http://ydl.net/cuda/iso/YDELforCUDA-6.2-20100302-DVD.iso download free for students
    Jenkins Software Announces Data Mining Tool for Game Developers
    As a further enhancement, AMD has developed new parallel GPU accelerated implementations of Bullet Physics’ Smoothed Particle Hydrodynamics (SPH) Fluids and Soft Bodies/Cloth. The new code written in OpenCL and Direct Compute will be contributed as open source.
     OpenGL usage from an ISV perspective
    intel gpa 3.0
    Unity Announces 3.0 Platform, Support For PS3, iPad, And Android
      Valve Confirms Mac Versions Of Steam, Valve Game
    http://www.raknet.net/echochamber
    Erwin Coumans - SONY - Porting existing code to OpenCL
    Ben Gaster AMD and Avi Shapira - Graphic Remedy - Debugging fluid dynamics on OpenCL
    Greg Smith - NVIDIA - FFT and OpenCL Profiling
    http://www.arm.com/community/software-enablement/google/solution-center-android.php
    http://realworldtech.com/forums/index.cfm?action=detail&id=108017&threadid=108017&roomid=2
    I can only say that at CAL level (and obviously OpenCL built upon CAL) there are numerous problems with multiple GPUs.

    You definitely need one thread and one context per GPU to make it work. But that isn't enough, because almost no CAL function is thread safe, so calling calResMap() (which is the only way to get access to local GPU memory) in one thread blocks all other threads/contexts.

    And (as I've already written on these forums), OpenCL uses the calCtxWaitForEvent() function instead of a CPU-burning loop

    while (calCtxIsEventDone(calCtx, e) == CAL_RESULT_PENDING);

    to wait for GPU kernel completion.

    But calCtxWaitForEvent() also blocks every context currently running. This is especially noticeable when there are different devices in the system (like 5770+4770). So basically it's simply impossible to work asynchronously with multiple GPUs within a single process.


    All of the above applies to the Windows version of CAL; I never tried the Linux one.

    Yup, and I use 1 thread per GPU too. So 1 thread, 1 context, 1 queue for each GPU. I tried other configurations but they weren't working (i.e. not running in parallel).

    Why on HD4870 with 512 MB onboard RAM only 128 available to OpenCL ??? 
    http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=128846&enterthread=y

    http://ctk-dev.sourceforge.net/

    gmac
    http://ctk-dev.sourceforge.net
    http://code.google.com/p/fluidic/
    http://otoy.com/
    http://www.gameenginegems.com/
    We're excited to announce a new addition to the Palm® webOS™ development platform: the webOS Plug-in Development Kit (PDK) lets developers extend their webOS applications by writing plug-ins in C or C++. The webOS PDK makes it easy for developers to leverage existing code and exposes new capabilities — including high-performance 3D graphics.
    http://code.google.com/p/gyp/source/checkout

    Sunday, 7 March 2010

    GPU computing toys!

    Posted on 09:12 by Unknown
    Hi, I would like to release some lame but hopefully useful tools:
    https://dl.dropbox.com/u/1416327/cld3d.rar

    First, OCL-D3D interop headers and specs for Nvidia and AMD, plus a tool for checking the current status.
    The headers are in h
    and cover D3D9, 10 and 11 for NV and D3D9 and 10 for AMD.
    #include the header for every D3D version you need and call initcld3d() in your code, and voila, you have the
    D3D interop functions.
    If you #define INCAMD, the AMD functions are included as well and you can avoid the AMD headers.

    With these I have compiled four exes named cl_xx_interop, which check D3D 9, 9Ex, 10 and 11.
    They check extension reporting, try to create a shared context in several ways, and then associate a D3D object and textures with OCL and acquire and release them before use.

    The cl_d3d10_interop build also shows the image formats available to OpenCL images; see the next post.

    Testing OCL-D3D11 interop
    Checking D3D interop extensions support for platform: NVIDIA Corporation
     nv D3D  9 interop extension:  Found.
     nv D3D 10 interop extension:  Found.
     nv D3D 11 interop extension:  Found.

    Using device: GeForce GTX 275
    Enabling texture interop checks: image support is supported.
    clGetDeviceIDsFromD3D11NV pointer: Found
     and it works! (returns d3d associated ocl device)
    clCreateFromD3D11BufferNV pointer: Found
    clCreateFromD3D11Texture2DNV pointer: Found
    clCreateFromD3D11Texture3DNV pointer: Found
    clEnqueueAcquireD3D11ObjectsNV pointer: Found
    clEnqueueReleaseD3D11ObjectsNV pointer: Found
    Testing context creation with
     no dev (clCreateContextFromType): OK.
    dev info (getdeviceids): OK.
    dev info (clGetDeviceIDsFromD3DNV CL_PREFERRED_DEVICES_FOR_D3D9_NV): OK.
    Testing clCreateFromD3D11BufferNV: OK.
    Testing aquire release stuff: Ok.. releasing it: Ok.
    Testing clCreateFromD3D11Texture2DNV: OK.
    Testing aquire release stuff: Ok.. releasing it: Ok.
    Testing clCreateFromD3D11Texture3DNV: OK.
    Testing aquire release stuff: Ok.. releasing it: Ok.



    It also contains optd3d, which displays the four optional D3D11 features (cap bits):

    On my GTX 200-series card it displays:


    multithreaded comand lists: 0
    multithreaded Concurrent Creates: 1
    Double precision: 0
    Compute Shader: 1

    On the ATI 5850 it displays:


    multithreaded comand lists: 0
    multithreaded Concurrent Creates: 1
    Double precision: 1
    Compute Shader: 1

    Anyway, double precision is not working with loops.
    This shows that multithreaded command lists are still not supported by ATI (is this supposed to be an implementation issue or a hardware limitation?).
    The same goes for Nvidia and the upcoming Fermi.

    I include a CLinfo, which is not mine, for checking CL info.

    report.bat creates a report.txt with the output of all these executables.
    I also include 2dbench for checking the GDI performance issues in Windows 7, which AMD will fix in Catalyst 10.4.

    There is also a highly efficient matmul for CUDA and AMD cards, and peakflops for AMD cards.

    %
    %  compute C = A*B, A:mxk, B:kxn, C:mxn
    %
    %  cubin file = ../method1/decuda_ldsb32_cudasm.cubin
    %  kernel function = method1_variant_sgemmNN
    %  use device: GeForce GTX 275
    %  m=n=k    gpu_time (ms)   flops (Gflops/s)
         32         0.044         1.391
        128         0.120        32.451
        224         0.194       107.870
        320         0.302       201.802
        416         0.445       301.033
        512         0.619       403.979
        608         1.277       327.914
        704         1.582       410.719
        800         2.618       364.210
        896         3.135       427.439
        992         4.401       413.123
       1088         6.014       398.868
       1184         6.981       442.860
       1280         8.751       446.365
       1376        10.911       444.746
       1472        13.403       443.262
       1568        16.377       438.470
       1664        18.901       454.051
       1760        22.437       452.594
       1856        25.820       461.218
       1952        31.233       443.566
       2048        33.317       480.229
       2144        39.834       460.841
       2240        44.989       465.337
       2336        51.643       459.765
       2432        56.514       474.095
       2528        64.183       468.859
       2624        72.540       463.923
       2720        79.686       470.387
       2816        85.826       484.626
       2912        96.003       479.094
       3008       108.801       465.942
       3104       121.579       458.181
       3200       126.446       482.699
       3296       138.522       481.473
       3392       153.544       473.440
       3488       168.797       468.268
       3584       177.873       482.085
       3680       193.298       480.227
       3776       212.160       472.675
       3872       229.596       470.947
       3968       246.403       472.280
       4064       260.086       480.699
    clock 1620
    %  m=n=k    gpu_time (ms)   flops (Gflops/s)
         32         0.040         1.516
        128         0.108        36.044
        224         0.173       120.900
        320         0.265       229.925
        416         0.393       341.338
        512         0.535       467.090
        608         1.107       378.021
        704         1.371       474.163
        800         2.270       420.030
        896         2.751       486.983
        992         3.804       477.992
       1088         5.205       460.925
       1184         6.003       514.983
       1280         7.609       513.393
       1376         9.396       516.463
       1472        11.555       514.134
       1568        14.145       507.666
       1664        16.427       522.442
       1760        19.387       523.784
       1856        22.182       536.854
       1952        26.860       515.777
       2048        28.642       558.623
       2144        34.530       531.627
       2240        39.585       528.868
       2336        44.440       534.292
       2432        49.141       545.226
       2528        55.274       544.429
       2624        63.241       532.134
       2720        68.451       547.592
       2816        74.160       560.865
       2912        82.945       554.516
       3008        94.150       538.449
       3104       104.581       532.653
       3200       108.907       560.436
       3296       119.277       559.158
       3392       131.982       550.785
       3488       146.003       541.376
       3584       154.088       556.502
       3680       166.307       558.166
       3776       184.523       543.469
       3872       198.692       544.196
       3968       214.158       543.390
       4064       223.720       558.838
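    The source does not say how the Gflops column is derived. Working backwards from the numbers, it looks consistent with 2*n^3 floating-point operations per multiply, divided by the GPU time, with "giga" taken as 2^30 rather than 10^9. The Python sketch below is my reconstruction under that assumption, not the benchmark's actual code; the function name is made up.

```python
def sgemm_gflops(n: int, gpu_time_ms: float) -> float:
    """Throughput of an n x n x n SGEMM, assuming 2*n^3 flops and giga = 2^30."""
    flops = 2.0 * n ** 3
    seconds = gpu_time_ms * 1e-3
    return flops / seconds / 2 ** 30


if __name__ == "__main__":
    # Rows taken from the two tables above: (n, gpu_time in ms, reported Gflops/s)
    rows = [(2048, 33.317, 480.229), (4064, 260.086, 480.699), (2048, 28.642, 558.623)]
    for n, ms, reported in rows:
        print(n, round(sgemm_gflops(n, ms), 3), reported)
```

    If giga were taken as 10^9 instead, the same timings would give figures about 7% higher, so keep that in mind when comparing against other benchmarks.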

    It's a cubin, so it will not work on Fermi.
    HD 5850 at stock clocks:

    flopspeak.exe
    Device            0
    target            8
    localRAM          1024 MB
    uncachedRemoteRAM 2047 MB
    cachedRemoteRAM   2047 MB
    engineClock       725 MHz
    memoryClock       1000 MHz
    wavefrontSize     64
    numberOfSIMD      18
    doublePrecision   1
    localDataShare    1
    globalDataShare   1
    globalGPR         1
    computeShader     1
    memExport         1
    pitch_alignment   256
    surface_alignment 4096
    Device 0: execution time 7913.45 ms, achieved 2041.80 gflops
    Overclocked to 950 MHz:

    flopspeak.exe

    engineClock       950 MHz
    memoryClock       1000 MHz

    Device 0: execution time 6039.35 ms, achieved 2675.40 gflops
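    As a sanity check on the flopspeak numbers, the theoretical single-precision peak can be estimated from numberOfSIMD and engineClock. The sketch assumes 80 ALUs per SIMD (16 five-wide VLIW units) and 2 flops per ALU per clock for a multiply-add; neither figure is in the flopspeak output, they are my assumptions for this GPU generation. The function name is made up.

```python
def peak_sp_gflops(num_simd: int, engine_clock_mhz: float,
                   alus_per_simd: int = 80, flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS (10^9 flops/s) under the stated assumptions."""
    return num_simd * alus_per_simd * flops_per_alu_per_clock * engine_clock_mhz / 1000.0


if __name__ == "__main__":
    # (engine clock in MHz, achieved gflops reported by flopspeak above)
    for clock_mhz, achieved in [(725, 2041.80), (950, 2675.40)]:
        peak = peak_sp_gflops(18, clock_mhz)
        print(clock_mhz, peak, round(achieved / peak, 3))
```

    Under these assumptions both runs land at roughly 98% of peak, and the achieved rate scales almost exactly with the engine clock.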



    matmul.exe 2048 2048 100

    Device 0: execution time 1415.08 ms, achieved 1214.06 gflops
    Overclocked to 950 MHz:
    Device 0: execution time 1114.06 ms, achieved 1542.09 gflops

    UPDATE 1:
    Nvidia and ATI working together!
    opencl.dll from ati sdk 2.01

    Found 2 platform(s).
    platform[01104BA0]: profile: FULL_PROFILE
    platform[01104BA0]: version: OpenCL 1.0 CUDA 3.0.1
    platform[01104BA0]: name: NVIDIA CUDA
    platform[01104BA0]: vendor: NVIDIA Corporation
    platform[01104BA0]: extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing
    cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
    cl_nv_device_attribute_query cl_nv_pragma_unroll
    platform[01104BA0]: Found 1 device(s).
            device[01104C08]: NAME: GeForce GTX 275
            device[01104C08]: VENDOR: NVIDIA Corporation
            device[01104C08]: PROFILE: FULL_PROFILE
            device[01104C08]: VERSION: OpenCL 1.0 CUDA
            device[01104C08]: EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing
            cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options
            cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics
            cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
            cl_khr_local_int32_extended_atomics cl_khr_fp64
            device[01104C08]: DRIVER_VERSION: 196.75

            device[01104C08]: Type: GPU
            device[01104C08]: EXECUTION_CAPABILITIES: Kernel
            device[01104C08]: GLOBAL_MEM_CACHE_TYPE: None (0)
            device[01104C08]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
            device[01104C08]: SINGLE_FP_CONFIG: 0x3e
            device[01104C08]: QUEUE_PROPERTIES: 0x3

            device[01104C08]: VENDOR_ID: 4318
            device[01104C08]: MAX_COMPUTE_UNITS: 30
            device[01104C08]: MAX_WORK_ITEM_DIMENSIONS: 3
            device[01104C08]: MAX_WORK_GROUP_SIZE: 512
            device[01104C08]: PREFERRED_VECTOR_WIDTH_CHAR: 1
            device[01104C08]: PREFERRED_VECTOR_WIDTH_SHORT: 1
            device[01104C08]: PREFERRED_VECTOR_WIDTH_INT: 1
            device[01104C08]: PREFERRED_VECTOR_WIDTH_LONG: 1
            device[01104C08]: PREFERRED_VECTOR_WIDTH_FLOAT: 1
            device[01104C08]: PREFERRED_VECTOR_WIDTH_DOUBLE: 1
            device[01104C08]: MAX_CLOCK_FREQUENCY: 1404
            device[01104C08]: ADDRESS_BITS: 32
            device[01104C08]: MAX_MEM_ALLOC_SIZE: 229998592
            device[01104C08]: IMAGE_SUPPORT: 1
            device[01104C08]: MAX_READ_IMAGE_ARGS: 128
            device[01104C08]: MAX_WRITE_IMAGE_ARGS: 8
            device[01104C08]: IMAGE2D_MAX_WIDTH: 8192
            device[01104C08]: IMAGE2D_MAX_HEIGHT: 8192
            device[01104C08]: IMAGE3D_MAX_WIDTH: 2048
            device[01104C08]: IMAGE3D_MAX_HEIGHT: 2048
            device[01104C08]: IMAGE3D_MAX_DEPTH: 2048
            device[01104C08]: MAX_SAMPLERS: 16
            device[01104C08]: MAX_PARAMETER_SIZE: 4352
            device[01104C08]: MEM_BASE_ADDR_ALIGN: 256
            device[01104C08]: MIN_DATA_TYPE_ALIGN_SIZE: 16
            device[01104C08]: GLOBAL_MEM_CACHELINE_SIZE: 0
            device[01104C08]: GLOBAL_MEM_CACHE_SIZE: 0
            device[01104C08]: GLOBAL_MEM_SIZE: 919994368
            device[01104C08]: MAX_CONSTANT_BUFFER_SIZE: 65536
            device[01104C08]: MAX_CONSTANT_ARGS: 9
            device[01104C08]: LOCAL_MEM_SIZE: 16384
            device[01104C08]: ERROR_CORRECTION_SUPPORT: 0
            device[01104C08]: PROFILING_TIMER_RESOLUTION: 1000
            device[01104C08]: ENDIAN_LITTLE: 1
            device[01104C08]: AVAILABLE: 1
            device[01104C08]: COMPILER_AVAILABLE: 1
    platform[0313A434]: profile: FULL_PROFILE
    platform[0313A434]: version: OpenCL 1.0 ATI-Stream-v2.0.1
    platform[0313A434]: name: ATI Stream
    platform[0313A434]: vendor: Advanced Micro Devices, Inc.
    platform[0313A434]: extensions: cl_khr_icd
    platform[0313A434]: Found 2 device(s).
            device[0338CA70]: NAME: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
            device[0338CA70]: VENDOR: GenuineIntel
            device[0338CA70]: PROFILE: FULL_PROFILE
            device[0338CA70]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
            device[0338CA70]: EXTENSIONS: cl_khr_icd cl_khr_global_int32_base_atomics
            cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
            cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
            device[0338CA70]: DRIVER_VERSION: 1.0

            device[0338CA70]: Type: CPU
            device[0338CA70]: EXECUTION_CAPABILITIES: Kernel
            device[0338CA70]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
            device[0338CA70]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
            device[0338CA70]: SINGLE_FP_CONFIG: 0x7
            device[0338CA70]: QUEUE_PROPERTIES: 0x2

            device[0338CA70]: VENDOR_ID: 4098
            device[0338CA70]: MAX_COMPUTE_UNITS: 8
            device[0338CA70]: MAX_WORK_ITEM_DIMENSIONS: 3
            device[0338CA70]: MAX_WORK_GROUP_SIZE: 1024
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_CHAR: 16
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_SHORT: 8
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_INT: 4
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_LONG: 2
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
            device[0338CA70]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
            device[0338CA70]: MAX_CLOCK_FREQUENCY: 2698
            device[0338CA70]: ADDRESS_BITS: 32
            device[0338CA70]: MAX_MEM_ALLOC_SIZE: 536870912
            device[0338CA70]: IMAGE_SUPPORT: 0
            device[0338CA70]: MAX_READ_IMAGE_ARGS: 0
            device[0338CA70]: MAX_WRITE_IMAGE_ARGS: 0
            device[0338CA70]: IMAGE2D_MAX_WIDTH: 0
            device[0338CA70]: IMAGE2D_MAX_HEIGHT: 0
            device[0338CA70]: IMAGE3D_MAX_WIDTH: 0
            device[0338CA70]: IMAGE3D_MAX_HEIGHT: 0
            device[0338CA70]: IMAGE3D_MAX_DEPTH: 0
            device[0338CA70]: MAX_SAMPLERS: 0
            device[0338CA70]: MAX_PARAMETER_SIZE: 4096
            device[0338CA70]: MEM_BASE_ADDR_ALIGN: 32768
            device[0338CA70]: MIN_DATA_TYPE_ALIGN_SIZE: 128
            device[0338CA70]: GLOBAL_MEM_CACHELINE_SIZE: 64
            device[0338CA70]: GLOBAL_MEM_CACHE_SIZE: 65536
            device[0338CA70]: GLOBAL_MEM_SIZE: 1073741824
            device[0338CA70]: MAX_CONSTANT_BUFFER_SIZE: 65536
            device[0338CA70]: MAX_CONSTANT_ARGS: 8
            device[0338CA70]: LOCAL_MEM_SIZE: 32768
            device[0338CA70]: ERROR_CORRECTION_SUPPORT: 0
            device[0338CA70]: PROFILING_TIMER_RESOLUTION: 1
            device[0338CA70]: ENDIAN_LITTLE: 1
            device[0338CA70]: AVAILABLE: 1
            device[0338CA70]: COMPILER_AVAILABLE: 1
            device[04A30050]: NAME: Cypress
            device[04A30050]: VENDOR: Advanced Micro Devices, Inc.
            device[04A30050]: PROFILE: FULL_PROFILE
            device[04A30050]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
            device[04A30050]: EXTENSIONS: cl_khr_global_int32_base_atomics
            cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
            cl_khr_local_int32_extended_atomics
            device[04A30050]: DRIVER_VERSION: CAL 1.4.556

            device[04A30050]: Type: GPU
            device[04A30050]: EXECUTION_CAPABILITIES: Kernel
            device[04A30050]: GLOBAL_MEM_CACHE_TYPE: None (0)
            device[04A30050]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
            device[04A30050]: SINGLE_FP_CONFIG: 0x6
            device[04A30050]: QUEUE_PROPERTIES: 0x2

            device[04A30050]: VENDOR_ID: 4098
            device[04A30050]: MAX_COMPUTE_UNITS: 18
            device[04A30050]: MAX_WORK_ITEM_DIMENSIONS: 3
            device[04A30050]: MAX_WORK_GROUP_SIZE: 256
            device[04A30050]: PREFERRED_VECTOR_WIDTH_CHAR: 16
            device[04A30050]: PREFERRED_VECTOR_WIDTH_SHORT: 8
            device[04A30050]: PREFERRED_VECTOR_WIDTH_INT: 4
            device[04A30050]: PREFERRED_VECTOR_WIDTH_LONG: 2
            device[04A30050]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
            device[04A30050]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
            device[04A30050]: MAX_CLOCK_FREQUENCY: 725
            device[04A30050]: ADDRESS_BITS: 32
            device[04A30050]: MAX_MEM_ALLOC_SIZE: 268435456
            device[04A30050]: IMAGE_SUPPORT: 0
            device[04A30050]: MAX_READ_IMAGE_ARGS: 0
            device[04A30050]: MAX_WRITE_IMAGE_ARGS: 0
            device[04A30050]: IMAGE2D_MAX_WIDTH: 0
            device[04A30050]: IMAGE2D_MAX_HEIGHT: 0
            device[04A30050]: IMAGE3D_MAX_WIDTH: 0
            device[04A30050]: IMAGE3D_MAX_HEIGHT: 0
            device[04A30050]: IMAGE3D_MAX_DEPTH: 0
            device[04A30050]: MAX_SAMPLERS: 0
            device[04A30050]: MAX_PARAMETER_SIZE: 1024
            device[04A30050]: MEM_BASE_ADDR_ALIGN: 4096
            device[04A30050]: MIN_DATA_TYPE_ALIGN_SIZE: 128
            device[04A30050]: GLOBAL_MEM_CACHELINE_SIZE: 0
            device[04A30050]: GLOBAL_MEM_CACHE_SIZE: 0
            device[04A30050]: GLOBAL_MEM_SIZE: 268435456
            device[04A30050]: MAX_CONSTANT_BUFFER_SIZE: 65536
            device[04A30050]: MAX_CONSTANT_ARGS: 8
            device[04A30050]: LOCAL_MEM_SIZE: 32768
            device[04A30050]: ERROR_CORRECTION_SUPPORT: 0
            device[04A30050]: PROFILING_TIMER_RESOLUTION: 1
            device[04A30050]: ENDIAN_LITTLE: 1
            device[04A30050]: AVAILABLE: 1
            device[04A30050]: COMPILER_AVAILABLE: 1
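    The memory figures in the dump are easier to compare in MiB. The small Python sketch below converts GLOBAL_MEM_SIZE and MAX_MEM_ALLOC_SIZE for the three devices listed and prints the ratio between them; the values are copied from the dump above, and the helper name is made up.

```python
MIB = 1024 * 1024

# (GLOBAL_MEM_SIZE, MAX_MEM_ALLOC_SIZE) in bytes, copied from the dump above
DEVICES = {
    "GeForce GTX 275": (919994368, 229998592),
    "Core i7 920 (CPU device)": (1073741824, 536870912),
    "Cypress": (268435456, 268435456),
}


def summarize(global_mem: int, max_alloc: int):
    """Return (global MiB, max single allocation MiB, max_alloc / global ratio)."""
    return global_mem / MIB, max_alloc / MIB, max_alloc / global_mem


if __name__ == "__main__":
    for name, (global_mem, max_alloc) in DEVICES.items():
        print(name, summarize(global_mem, max_alloc))
```

    On the GTX 275 a single allocation is capped at exactly one quarter of global memory. The Cypress device reports only 256 MiB of global memory to OpenCL even though flopspeak shows 1024 MB of localRAM on the same card, which is the same kind of limit as the HD4870 question in the forum thread linked earlier.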
    UPDATE 2:
    DX formats included in optd3d

    GPGPU Image support!

    Posted on 08:44 by Unknown
    1. D3D
    In the docs there is a table, "Hardware Support for Direct3D 11 Formats":

    Format(DXGI_FORMAT_*) # Bits Format Target
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
    UNKNOWN 0 X X X
      R32G32B32A32_TYPELESS 128 X X X X X X X
        R32G32B32A32_FLOAT 128 X X X X X X X X X X X X X X X X X X o o X X X
        R32G32B32A32_UINT 128 X X X X X X X X X X X X X X o o X X
        R32G32B32A32_SINT 128 X X X X X X X X X X X X X X o o X X
      R32G32B32_TYPELESS 96 X X X X X X X
        R32G32B32_FLOAT 96 X X X X X X X X o o X o o o1 X X X o X X X
        R32G32B32_UINT 96 X X X X X X X X X o X X X o X X
        R32G32B32_SINT 96 X X X X X X X X X o X X X o X X
      R16G16B16A16_TYPELESS 64 X X X X X X X
        R16G16B16A16_FLOAT 64 X X X X X X X X X X X X X X X X X X o X X X X
        R16G16B16A16_UNORM 64 X X X X X X X X X X X X X X X X X X o X X X
        R16G16B16A16_UINT 64 X X X X X X X X X X X X X X o X X
        R16G16B16A16_SNORM 64 X X X X X X X X X X X X X X X X X X o X X X
        R16G16B16A16_SINT 64 X X X X X X X X X X X X X X o X X
      R32G32_TYPELESS 64 X X X X X X X
        R32G32_FLOAT 64 X X X X X X X X X X X X X X X X X X X o X X X
        R32G32_UINT 64 X X X X X X X X X X X X X X X o X X
        R32G32_SINT 64 X X X X X X X X X X X X X X X o X X
      R32G8X24_TYPELESS 64 X X X X X X
        D32_FLOAT_S8X24_UINT 64 X X X X X X X X o X
        R32_FLOAT_X8X24_TYPELESS 64 X X X X X X X X X X X X
        X32_TYPELESS_G8X24_UINT 64 X X X X X X X X
      R10G10B10A2_TYPELESS 32 X X X X X X X
        R10G10B10A2_UNORM 32 X X X X X X X X X X X X X X X X X X o X X X X
        R10G10B10A2_UINT 32 X X X X X X X X X X X X X X o X X
        R10G10B10_XR_BIAS_A2_UNORM 32 X X X X
      R11G11B10_FLOAT 32 X X X X X X X X X X X X X X X X X X o X X
      R8G8B8A8_TYPELESS 32 X X X X X X X
        R8G8B8A8_UNORM 32 X X X X X X X X X X X X X X X X X X o X X X X
        R8G8B8A8_UNORM_SRGB 32 X X X X X X X X X X X X X X o X X X X
        R8G8B8A8_UINT 32 X X X X X X X X X X X X X X o X X
        R8G8B8A8_SNORM 32 X X X X X X X X X X X X X X X X X X o X X X
        R8G8B8A8_SINT 32 X X X X X X X X X X X X X X o X X
      R16G16_TYPELESS 32 X X X X X X X
        R16G16_FLOAT 32 X X X X X X X X X X X X X X X X X X o X X X
        R16G16_UNORM 32 X X X X X X X X X X X X X X X X X X o X X X
        R16G16_UINT 32 X X X X X X X X X X X X X X o X X
        R16G16_SNORM 32 X X X X X X X X X X X X X X X X X X o X X X
        R16G16_SINT 32 X X X X X X X X X X X X X X o X X
      R32_TYPELESS 32 X X X X X X X X
        D32_FLOAT 32 X X X X X X X X o X
        R32_FLOAT 32 X X X X X X X X X X X X X X X X X X X X X X X o X X X
        R32_UINT 32 X X X X X X X X X X X X X X X X X X X X X X X o X X
        R32_SINT 32 X X X X X X X X X X X X X X X X X X X X X X o X X
      R24G8_TYPELESS 32 X X X X X X
        D24_UNORM_S8_UINT 32 X X X X X X X X o X
        R24_UNORM_X8_TYPELESS 32 X X X X X X X X X X X X
        X24_TYPELESS_G8_UINT 32 X X X X X X X X
      R8G8_TYPELESS 16 X X X X X X X
        R8G8_UNORM 16 X X X X X X X X X X X X X X X X X X o X X X
        R8G8_UINT 16 X X X X X X X X X X X X X X o X X
        R8G8_SNORM 16 X X X X X X X X X X X X X X X X X X o X X X
        R8G8_SINT 16 X X X X X X X X X X X X X X o X X
      R16_TYPELESS 16 X X X X X X X
        R16_FLOAT 16 X X X X X X X X X X X X X X X X X X o X X X
        D16_UNORM 16 X X X X X X X X o X
        R16_UNORM 16 X X X X X X X X X X X X X X X X X X X X o X X X
        R16_UINT 16 X X X X X X X X X X X X X X X o X X
        R16_SNORM 16 X X X X X X X X X X X X X X X X X X o X X X
        R16_SINT 16 X X X X X X X X X X X X X X o X X
      R8_TYPELESS 8 X X X X X X X
        R8_UNORM 8 X X X X X X X X X X X X X X X X X X o X X X
        R8_UINT 8 X X X X X X X X X X X X X X o X X
        R8_SNORM 8 X X X X X X X X X X X X X X X X X X o X X X
        R8_SINT 8 X X X X X X X X X X X X X X o X X
      A8_UNORM 8 X X X X X X X X X X X X X X X X o X X
      R9G9B9E5_SHAREDEXP 32 X X X X X X X X X
      R8G8_B8G8_UNORM 16 X X X X X X X X X
      G8R8_G8B8_UNORM 16 X X X X X X X X X
      BC1_TYPELESS 4 X X X X X X
        BC1_UNORM 4 X X X X X X X X X
        BC1_UNORM_SRGB 4 X X X X X X X X X
      BC2_TYPELESS 8 X X X X X X
        BC2_UNORM 8 X X X X X X X X X
        BC2_UNORM_SRGB 8 X X X X X X X X X
      BC3_TYPELESS 8 X X X X X X
        BC3_UNORM 8 X X X X X X X X X
        BC3_UNORM_SRGB 8 X X X X X X X X X
      BC4_TYPELESS 4 X X X X X X
        BC4_UNORM 4 X X X X X X X X X
        BC4_SNORM 4 X X X X X X X X X
      BC5_TYPELESS 8 X X X X X X
        BC5_UNORM 8 X X X X X X X X X
        BC5_SNORM 8 X X X X X X X X X
      B8G8R8A8_TYPELESS 32 X X X X X X X
        B8G8R8A8_UNORM 32 X X X X X X X X X X X X X X X X o X X X X
        B8G8R8A8_UNORM_SRGB 32 X X X X X X X X X X X X X X o X X X X
      B8G8R8X8_TYPELESS 32 X X X X X X X
        B8G8R8X8_UNORM 32 X X X X X X X X X X X X X X X X o X X X
        B8G8R8X8_UNORM_SRGB 32 X X X X X X X X X X X X X X o X X X
      BC6H_TYPELESS 8 X X X X X X
        BC6H_UF16 8 X X X X X X X X X
        BC6H_SF16 8 X X X X X X X X X
      BC7_TYPELESS 8 X X X X X X
        BC7_UNORM 8 X X X X X X X X X
        BC7_UNORM_SRGB 8 X X X X X X X X X


    Targets (the numbered columns in the table above):
    1. Buffer
    2. Input Assembler Vertex Buffer
    3. Input Assembler Index Buffer
    4. Stream Output Buffer
    5. Texture1D
    6. Texture2D
    7. Texture3D
    8. TextureCube
    9. Shader ld
    10. Shader sample (any filter)
    11. Shader sample_c (comparison filter)
    12. Shader sample (mono 1-bit filter)
    13. Shader gather4
    14. Shader gather4_c
    15. Mipmap
    16. Mipmap Auto-Generation
    17. RenderTarget
    18. Blendable RenderTarget
    19. Depth/Stencil Target
    20. Raw UAV and SRV
    21. Structured UAV and SRV
    22. Typed UAV
    23. UAV Typed Store
    24. UAV Typed Load
    25. UAV Atomic Add
    26. UAV Atomic Bitwise Ops
    27. UAV Atomic Cmp Store or Cmp Exch
    28. UAV Atomic Exchange
    29. UAV Atomic Signed Min or Max
    30. UAV Atomic Unsigned Min or Max
    31. CPU Lockable
    32. 4x Multisample RenderTarget
    33. 8x Multisample RenderTarget
    34. Other Multisample Count RT
    35. Multisample Resolve
    36. Multisample Load
    37. Display Scan-Out
    38. Cast Within Bit Layout 
    The API for querying supported formats is ID3D11Device::CheckFormatSupport.

    It would be good to write a program that checks the formats supported by AMD and Nvidia.
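    CheckFormatSupport hands back a bitmask, so most of such a checker is just decoding bits into the target names listed above. The real call has to be made from C++ on Windows; the Python sketch below only shows the decoding step. The flag values are the first eight of D3D11_FORMAT_SUPPORT as I recall them from d3d11.h, so verify them against the header. The function name and the sample mask are made up.

```python
# (bit value, target name) pairs; values assumed from D3D11_FORMAT_SUPPORT, first eight only
FORMAT_SUPPORT_FLAGS = [
    (0x01, "Buffer"),
    (0x02, "Input Assembler Vertex Buffer"),
    (0x04, "Input Assembler Index Buffer"),
    (0x08, "Stream Output Buffer"),
    (0x10, "Texture1D"),
    (0x20, "Texture2D"),
    (0x40, "Texture3D"),
    (0x80, "TextureCube"),
]


def decode_format_support(mask: int) -> list:
    """Return the names of the known targets whose bit is set in mask."""
    return [name for bit, name in FORMAT_SUPPORT_FLAGS if mask & bit]


if __name__ == "__main__":
    sample_mask = 0x21  # made-up example value, not a real query result
    print(decode_format_support(sample_mask))
```

    Bits outside the listed eight are silently ignored here; a full tool would carry the whole enumeration and loop over every DXGI_FORMAT value.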


    In CUDA 3.0 you have:

    1, 2 or 4 components of:
    *Signed or unsigned 8-, 16- or 32-bit integers (18 formats)
    *16-bit floats (currently only supported through the driver API), or 32-bit floats (6 formats)
    24 texture formats in total


    For CUDA-GL interop (from forums):

    works for FP textures (GL_XXXXYYF):
    XXXX = R, RG, RGB or RGBA
    YY = 16 or 32
    i.e. 8 FP formats
    works for integer textures (GL_XXXXYYZZ):
    XXXX = R, RG, RGB or RGBA
    YY = 8, 16 or 32
    ZZ = I or UI
    i.e. 24 integer formats
    Depth renderbuffers don't work. I don't know whether color renderbuffers work; I assume yes, at least for CUDA 3.0 final.
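    The format counts above can be checked by spelling out the names. This Python sketch builds the GL internal format names from the XXXX/YY/ZZ patterns, assuming the second channel layout on the integer side is RG (GL has no G-only internal format). The function names are made up, and the list says nothing about which formats a given driver actually accepts.

```python
from itertools import product

CHANNELS = ["R", "RG", "RGB", "RGBA"]


def float_formats() -> list:
    """GL_XXXXYYF names: 4 channel layouts x 2 bit depths."""
    return [f"GL_{c}{bits}F" for c, bits in product(CHANNELS, (16, 32))]


def integer_formats() -> list:
    """GL_XXXXYYZZ names: 4 channel layouts x 3 bit depths x signed/unsigned."""
    return [f"GL_{c}{bits}{kind}"
            for c, bits, kind in product(CHANNELS, (8, 16, 32), ("I", "UI"))]


if __name__ == "__main__":
    print(len(float_formats()), float_formats())
    print(len(integer_formats()), integer_formats())
```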

    use:
    glGenTextures(1,&tex);
    glBindTexture(GL_TEXTURE_2D , tex);
    glTexImage2D(GL_TEXTURE_2D , 0 , GL_XXXXYYF , width , height , 0 , GL_RGBA , GL_FLOAT , 0);
    glTexParameteri(GL_TEXTURE_2D , GL_TEXTURE_MIN_FILTER , GL_NEAREST);
    cudaGraphicsGLRegisterImage (&resource , tex , GL_TEXTURE_2D , cudaGraphicsMapFlagsNone);
    For integer textures, change the glTexImage2D call to:

    glTexImage2D(GL_TEXTURE_2D , 0 , GL_XXXXYYZZ , width , height , 0 , GL_RGBA_INTEGER , GL_UNSIGNED_BYTE , 0);
    notes:
    Notice that it is important to set the minification filter to GL_NEAREST.
    In conclusion, it looks like the cudaGraphicsGL interface is working for most formats, excluding normalized internal formats such as the commonly used GL_RGBA8 format.

    CUDA-GL interop allows RGB textures, although according to the documentation CUDA itself does not seem to support 3-component formats!


    OCL DX interop for Nvidia:

    ------------------------------------------------------------------
        DXGI Format                      cl_channel_order  cl_channel_type  
        ------------------------------   ----------------  ---------------
        DXGI_FORMAT_R32G32B32A32_FLOAT   CL_RGBA           CL_FLOAT
        DXGI_FORMAT_R32G32B32A32_UINT    CL_RGBA           CL_UNSIGNED_INT32
        DXGI_FORMAT_R32G32B32A32_SINT    CL_RGBA           CL_SIGNED_INT32
        DXGI_FORMAT_R16G16B16A16_FLOAT   CL_RGBA           CL_HALF_FLOAT
        DXGI_FORMAT_R16G16B16A16_UNORM   CL_RGBA           CL_UNORM_INT16
        DXGI_FORMAT_R16G16B16A16_UINT    CL_RGBA           CL_UNSIGNED_INT16
        DXGI_FORMAT_R16G16B16A16_SNORM   CL_RGBA           CL_SNORM_INT16
        DXGI_FORMAT_R16G16B16A16_SINT    CL_RGBA           CL_SIGNED_INT16
        DXGI_FORMAT_R8G8B8A8_UNORM       CL_RGBA           CL_UNORM_INT8
        DXGI_FORMAT_R8G8B8A8_UINT        CL_RGBA           CL_UNSIGNED_INT8
        DXGI_FORMAT_R8G8B8A8_SNORM       CL_RGBA           CL_SNORM_INT8
        DXGI_FORMAT_R8G8B8A8_SINT        CL_RGBA           CL_SIGNED_INT8
        DXGI_FORMAT_R32G32_FLOAT         CL_RG             CL_FLOAT
        DXGI_FORMAT_R32G32_UINT          CL_RG             CL_UNSIGNED_INT32
        DXGI_FORMAT_R32G32_SINT          CL_RG             CL_SIGNED_INT32
        DXGI_FORMAT_R16G16_FLOAT         CL_RG             CL_HALF_FLOAT
        DXGI_FORMAT_R16G16_UNORM         CL_RG             CL_UNORM_INT16
        DXGI_FORMAT_R16G16_UINT          CL_RG             CL_UNSIGNED_INT16
        DXGI_FORMAT_R16G16_SNORM         CL_RG             CL_SNORM_INT16
        DXGI_FORMAT_R16G16_SINT          CL_RG             CL_SIGNED_INT16
        DXGI_FORMAT_R8G8_UNORM           CL_RG             CL_UNORM_INT8
        DXGI_FORMAT_R8G8_UINT            CL_RG             CL_UNSIGNED_INT8
        DXGI_FORMAT_R8G8_SNORM           CL_RG             CL_SNORM_INT8
        DXGI_FORMAT_R8G8_SINT            CL_RG             CL_SIGNED_INT8
        DXGI_FORMAT_R32_FLOAT            CL_R              CL_FLOAT
        DXGI_FORMAT_R32_UINT             CL_R              CL_UNSIGNED_INT32
        DXGI_FORMAT_R32_SINT             CL_R              CL_SIGNED_INT32
        DXGI_FORMAT_R16_FLOAT            CL_R              CL_HALF_FLOAT
        DXGI_FORMAT_R16_UNORM            CL_R              CL_UNORM_INT16
        DXGI_FORMAT_R16_UINT             CL_R              CL_UNSIGNED_INT16
        DXGI_FORMAT_R16_SNORM            CL_R              CL_SNORM_INT16
        DXGI_FORMAT_R16_SINT             CL_R              CL_SIGNED_INT16
        DXGI_FORMAT_R8_UNORM             CL_R              CL_UNORM_INT8
        DXGI_FORMAT_R8_UINT              CL_R              CL_UNSIGNED_INT8
        DXGI_FORMAT_R8_SNORM             CL_R              CL_SNORM_INT8
        DXGI_FORMAT_R8_SINT              CL_R              CL_SIGNED_INT8


    OCL supported image formats: query them with clGetSupportedImageFormats.
    My program uses this function:

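    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Print every image format the context supports for the given memory
       flags (e.g. CL_MEM_READ_ONLY) and image type (2D or 3D). */
    void getimageinfo(cl_context context, cl_mem_flags m, cl_mem_object_type te)
    {
        cl_uint num_entries = 0;   /* the API counts formats as cl_uint, not size_t */
        cl_image_format *image_formats;
        /* first call: ask only for the number of formats */
        cl_int status = clGetSupportedImageFormats(context, m, te, 0, NULL, &num_entries);
        if (status == CL_SUCCESS && num_entries > 0)
        {
            image_formats = (cl_image_format*)malloc(num_entries * sizeof(cl_image_format));
            if (image_formats == NULL)
                return;
            /* second call: fetch the formats themselves */
            status = clGetSupportedImageFormats(context, m, te, num_entries, image_formats, NULL);
            if (status == CL_SUCCESS)
            {
                cl_uint i;
                int j, o, t;
                cl_channel_order orders[] = {CL_R, CL_A, CL_INTENSITY, CL_LUMINANCE, CL_RG, CL_RA, CL_RGB, CL_RGBA, CL_ARGB, CL_BGRA};
                const char *order_names[] = {"CL_R", "CL_A", "CL_INTENSITY", "CL_LUMINANCE", "CL_RG", "CL_RA", "CL_RGB", "CL_RGBA", "CL_ARGB", "CL_BGRA"};
                cl_channel_type types[] = {
                    CL_SNORM_INT8, CL_SNORM_INT16, CL_UNORM_INT8, CL_UNORM_INT16, CL_UNORM_SHORT_565, CL_UNORM_SHORT_555, CL_UNORM_INT_101010, CL_SIGNED_INT8,
                    CL_SIGNED_INT16, CL_SIGNED_INT32, CL_UNSIGNED_INT8, CL_UNSIGNED_INT16, CL_UNSIGNED_INT32, CL_HALF_FLOAT, CL_FLOAT};
                const char *type_names[] = {"CL_SNORM_INT8", "CL_SNORM_INT16", "CL_UNORM_INT8", "CL_UNORM_INT16", "CL_UNORM_SHORT_565", "CL_UNORM_SHORT_555", "CL_UNORM_INT_101010",
                    "CL_SIGNED_INT8", "CL_SIGNED_INT16", "CL_SIGNED_INT32", "CL_UNSIGNED_INT8", "CL_UNSIGNED_INT16", "CL_UNSIGNED_INT32", "CL_HALF_FLOAT", "CL_FLOAT"};
                const int num_orders = sizeof(orders) / sizeof(orders[0]);
                const int num_types = sizeof(types) / sizeof(types[0]);
                for (i = 0; i < num_entries; i++)
                {
                    o = t = -1;   /* -1 marks a value missing from the tables above */
                    for (j = 0; j < num_orders; j++)
                    {
                        if (image_formats[i].image_channel_order == orders[j])
                            o = j;
                    }
                    for (j = 0; j < num_types; j++)
                    {
                        if (image_formats[i].image_channel_data_type == types[j])
                            t = j;
                    }
                    printf("Format %u: %s, %s\n", i,
                           o >= 0 ? order_names[o] : "unknown order",
                           t >= 0 ? type_names[t] : "unknown type");
                }
            }
            free(image_formats);
        }
    }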

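    AMD and Nvidia each return the same list for every combination of arguments: cl_mem_flags set to read-only or write-only, and cl_mem_object_type set to a 2D or 3D image. Perhaps the 3D write-only query should report 0 formats, since writing to 3D images needs the cl_khr_3d_image_writes extension?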

    Nvidia:

    Format 0: CL_R, CL_FLOAT
    Format 1: CL_R, CL_HALF_FLOAT
    Format 2: CL_R, CL_UNORM_INT8
    Format 3: CL_R, CL_UNORM_INT16
    Format 4: CL_R, CL_SNORM_INT16
    Format 5: CL_R, CL_SIGNED_INT8
    Format 6: CL_R, CL_SIGNED_INT16
    Format 7: CL_R, CL_SIGNED_INT32
    Format 8: CL_R, CL_UNSIGNED_INT8
    Format 9: CL_R, CL_UNSIGNED_INT16
    Format 10: CL_R, CL_UNSIGNED_INT32
    Format 11: CL_A, CL_FLOAT
    Format 12: CL_A, CL_HALF_FLOAT
    Format 13: CL_A, CL_UNORM_INT8
    Format 14: CL_A, CL_UNORM_INT16
    Format 15: CL_A, CL_SNORM_INT16
    Format 16: CL_A, CL_SIGNED_INT8
    Format 17: CL_A, CL_SIGNED_INT16
    Format 18: CL_A, CL_SIGNED_INT32
    Format 19: CL_A, CL_UNSIGNED_INT8
    Format 20: CL_A, CL_UNSIGNED_INT16
    Format 21: CL_A, CL_UNSIGNED_INT32
    Format 22: CL_RG, CL_FLOAT
    Format 23: CL_RG, CL_HALF_FLOAT
    Format 24: CL_RG, CL_UNORM_INT8
    Format 25: CL_RG, CL_UNORM_INT16
    Format 26: CL_RG, CL_SNORM_INT16
    Format 27: CL_RG, CL_SIGNED_INT8
    Format 28: CL_RG, CL_SIGNED_INT16
    Format 29: CL_RG, CL_SIGNED_INT32
    Format 30: CL_RG, CL_UNSIGNED_INT8
    Format 31: CL_RG, CL_UNSIGNED_INT16
    Format 32: CL_RG, CL_UNSIGNED_INT32
    Format 33: CL_RA, CL_FLOAT
    Format 34: CL_RA, CL_HALF_FLOAT
    Format 35: CL_RA, CL_UNORM_INT8
    Format 36: CL_RA, CL_UNORM_INT16
    Format 37: CL_RA, CL_SNORM_INT16
    Format 38: CL_RA, CL_SIGNED_INT8
    Format 39: CL_RA, CL_SIGNED_INT16
    Format 40: CL_RA, CL_SIGNED_INT32
    Format 41: CL_RA, CL_UNSIGNED_INT8
    Format 42: CL_RA, CL_UNSIGNED_INT16
    Format 43: CL_RA, CL_UNSIGNED_INT32
    Format 44: CL_RGBA, CL_FLOAT
    Format 45: CL_RGBA, CL_HALF_FLOAT
    Format 46: CL_RGBA, CL_UNORM_INT8
    Format 47: CL_RGBA, CL_UNORM_INT16
    Format 48: CL_RGBA, CL_SNORM_INT16
    Format 49: CL_RGBA, CL_SIGNED_INT8
    Format 50: CL_RGBA, CL_SIGNED_INT16
    Format 51: CL_RGBA, CL_SIGNED_INT32
    Format 52: CL_RGBA, CL_UNSIGNED_INT8
    Format 53: CL_RGBA, CL_UNSIGNED_INT16
    Format 54: CL_RGBA, CL_UNSIGNED_INT32
    Format 55: CL_BGRA, CL_UNORM_INT8
    Format 56: CL_BGRA, CL_SIGNED_INT8
    Format 57: CL_BGRA, CL_UNSIGNED_INT8
    Format 58: CL_ARGB, CL_UNORM_INT8
    Format 59: CL_ARGB, CL_SIGNED_INT8
    Format 60: CL_ARGB, CL_UNSIGNED_INT8
    Format 61: CL_INTENSITY, CL_FLOAT
    Format 62: CL_INTENSITY, CL_HALF_FLOAT
    Format 63: CL_INTENSITY, CL_UNORM_INT8
    Format 64: CL_INTENSITY, CL_UNORM_INT16
    Format 65: CL_INTENSITY, CL_SNORM_INT16
    Format 66: CL_LUMINANCE, CL_FLOAT
    Format 67: CL_LUMINANCE, CL_HALF_FLOAT
    Format 68: CL_LUMINANCE, CL_UNORM_INT8
    Format 69: CL_LUMINANCE, CL_UNORM_INT16
    Format 70: CL_LUMINANCE, CL_SNORM_INT16

    AMD:

    Format 0: CL_RGBA, CL_UNORM_INT8
    Format 1: CL_RGBA, CL_UNORM_INT16
    Format 2: CL_RGBA, CL_SIGNED_INT8
    Format 3: CL_RGBA, CL_SIGNED_INT16
    Format 4: CL_RGBA, CL_SIGNED_INT32
    Format 5: CL_RGBA, CL_UNSIGNED_INT8
    Format 6: CL_RGBA, CL_UNSIGNED_INT16
    Format 7: CL_RGBA, CL_UNSIGNED_INT32
    Format 8: CL_RGBA, CL_HALF_FLOAT
    Format 9: CL_RGBA, CL_FLOAT
    Format 10: CL_BGRA, CL_UNORM_INT8

    OCL-GL interop, I don't know:
    for Nvidia it is either the 71 formats above, the CUDA-GL supported formats, or the GL equivalents of the CUDA interop formats. I suspect the CL_RGB ones are supported.
    for AMD it is either the CL image formats above or the CAL DX interop ones. I suspect RGB formats.

    AMD CAL:

    CAL exposes textures, and CAL DX interop would be good to explore.
