March 2010 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Sunday, 21 March 2010

What's for CUDA 3.1 and OpenGL 3.3/4.1!

Posted on 12:37 by Unknown

Let's see what CUDA 3.0 final adds over the beta:

*full BLAS support
*OpenCL local atomics
*D3D9-11 interop for both OpenCL and CUDA
*guides updated since the beta
*still no PTX 1.5/2.0 specs
Also, the NVIDIA OpenCL extensions are now published: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/opencl_extensions/cl_nv_compiler_options.txt

Interesting notes.

*Float16 (half) textures are supported in the runtime

*CUBLAS is complete and IEEE 754 compliant on Fermi

*SGEMM performance on Fermi-based GPU is 30% lower than expected. It will be fixed in 3.1.

*The stability of the large-prime FFT transform (signals with a length that is prime and >64k samples) is extremely variable, giving single-precision accuracy in the range 0.005->0.025. In general, smaller signals experience greater accuracy.

*This package will work on Mac OS X running 32/64-bit.

* CUDA applications built in 32/64-bit (CUDA Driver API) are supported.

* CUDA applications built as 32-bit (CUDA Runtime API) are supported.

(10.5.x Leopard and 10.6 Snow Leopard)

Note: x86_64 is not currently working for Leopard or Snow Leopard

*CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.

* CUDA applications using CUDA Runtime APIs can only be built as 32-bit applications.

SDK Release 3.0 Final:

* Replaced 3dfd sample with FDTD3d (Finite Difference sample has been updated)

* Added support for Fermi Architecture (Compute 2.0 profile) to the SDK samples

* Updated Graphics/CUDA samples to use the new unified graphics interop

* Several samples with Device Emulation have been removed. Device Emulation is deprecated for CUDA 3.0, and will be removed with CUDA 3.1.

* Added new samples:

concurrentKernels (Fermi Capability)

* Bug Fixes

They have also added simpleMPI..
I have to test it with Intel MPI 4.0

Mac notes:
*cuda.dylib is 64-bit and exposes the 195 API; there are both 195 and 185 dylibs, versioned as 195_96 or 185_55
*has cuda-memcheck but no cuda-gdb
*the CUDA kext is a fat binary with a 64-bit slice, and so is cuda.dylib, so CUDA driver API applications can be compiled and run as 64-bit
*note you can also boot the 64-bit kernel thanks to the kext
*cudart is 32-bit only
So in theory we can write a cudart wrapper over the CUDA driver API and compile it as 64-bit, all the more now that cudart is stateless and interoperates with CUDA driver memory allocations.

All that would still be needed is 64-bit builds of CUBLAS and CUFFT.

We have source code for CUDPP, Thrust and CUSP, and in the meantime Volkov's matmul, FFT and LAPACK codes,

so all of these could be compiled as 64-bit if we had a 64-bit cudart, and we could see what happens.
Well, I have compiled deviceQueryDrv and matrixMulDrv
(am I the first in the world to have 64-bit CUDA Apple binaries, excepting NVIDIA?)
I got rid of cutil, though compiling it as 64-bit would be no problem. Some notes:
*nvcc on Mac defaults to 32 bits, while gcc defaults to 64 bits on Snow Leopard,
so to build 64-bit code you must pass -m64 to nvcc
*for CUDA driver API projects nvcc is not needed for the host code: you can use g++ for the CUDA driver API code and compile the CUDA
files to PTX with nvcc -ptx

If you use nvcc with -m64 you get 64-bit CPU code, but with -ptx do you also get PTX code using 64-bit pointers for Fermi?

You can still use 32-bit pointers on Fermi, and it is better to use 32-bit pointers.
So for matrixMulDrv I use nvcc -ptx for 32-bit pointers and g++ (-m64) for the host code,
but at cuModuleLoadDataEx I get this error:
CUDA_ERROR_POINTER_IS_64BIT = 800, ///< Attempted to retrieve 64-bit pointer via 32-bit API function
I get it when loading the PTX whether I build it with nvcc -m64 -ptx or with plain nvcc -ptx,
so PTX with 32- or 64-bit pointers doesn't change that.
I have to compare the PTX files generated with 32- and 64-bit pointers to see the differences, also with sm_20.
Also note that for nvcc -m64 to work, /usr/local/cuda/lib64 must exist even when it is not needed,
so I copied lib to lib64 (a symlink also works).
After that you can run it.
I have to write a tutorial on using CUDA and nvcc to build Mac OS fat binaries (i386 and x86_64).
*I see an nvcuvid library for Mac in the GPU Computing SDK, 32-bit only, under
/C/common/lib
and /C/common/inc/cuvid

Anyway, I have a 64-bit libcuvid (vs libnvcuvid) in /usr/local/cuda (where did I get it from?)

*there is also a preference pane control panel with autoupdate that shows the GPU driver version and the CUDA driver version

Note that the OpenCL samples on Mac won't work until 10.6.3.

It is good that NVIDIA documents the behavior OpenCL leaves undefined (implementation specific):

http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ImplementationNotes_3.0.txt

issues with Mac..

A perfect OpenGL 4.1/3.3 release would have:

*ext_direct_state_access

*ext_separate_shader_objects

*RW textures (3D also): ext_image_load_store

*binary shaders (GL ES 2.0 API)

In theory you can use some IR from the 3Dlabs frontend compiler source..

or also translate to HLSL via some translator (AMD hlsl2glsl?) and then use a binary HLSL shader..

Also a good translator:

http://code.google.com/p/angleproject/

It has a flex/bison GLSL parser and also a GLSL-to-HLSL translator (ES 2.0)..

Going from binary to DX IL is possible via:

fxc /dumpbin

but DX IL to binary? And how to go from DX IL to HLSL or GLSL directly?

I have also found that Wine more or less parses DXBC files..

/dlls/d3d10/effect.c

static HRESULT parse_shade

NV OGL extensions:

*Fermi function pointers and recursion for GLSL?

would be a good addition to the bindless extensions and shader buffer load..

CUDA 3.1:
*cuda-gdb OpenCL HW debugging support
*pinned GPU memory interop with MPI over InfiniBand (announced at SC09 for spring 2010)

*template for a DirectCompute project
Currently there is no template for a DirectCompute project, but NVIDIA will be
providing one soon.
*fix CUBLAS SGEMM perf on Fermi (the 30% shortfall)
*fix the CUFFT perf regression vs the 3.0 beta: it went from 180-190 GFLOPS to 150 GFLOPS
*provide an official cudaasm/decuda, or documentation about the cubin/ELF format for sm_20 devices? Also for sm_10?
*PTX 1.5 and 2.0 docs?
*updated OpenCL best practices guide for Fermi? The CUDA best practices guide is updated, but for Fermi?

*surface functions: RW textures with x,w addressing etc., also 3D image writes. The headers and exported functions were in the beta but were removed in the final release.

Also a CUDA-to-CPU compiler, or is GPU Ocelot mature enough, and are Mac and Windows ports available?
A direct PTX-to-CPU code converter would be good, using the GPU Ocelot library as cudart and CUDA API.

Mac

*add cuda-gdb (with ocl also) and OpenCL visual profiler

2 OpenCL examples don't work on Mac

cuda opengl slow mac

ship

is going to work with fermi cuda.kext

*Related: the first WHQL driver of the 195 series for Quadros (197) enables OpenCL on these devices:

Adds support for CUDA 3.0 for improved performance in GPU Computing applications. See CUDA for more details.
This driver resolves fan speed issues reported with version 196.75 drivers.
Adds support for the Open Computing Language (OpenCL) 1.0 in Quadro FX Series x700 and newer as well as the FX4600 and FX5600.

*Nvidia mentions a compute cluster driver, but it is still 196.28 and has not been updated since early February. Anyway, the D3D interop that was finally added is not needed there.
*According to Pierre Boudier, we can expect OGL 4.0 drivers soon, and also an image write and random access extension à la D3D11 RWTexture.
ubuntu 10.4 fglrx 8.72

fglrx-installer (2:8.721-0ubuntu1) lucid; urgency=low

* New upstream release:
- Restore compatibility with kernel 2.6.32 and xserver 1.7 (LP: #494699).
- Add Passive Stereo support on workstation (FireGL/FirePro) hardware.
- Add Eyefinity support (more than 2 monitors on Radeon HD 5xxx hardware).
Officially WS-only but should work on consumer boards as well.

GL_EXT_shader_subroutine GL_EXT_timer_query

Also, what about 3D stereo on Linux:
*3D Vision for OpenGL quad-buffered (QB) stereo on Quadros with a stereo connector is here.
*3DTV support for Linux, so OpenGL QB can be output over HDMI 1.4 on Linux? This could also work on low-profile Quadros, as the stereo connector is not needed (it is not really needed for 3D Vision either; it is Nvidia's way of artificially limiting it to super high-end Quadros, except perhaps for better sync).
Also, if they add H.264 MVC to VDPAU and you decrypt Blu-ray 3D with AnyDVD HD, you will in theory be able to watch it on Linux with GPU-accelerated decoding, sending it to TVs via HDMI 1.4.
Let's see also how Windows handles it, as DXVA 2.0 does not support MVC? Neither does cuvid, so let's see if they add it to cuvid too.
So it seems CyberLink and the rest will get some library from Nvidia, or what?
*ATI has hooks for D3D9, 10 (D3D11?) in 10.3, and fglrx 8.72 adds passive stereo for OGL QB (active stereo is already there, right? But for 120Hz LCDs also?)
Let's see also how ATI manages output to HDMI 1.4 TVs, via the iZ3D partnership or what? In fact I expect iZ3D only hooks D3D stereo and AMD will add HDMI 1.4 stereo output on top of these hooks, so an SDK or documentation of these hooks would be good.
It would also be good if Nvidia published the stereo SDK (promised at GDC 2010), and I hope these hooks (D3D9-11) will also work with 3DTV and output to HDMI 1.4 TVs. In fact they should, as Avatar and 3D Vision presumably use these hooks.
Mac is out of this scope..

also nvidia can be late with fermi but not with software supporting it..
now d3d11 is with cs5.0 here and also we have now d3d11 interop for cuda in 3.0 and d3d11 interop with opencl extension and also optix d3d11 interop..
We have d3d11 interop with:
*CUDA 3.0
*OpenCL
*Optix
HW debugging:
Nsight.
All that needs to be released is Nsight, which will also bring D3D11 support (HW debug and profile). It will be good to HW debug CUDA, D3D11 CS, and CUDA with D3D11 interop, and to trace OpenCL and OpenGL (will 4.0 be traced?)..

Also, will Cg 3.0 support D3D11? And also SM5.0 / OpenGL 4.0, i.e. tessellation shaders with GLSL output?
Note cgc 3.0 is shipping in the Tegra SDK and also as part of the OpenGL compiler in the Nvidia 195 drivers..
I have seen CgFX working with OptiX and CUDA in a blog, so I hope they ship an example soon..
http://lorachnroll.blogspot.com/2010/03/mixing-nvidia-technologies-thanks-to.html

GeForce GTX 480:
- GPU: GF100 @ 700MHz
- CUDA cores: 480 @ 1401MHz
- Memory: 1536MB GDDR5 @ 1848MHz 384-bit
- TDP: 250W
- Price: $499US
GeForce GTX 470:
- GPU: GF100 @ 607MHz
- CUDA cores: 448 @ 1215MHz
- Memory: 1280MB GDDR5 @ 1674MHz 320-bit
- TDP: 225W
- Price: $349US

- 3D APIs: OpenGL 4.0 and Direct3D 11
- GPU Computing: OpenCL, CUDA and DirectCompute
- 3-way SLI support

GeForce GTX 480 : 480 SP, 700/1401/1848MHz core/shader/mem, 384-bit, 1536MB, 250W TDP, US$499

GeForce GTX 470 : 448 SP, 607/1215/1674MHz core/shader/mem, 320-bit, 1280MB, 225W TDP, US$349
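
The clocks and bus widths above are enough for some back-of-the-envelope peak figures. The sketch below is only arithmetic on the listed specs, and it rests on two assumptions that are not in the spec lines themselves: each CUDA core retires one fused multiply-add (2 flops) per shader clock, and the listed GDDR5 memory clock moves data twice per cycle. The function names are made up for illustration.

```python
# Rough peak-rate arithmetic from the spec lines above.
# Assumptions (not stated in the post): 2 flops per CUDA core per shader
# clock, and 2 data transfers per listed memory clock.

def peak_sp_gflops(cuda_cores, shader_mhz):
    """Theoretical single-precision peak in GFLOPS."""
    return cuda_cores * shader_mhz * 2 / 1000.0

def peak_bandwidth_gbs(mem_mhz, bus_bits):
    """Theoretical memory bandwidth in GB/s (bus width converted to bytes)."""
    return mem_mhz * 2 * (bus_bits / 8) / 1000.0

cards = {
    "GeForce GTX 480": {"cuda_cores": 480, "shader_mhz": 1401, "mem_mhz": 1848, "bus_bits": 384},
    "GeForce GTX 470": {"cuda_cores": 448, "shader_mhz": 1215, "mem_mhz": 1674, "bus_bits": 320},
}

for name, c in cards.items():
    gflops = peak_sp_gflops(c["cuda_cores"], c["shader_mhz"])
    gbs = peak_bandwidth_gbs(c["mem_mhz"], c["bus_bits"])
    print(f"{name}: {gflops:.2f} GFLOPS SP peak, {gbs:.2f} GB/s peak")
```

Under those assumptions the GTX 480 comes out at roughly 1345 GFLOPS and 177 GB/s, and the GTX 470 at roughly 1089 GFLOPS and 134 GB/s. Real kernels will land well below these ceilings.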

Note also that we have GLSL- and OCL-like vec4 types in C++ libraries:
*GLM has strict GLSL compliance..
with the experimental GMX extensions we even have SIMD implementations..
*DX SDK feb 2010 has XNAMATH 2.02 SIMD math library
also read:
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

HDR good maps:

http://www.hdrlabs.com/sibl/archive.html

Nvidia employees' blogs:

http://timothylottes.blogspot.com/
http://jamesdolan.blogspot.com/
http://industrialarithmetic.blogspot.com/
http://castano.ludicon.com/blog/

http://twitter.com/castano

http://twitter.com/tmurray_cmpxchg

showing max cuda mem:

http://forums.nvidia.com/index.php?showtopic=102682 cuda maxmem

caustics patents:

US patent applications: 20090096788, 20090096789, and especially 20090128562,

The LLVM 2.7 binaries are available for testing:

http://llvm.org/pre-releases/2.7/pre-release1/

http://amnoid.de/tmp/clangtut/tut.html

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2009-May/005167.html

http://synopsis.fresco.org/

Performance inconsistencies when testing various bit-counting methods

ubuntu cheat cube:119834-cheat-cube-ub

ie9 VML to SVG Migration Guide

windows phone 7:

*XNA 4.0 CTP available; it works on PC, but only with the Reach profile, not HiDef..

*unlocked image with all apps: instructions on a blog..

*Petzold samples and book excerpt available..

*also a SQLite port -> csharp-sqlite.wp

Windows 7 XP Mode now supports CPUs without hardware virtualization (Intel VT-x / AMD-V)..

Windows 7 SP1 virtualization news:

"With Microsoft RemoteFX, users will be able to work remotely in a Windows Aero desktop environment, watch full-motion video, enjoy Silverlight animations, and run 3D applications," Microsoft's Max Herrmann writes, "All with the fidelity of a local-like performance when connecting over the LAN."

cuda will work with it? i.e. no need for compute cluster driver and also ogl,dx and interop support..

Q: Will RemoteFX also support OpenGL hardware acceleration, which is the high-level 3D API used by professional applications like CAD systems or medical applications?

A: RemoteFX will support certain OpenGL applications. However, as the development of RemoteFX is still ongoing, it is too early to provide any specifics at this point.
Q: Do you plan to introduce RemoteFX for Windows 7 as well, because there are many scenarios where the remote system is not a server but a high-end workstation?
A: RemoteFX has been designed as a Windows Server capability to support the growing demand for multi-user, media-rich centralized desktop environments. Windows 7 will be supported as a virtual guest OS under Hyper-V.

Dynamic Memory is an improvement to Hyper-V which allows users to pool all available physical host memory together, and dynamically allocate it to virtual machines. In other words, if the workload changes, VMs can get access to extra memory without having to shut them down.

XNA forums:

Updated list of D3D12 suggestions

Unable to perform a recursive call with DirectCompute?

How to AttachBuffersAndPrecompute to ID3DX11FFT

RWStructuredBuffer counter

IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).

Gamefest 2010 presentations?

D3D11 / D2D Interoperativity

329M pairs/sec radix sort performance, 408M keys/sec - crushes CUDPP numbers

AppendStructuredBuffer driver bug?

How to debug DirectX 11 Compute Shaders?

Creating a Shared Surface with DXGI

atomic
I have some questions about RWStructuredBuffer:
1. How to copy the hidden counter to system memory? Answer: CopyStructureCount
2. How to reset the counter to zero? Answer: the last argument of OMSetRenderTargetsAndUnorderedAccessViews
3. Why is this counter so much faster than InterlockedAdd on a buffer element? (HD 5670) IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).

http://gephi.org/

http://forums.xna.com/forums/t/49607.aspx
Thank you. I forgot about debug version of the D3DCSX. Debug message proved to be helpful. For the record: 1. The number of buffers attached must be exactly the same as in D3DX11_FFT_BUFFER_INFO. 2. The views MUST be created with the D3D11_BUFFER_UAV_FLAG_RAW flag (although it wasn't mentioned in documentation).

The Chrome dev channel release has support for an Open GL ES 2.0 interface for Native Client. This is something we said we would do sometime last year. When we consider it stable, documented etc. we will do more of an announcement.

Google are announcing that NaCl now also supports x86-64 and ARM.

http://www.osnews.com/story/23021/Native_Client_Portability_Almost_Native_Graphics_Layer_Engine

NaCl_SFI: Adapting Software Fault Isolation to Contemporary CPU Architectures

pnacl: Portable Native Client Executables

from GDC:

these are also graphics API translation layers:

Cider & Cedega: Direct3D on OpenGL

GameTree.tv: Direct3D on OpenGL ES

SwiftShader: DX Software Rendering (also WARP)

ANGLE Project: WebGL (OGL ES 2.0) on Direct3D

now we need GPGPU apis so:

cuda on opencl?

cuda on cal?

directcompute on opencl?

opencl on directcompute?

posted on opengl and cuda forums:

Questions to nvidia:

*Is Nvidia going to expose ext_gpu_shader_fp64 on GT2xx hardware with double precision, or is it only for D3D11 hardware?

For example gtx275

AMD seems to support double precision in GLSL via doublepAMD even on 4850 cards..

Also, with the initial GL 4.0 drivers, is Nvidia finally going to publish documentation for wgl_nv_dx_interop and expose the texture writing and random access support shown at GTC?

via ext_image_load_store?

Please post PTX 1.5 and 2.0 documents..

Also I'm summing here things promised soon by Nvidia so let's see how much it takes before we get:

*cuda-gdb support for hardware debugging of OpenCL kernels

*cuda-gdb GPU debugger for Mac (with OpenCL support also)

Mac related:

Is mac 64 supported?

This package will work MAC OSX running 32/64-bit.

CUDA applications built in 32/64-bit (CUDA Driver API) is supported.

CUDA applications built as 32-bit (CUDA Runtime API) is supported.

(10.5.x Leopard and 10.6 SnowLeopard)

Note: x86_64 is not currently working for Leopard or Snow Leopard

CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.

CUDA applications using CUDA Runtime APIs can only be built on 32-bit applications.

My mac notes:

nvcc matrixMul_kernel.cu matrixMulDrv.cpp -I../../common/inc/ ../../lib/libcutil_i386.a matrixMul_gold.cpp -Xlinker /usr/local/cuda/lib/libcuda.dylib

nvcc matrixMul_kernel.cu -c -m64

g++ matrixMul_gold.cpp matrixMulDrv.cpp -I../../common/inc/ -I$CUDA_INC_PATH -L$CUDA_LIB_PATH /usr/local/cuda/lib/libcuda.dylib ../../lib/libcutil_i386.a

for nvcc -m64, create lib64 as a copy of lib

nvcc -m64 deviceQueryDrv.cpp -I../../common/inc/ -I../../../shared/inc -Xlinker /usr/local/cuda/lib/libcuda.dylib

remove cutil

nvcc defaults to 32 bits

gcc defaults to 64 bits

g++

g++ deviceQueryDrv.cpp -I../../common/inc/ -I../../../shared/inc /usr/local/cuda/lib/libcuda.dylib -I$CUDA_INC_PATH

//#include

#define CU_SAFE_CALL_NO_SYNC(a) a

//CUT_EXIT(argc, argv);

export CUDA_BIN_PATH=/usr/local/cuda/bin

export CUDA_LIB_PATH=/usr/local/cuda/lib

export CUDA_INC_PATH=/usr/local/cuda/include

export PATH=$PATH:/usr/local/cuda/bin


Thursday, 18 March 2010

raw data..

Posted on 11:32 by Unknown

games:
*Metro 2033 and Just Cause 2 demo available! (Fermi launch titles?)
*Assassin's Creed 2 and Bad Company 2 this month also..
*Command & Conquer 4: Tiberian Twilight
*3D Vision CD 1.23 has Direct3D 11 support! (so it lists support for the D3D11 Fermi supersled demo)

iexplore 9 preview with direct2d directwrite support

*3D texture based separable convolution, extension of SDK example
code:
http://forums.nvidia.com/index.php?showtopic=163382

*the binary format for Fermi is similar to PTX: see Luebke's post on the gpgpu-sim mailing list
One guy from PathScale says he has all the info on this and other low-level details, presumably the PTX 1.5/2.0 specs (bin format spec?), and also info for an open source CUDA driver for BSD etc.?
*GPGPU-3 papers available!
http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-FinalProgram.pdf
*CULA 1.2 available with some eigenvector/eigenvalue routines..

*"GPU Sample Sort" paper for the upcoming IPDPS 2010 conference?

It is possible to achieve much higher sorting rates for NV devices than with the Satish/CUDPP methods. You might be interested in our radix CUDA sorting results here at UVA. We demonstrate 480M pairs/sec, and 550M keys/sec on our GTX285 (with other devices evaluated as well). Interestingly enough, our keys-only results on the NV GT200 architecture are superior to the cycle-accurate sorting results from the (defunct) 32-core Larrabee.

Where is source?

http://www.cs.virginia.edu/~dgm4d/papers/RadixSortTR.pdf
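
For readers who want to see what is being measured in "pairs/sec", here is a minimal CPU sketch of a least-significant-digit radix sort over key-value pairs. It is not the UVA or CUDPP implementation; it only shows the histogram, prefix-sum and scatter passes that radix sorts are built from. The function name, the 4-bit digit size and the 32-bit key width are assumptions chosen for illustration.

```python
def radix_sort_pairs(keys, values, key_bits=32, digit_bits=4):
    """Stable LSD radix sort of (key, value) pairs with unsigned integer keys."""
    pairs = list(zip(keys, values))
    radix = 1 << digit_bits
    mask = radix - 1
    for shift in range(0, key_bits, digit_bits):
        # Pass 1: histogram of the current digit.
        counts = [0] * radix
        for k, _ in pairs:
            counts[(k >> shift) & mask] += 1
        # Pass 2: exclusive prefix sum turns counts into output offsets.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Pass 3: stable scatter of each pair to its slot.
        out = [None] * len(pairs)
        for k, v in pairs:
            d = (k >> shift) & mask
            out[offsets[d]] = (k, v)
            offsets[d] += 1
        pairs = out
    return [k for k, _ in pairs], [v for _, v in pairs]

sorted_keys, sorted_values = radix_sort_pairs([170, 45, 75, 90, 802, 24, 2, 66], list("abcdefgh"))
print(sorted_keys, sorted_values)
```

A "keys-only" rate corresponds to running the same passes without carrying the values along, which is why it is the higher of the two numbers quoted.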

Other new sorting papers:
*Revisiting Sorting for GPGPU Stream Architectures
*"GPU Sample Sort" paper for the upcoming IPDPS 2010:
N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In Proc. Int'l Parallel and Distributed Processing Symposium (IPDPS), to appear, 2010 (currently available at http://arxiv1.library.cornell.edu/abs/0909.5649).

*CUFFT does support streams, and it seems to include the 3D FFT perf improvements from the SC08 paper, so
Apple's FFT code now seems to work on Nvidia OpenCL but is 2x-3x slower than CUFFT..

2d to 3d video conversion:
we have reald and other directshow plugin..
now:
arsoft sim 3d plus hd coming q2..
and powerdvd 10..
*TrueTheater™ Stabilizer
*TrueTheater™ 3D
*TrueTheater Noise Reduction

PowerDVD 10 Mark II: Consumers who purchase PowerDVD 10 Ultra 3D will receive a FREE UPGRADE that enables support of the Blu-ray 3D format and 2D to 3D conversion of video files. Available this summer.
Blu-ray 3D playback requires FREE "Mark II" upgrade which will be available soon.

lot of betas coming:
qt 4.7
intel compiler 12
vmware workstation 7.1
other march:
openrl
heaven 2.0

http://www.cs.utk.edu/~dongarra/WEB-PAGES/cscads-libtune-09/
1st CUDA Developers' Conference
http://www.smithinst.ac.uk/Events/CUDA2009
see
"Looking after the 7 dwarfs: numerical libraries / frameworks for GPUs" Mike Giles
also
"The Art of Performance Tuning for CUDA and Manycore Architectures"
David Tarjan (NVIDIA)
Kevin Skadron (U. Virginia)
Paulius Micikevicius (NVIDIA)

cudpp 1.1.1 svn has fermi support
cusp has amg geometric multigrid..
http://forums.nvidia.com/index.php?showtopic=163382&st=0&#entry1022104

See DirectX 9.0 on OpenGL ES 2.0 ->http://www.gametree.tv/ linux sdk

Coming in Spring 2010, the GameTree.tv Publishing SDK for Intel CE hardware will include the tools you need to optimize and debug your game for the GameTree.tv Gaming Platform, plus the ability to order Intel CE hardware.

Developer Tools & Documentation     available     available
OpenGL ES 1.1 and 2.0
- Windows Game Development and Emulation
- Linux Desktop Runtime SDK     available     available
Direct3D® support
- Fixed-Function
- Shader Model 1.0 and 2.0 API
- Linux Desktop Emulation SDK     available     available
Debugging With Visual Studio     Coming March 2010     available
GameTree.tv Developer Forums     Coming Soon     available
Publish Games For Commercial Sale
Detailed Hardware Setup Documentation
Hardware Order Process
Developer Relations Support

fglrx 8.72.5 has Ubuntu 10.04 support and OpenGL 3.2.97xx (partial OpenGL 3.3/4.0 support?)

Nvidia theater GDC notes:
DMM2:
*free for up to 1500 objects (Star Wars: The Force Unleashed does not use more); has interop with PhysX and Bullet
*also adds DirectCompute and OpenCL simulation

*beta shipping in September/October
*plastic simulation and fracture mode are still not ready.. it calculates stress on the volume, so breaking is physically based..
*uses fp32 for GPU support, and SSE..

3D Vision on Unreal Engine 3 shipping in April..
3D Vision SDK coming soon, with code samples and developer tricks for Surround
Surround recommends GTX 400 in SLI and the "release 256" driver

Khronos GDC sessions have been published, with
info on AMD OpenCL physics: SPH and soft bodies, no rigid bodies, this is Bullet work..
also FEM simulation, which is DMM2 work..
any more interesting talk slides?: FFT profiling for OpenCL by an Nvidia employee

PhysXLab with destruction (precalculated) is in beta now, with Unreal Engine 3 integration

New Unigine 2.0 on the 26th of this month: Linux support? And Windows OpenGL tessellation support with Fermi / 5xxx cards?
Nsight: GTX 480, March 8 release
Nexus 1.0 OpenGL and OpenCL analyzer: not a hardware debugger, but like gDEBugger GL+CL

Fermi games: Just Cause 2 (D3D10 only), Metro 2033 (D3D11 optional)
http://nvidia.fullviewmedia.com/gdc2010/agenda.html
OpenGL 4.0 support in OpenGL Extensions Viewer and in GLEW trunk!
Assassin's Creed 2, Bad Company 2
ATI Open 3D
Nvidia 3DTV

cuda and visual studio:

QUOTE

- create empty cuda projects through "project.."

You can just create an ordinary console project and then add .cu files to this project (see next point).

QUOTE

- add new .cu files through "add new item" (renaming c++ or txt in .cu files causes build errors)

If you add the CUDA build rules (Cuda.rules, distributed with the SDK) then VS will automatically detect the .cu files and pass them to nvcc to compile these to standard .obj files, the standard linker (link.exe) will then link these with the rest of your application's .obj files.

QUOTE

- doesn't highlight code in .cu files

See the instructions in (SDK_INSTALL_DIR)\C\doc\syntax_highlighting\visual_studio_8

QUOTE

- must copy a thousand times cutil64.dll around till it releases the program ...

Cutil is used to minimise code replication between the SDK samples. I'd advise understanding what you actually need and implementing it yourself. For example, most people only want the cuda safe call macros and you would be better off handling the error in a manner suitable for your app rather than just calling exit().

QUOTE

- must add a "thousand" new libraries not to cause build errors

By "thousand" do you mean one (cudart.lib)?! Ok, so you're using cutil so you need cutil64.lib too. But by definition using any library (and the CUDA API is provided through a library) you have to link with libraries.

QUOTE

- and even then its not sure if it runs

Can't help with that one (without more info).

I would advise the following.

Preparation:

Set up syntax highlighting
Set up Intellisense

Development:

Create a new, empty, console project (or you can use an existing project if you have one)
Add your .c, .cpp and .cu files
Add the Cuda.rules
Modify C/C++ code generation to use /MT in release, /MTd in debug
Do the same for the Cuda code generation
Add cudart.lib to all configurations (i.e. release and debug)
Build, run, debug etc.

Proceedings of 24th IEEE International Parallel and Distributed Processing Symposium

gpu papers:

Session 2: Scientific Computing with GPUs
Improving Numerical Reproducibility and Stability in Large-Scale Numerical Simulations on GPUs
Implementing the Himeno Benchmark with CUDA on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters
Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs

A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs

Sort
High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
GPU Sample Sort
Highly Scalable Parallel Sorting

Session 9: Software Support for Using GPUs
Object-Oriented Stream Programming using Aspects
Optimal Loop Unrolling For GPGPU Programs
Speculative Execution on Multi-GPU Systems
Dynamic Load Balancing on Single- and Multi-GPU Systems

Fisheye Lens Distortion Correction on Multicore and Hardware Accelerator Platforms
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

Dynamically Tuned Push-Relabel Algorithm for the Maximum Flow Problem on CPU-GPU-Hybrid Platforms

Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA

Inter-Block GPU Communication via Fast Barrier Synchronization


What's left in OpenGL 4.0? and more raw info..

Posted on 11:01 by Unknown

Some days ago the OGL 3.3 and 4.0 specs were published, together with GLSL 3.3 and 4.0, and a set of equivalent ARB extensions was put in the registry.. The OGL 4.0 compatibility spec is now 600+ pages long; core is 420 pages..

Other things:

*OGL 4.0 quick reference card

http://www.khronos.org/files/opengl4-quick-reference-card.pdf

*new glext.h and gl3.h updated
*glloader, GLEW on svn, and OpenGL Extensions Viewer already support 3.3/4.0..
still waiting for SDL, SFML..
*Waiting for Fermi drivers on launch day..

Remember these are all ARB extensions, no vendor or EXT ones..

ARB extensions not included in the 3.3/4.0 specs are:

GL_ARB_shading_language_include
GL_ARB_texture_compression_bptc

so the D3D11 HDR texture format is not required for OGL 4.0..

#include in shaders is also left out..

Do the 5xxx series cards get OGL 4.0 by emulating double on the CPU? Double-float emulation would be better..
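
Double-float emulation usually means carrying a value as an unevaluated sum of two single-precision floats, a high part and a low part holding the rounding error. The sketch below simulates float32 arithmetic in Python (via `struct`) to show the idea; on a GPU the same steps would run on native floats. The helper names (`f32`, `two_sum`, `df_add` and so on) are hypothetical, and this is a simplified addition, not any vendor's implementation.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Knuth's error-free float32 addition: s is the rounded sum, e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def df_from_double(x):
    """Split a double into a (hi, lo) pair of float32 values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo

def df_add(x, y):
    """Add two (hi, lo) pairs; a simplified variant that folds the low parts into the error term."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo

def df_to_double(x):
    return x[0] + x[1]

one = df_from_double(1.0)
tiny = df_from_double(1e-10)
print("float32 alone:", f32(1.0 + 1e-10))
print("double-float :", df_to_double(df_add(one, tiny)))
```

Plain float32 loses the small addend entirely, while the (hi, lo) pair keeps it in the low part, which is what makes this attractive on hardware without native fp64.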

The extensions last spotted from Nvidia:

GL_EXT_shader_image_load_store

GL_EXT_vertex_attrib_64bit

and from AMD:

GL_EXT_shader_atomic_counters

are not found..

AMD 10.3 also includes a first extension, blend_func_extended..

GL_EXT_vertex_attrib_64bit adds fp64 vertex attribs:

so for now fp64 is only for uniforms and passing values, not for vertex attribs

remember there are no double render target or texture formats, similar to D3D11..

GL_EXT_shader_image_load_store allows random access writes to textures, like RWTexture3D
amd has amdx_random_access_target

ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11..

We have tesselation shaders, dynamic shader linkage and compute interop with OCL..

Still lacking vs D3D11:
*multi-threaded rendering:
remember current drivers only support creation of resources.. no parallel command list creation
is it a driver or hardware issue?

*random access load/store/atomics to textures -> GL_EXT_shader_image_load_store, amdx_random_access_target + GL_EXT_shader_atomic_counters, RWTexture3D
*atomic access to textures and memory barriers in fragment shaders: DeviceMemoryBarrier in D3D11
*GL_AMD_conservative_depth adds:

Conservative oDepth - This algorithm allows a pixel shader to compare the per-pixel depth value of the pixel shader with that in the rasterizer. The result enables early depth culling operations while maintaining the ability to output oDepth from a pixel shader.

So people on the OGL forums are criticizing the lack of:
*multi-threaded rendering
*shader binaries to avoid compilation, preferably cross-vendor and cross-platform like DX IL / DXBC (which is almost 100% compatible with ATI's IL)

*direct state access
* Epic fail for GL_ARB_sampler_objects, as there is no GLSL support..

I lack:

*ext_separate_shader_objects

The ability to separate program objects is only going to become increasingly more relevant.

*nv_texture_barrier

crossprocess texture sharing?

Support for programmable offsets in gather is there see 2x speedup in Fermi whitepaper and tesselation

fermi test would be good

fermi:

196.78 drivers support fermi..

full support for OGL 4.0 in fermi launch..

Stochastic Transparency (I3D 2010) has Fermi perf numbers for this algorithm via the OGL sample_shading (10.1) extension

GLwgl_dx_interop

GL_NVX_gpu_memory_info

GL_NV_gpu_program4_1

published then?

try openrl with opencl on fermi..

opencl drivers at fermi launch will have:

1.cuda 3.0 final

Fermi Direct3D 11 interoperability

Fermi HW Profiler support in OpenCL Visual Profiler

Complete BLAS lib, now with complex routines

cuda-gdb support for JIT compiled kernels

add

C++ Class Inheritance

C++ Template Inheritance

Unified interoperability API for Direct3D and OpenGL

OpenGL texture interoperability

2 with new opencl driver support:

*pragma unroll

*local atomics

*icd final

*d3d9/10/11 support

fxc has support for interfaces, but how are the functions inside them called?

see "CUDA_Developer_Guide_for_Optimus_Platforms"

http://www.stumblingahead.com/blog/?p=66 talking about tesselation soon..

2010 conferences GPU papers:

*PPoPP

*GDC 2010

*I3D 2010

*GPGPU-3

*ASPLOS

MacroSS: Macro-SIMDization of Streaming Applications,

COMPASS: A Programmable Data Prefetcher Using Idle GPU Shaders,

"Investigating the Impact of Code Generation on Performance Characteristics of Integer Programs."

EUROGRAPHICS 2010

SIGGRAPH 2010

Interesting new/coming books:

*Game Programming Gems 8

*gpu computing gems 2010?

*Game Engine Gems 1, Volume One

*Programming Massively Parallel Processors: A Hands-

*GPU Pro: Advanced Rendering Techniques

*Multigrid Methods on GPUs

*Game Coding Complete, Third Edition

*Video Game Optimization

*Game Engine Architecture

*Real-Time Cameras

Programming Game AI by Example

Comments:

GL_ARB_shading_language_include -> GLSL accepts #include, and CompileShaderIncludeARB sets the search paths for <> includes

GL_ARB_texture_compression_bptc

D3D11 texture formats -> a compressor is included, but compressing offline is better

GL_ARB_blend_func_extended

allows using two fragment shader outputs, one as the source color and one as the blend factors

see the example of a colored reflective window done in one pass using the ROPs
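
A small numeric sketch of that one-pass window idea: with dual-source blending the shader writes two colors, and the blend stage can use the second one as a per-channel factor on the framebuffer. The example assumes the blend function is set to source factor ONE and destination factor SRC1_COLOR; the function name and the sample colors are made up for illustration.

```python
def dual_source_blend(src0, src1, dst):
    """Blend with factors (ONE, SRC1_COLOR): result = src0 * 1 + dst * src1, per channel.

    src0: first shader output (here, the light reflected by the glass)
    src1: second shader output (here, the per-channel tint the glass applies)
    dst:  color already in the framebuffer (the scene behind the glass)
    """
    return tuple(min(1.0, s0 + d * s1) for s0, s1, d in zip(src0, src1, dst))

reflection = (0.1, 0.1, 0.1)   # hypothetical reflected light
tint = (1.0, 0.5, 0.0)         # hypothetical orange glass transmission
background = (0.8, 0.8, 0.8)   # hypothetical scene color behind the window

print(dual_source_blend(reflection, tint, background))
```

With a single-source blend the tint would have to be a scalar alpha or a constant color, so the per-pixel, per-channel filter would need a second pass.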

GL_ARB_explicit_attrib_location ->

sets explicitly in GLSL how variables are passed between shaders

GL_ARB_occlusion_query2

allows a boolean query for whether anything passes or not

GL_ARB_sampler_objects

BindSampler( uint unit, uint sampler );

When a sampler object is bound to a texture unit, its state supersedes that of the texture object bound to that texture unit. If the sampler name zero is bound to a texture unit, the currently bound texture's sampler state becomes active. A single sampler object may be bound to multiple texture units simultaneously.

it does not change GLSL to the HLSL style with tex.sampler
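
The binding rule quoted above can be modeled in a few lines. This is a toy model in plain Python, not OpenGL calls: the dictionaries and the function name are hypothetical, and sampler name 0 stands for "no sampler object bound", as in the spec text.

```python
# Hypothetical model: each texture unit remembers the sampler state stored in
# its bound texture object and the name of its bound sampler object (0 = none).

def effective_sampler_state(unit, sampler_objects):
    """Return the sampler state a texture unit would use under ARB_sampler_objects."""
    sampler_name = unit["sampler"]
    if sampler_name != 0:
        # A bound sampler object supersedes the texture's own sampler state.
        return sampler_objects[sampler_name]
    # Sampler name zero: the bound texture's sampler state is active.
    return unit["texture_state"]

sampler_objects = {1: {"min_filter": "NEAREST", "wrap": "CLAMP_TO_EDGE"}}
units = [
    {"texture_state": {"min_filter": "LINEAR", "wrap": "REPEAT"}, "sampler": 0},
    {"texture_state": {"min_filter": "LINEAR", "wrap": "REPEAT"}, "sampler": 1},
    {"texture_state": {"min_filter": "LINEAR_MIPMAP_LINEAR", "wrap": "REPEAT"}, "sampler": 1},
]

for i, unit in enumerate(units):
    print(i, effective_sampler_state(unit, sampler_objects))
```

Units 1 and 2 share one sampler object, so the same texture could be sampled with different filtering on another unit without duplicating the texture.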

GL_ARB_shader_bit_encoding

with this I can use the fast float-to-int trick from Kun Zhou's SPAP paper, which takes the bits

of a float and manipulates them to get abs, float-to-int of the value, etc..

To obtain signed or unsigned integer values holding the encoding of a

floating-point value, use:

genIType floatBitsToInt(genType value);

genUType floatBitsToUint(genType value);

Conversions are done on a component-by-component basis.
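As a rough CPU-side illustration of what these built-ins make possible, the C sketch below reinterprets a float's bits and clears the sign bit to get an absolute value. The helper names are hypothetical, it assumes 32-bit IEEE 754 floats, and it is not claimed to be the exact trick from the SPAP paper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same idea as GLSL floatBitsToInt: copy the bits, do not convert the value. */
static int32_t float_bits_to_int(float value) {
    int32_t bits;
    assert(sizeof(float) == sizeof(int32_t));
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Same idea as GLSL floatBitsToUint. */
static uint32_t float_bits_to_uint(float value) {
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Inverse direction: rebuild a float from its encoding. */
static float uint_bits_to_float(uint32_t bits) {
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* abs() done on the encoding: bit 31 is the sign bit, so mask it off. */
static float abs_via_bits(float value) {
    return uint_bits_to_float(float_bits_to_uint(value) & 0x7fffffffu);
}
```

For example, float_bits_to_uint(1.0f) gives 0x3f800000 and abs_via_bits(-2.5f) gives 2.5f.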

GL_ARB_texture_rgb10_a2ui

GL_ARB_texture_swizzle

GL_ARB_timer_query

GL_ARB_vertex_type_2_10_10_10_rev

GL_ARB_draw_indirect

compute interop

void DrawArraysIndirect(enum mode, const void *indirect);

new buffer object binding

DRAW_INDIRECT_BUFFER

if a buffer is bound there,

it is used as the source of the element count etc..

if not,

then the indirect pointer itself is used?..
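For reference, the data that DrawArraysIndirect reads is a small fixed-size record. The sketch below mirrors the four-uint layout as I recall it from the ARB_draw_indirect spec; the make_draw_command helper is hypothetical and only there to show how a compute kernel or the CPU would fill it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one indirect draw record (field names as I recall them from
   the ARB_draw_indirect spec; treat them as an assumption). */
typedef struct {
    uint32_t count;              /* number of vertices to draw */
    uint32_t primCount;          /* number of instances */
    uint32_t first;              /* index of the first vertex */
    uint32_t reservedMustBeZero; /* must be written as 0 */
} DrawArraysIndirectCommand;

/* Hypothetical helper: build a record that would be stored in the buffer
   bound to DRAW_INDIRECT_BUFFER. */
static DrawArraysIndirectCommand make_draw_command(uint32_t vertex_count,
                                                   uint32_t instances,
                                                   uint32_t first_vertex) {
    DrawArraysIndirectCommand cmd;
    assert(instances > 0);
    cmd.count = vertex_count;
    cmd.primCount = instances;
    cmd.first = first_vertex;
    cmd.reservedMustBeZero = 0;
    return cmd;
}
```

This is what makes the extension interesting for compute interop: a kernel can write the vertex count into the buffer and the draw call picks it up without a CPU readback.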

GL_ARB_gpu_shader5

GL_ARB_gpu_shader_fp64

Should double-precision fragment shader outputs be supported?

RESOLVED: Not in this extension. Note that we don't have

double-precision framebuffer formats to accept such values.

GL_ARB_shader_subroutine

GL_ARB_tessellation_shader

GL_ARB_texture_buffer_object_rgb32

GL_ARB_transform_feedback2

1.transform feedback objects

2.pause and resume transform feedback

3.ability to draw primitives captured in transform feedback mode without querying the captured

primitive count

DrawTransformFeedback()

GL_ARB_transform_feedback3

unreal 3 news:

*palm webos and iphone support (on mac?)

*3d vision support

http://www.chw.net/2010/02/29-incomodas-preguntas-para-nvidia-sobre-gf100/

AMD Open Physics Initiative Expands Ecosystem with Free DMM for Game Production and Updated version of Bullet Physics

Apple adopts DirectX 11 GPUs, buys AMD Radeon HD 5750

apple news:

*99 dev program

*valve games to mac next month and monkey island 2 se..

*6core macpro next week (12 core?)Mac Pro 'hexacore' Xeon Core i7-980x coming Tuesday

reviews on anandtech 980 gulftown with aes today..

*amd 5750 imac in june? adds opengl 4.0 and ocl full support for mac..

so 10.6.4 will support amd 5xxx

*iphone 4.0 multitasking support

*10.6.3 this month?

CUDA:cuda-gdb gpu support and visual profilers,64 bit and efficient gl interop soon?

http://pasco2010.imag.fr/images/poster_pasco2010.pdf

http://unlimiteddetailtechnology.com/

Roxio CinePlayer 3D

CLyther = Python + OpenCL

amd open physics (free dmm 2.0 with ocl) and open stereo(qbf stereo for radeon?)

also eyefinity sdk coming soon..

ticker tape available

pgi insider feb 2010 volume

http://www.pgroup.com/lit/articles/insider/v2n1a3.htm

says new fermi support and data region things..

XNA 4.0 winpho 7 tegra2 soon..

Yellow Dog Enterprise Linux for CUDA

http://ydl.net/cuda/iso/YDELforCUDA-6.2-20100302-DVD.iso download free for students

Jenkins Software Announces Data Mining Tool for Game Developers

As a further enhancement, AMD has developed new parallel GPU accelerated implementations of Bullet Physics’ Smoothed Particle Hydrodynamics (SPH) Fluids and Soft Bodies/Cloth. The new code written in OpenCL and Direct Compute will be contributed as open source.

OpenGL usage from an ISV perspective

intel gpa 3.0

Unity Announces 3.0 Platform, Support For PS3, iPad, And Android

Valve Confirms Mac Versions Of Steam, Valve Game

http://www.raknet.net/echochamber

Erwin Coumans - SONY - Porting existing code to OpenCL

Ben Gaster AMD and Avi Shapira - Graphic Remedy - Debugging fluid dynamics on OpenCL

Greg Smith - NVIDIA - FFT and OpenCL Profiling

http://www.arm.com/community/software-enablement/google/solution-center-android.php

http://realworldtech.com/forums/index.cfm?action=detail&id=108017&threadid=108017&roomid=2

I can only say that at CAL level (and obviously OpenCL built upon CAL) there are numerous problems with multiple GPUs.

You definitely need one thread and one context per GPU to make it work. But that isn't enough, because almost every CAL function is not thread safe, so calling calResMap() (which is the only way to get access to local GPU memory) in one thread blocks all other threads/contexts.

And (as I've already written on these forums), OpenCL uses the calCtxWaitForEvent() function instead of a CPU-burning loop

while (calCtxIsEventDone(calCtx, e) == CAL_RESULT_PENDING);

to wait for GPU kernel completion.

But calCtxWaitForEvent() also blocks every context currently running. This is especially noticeable when there are different devices in the system (like 5770+4770). So basically it's simply impossible to work asynchronously with multiple GPUs within a single process.

All of the above applies to the Windows version of CAL; I never tried the Linux one.

Yup, and I use 1 thread per GPU too. So 1 thread, 1 context, 1 queue for each GPU. I tried other configurations but they weren't working (i.e. not running in parallel).

Why on HD4870 with 512 MB onboard RAM only 128 available to OpenCL ???

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=128846&enterthread=y

http://ctk-dev.sourceforge.net/

gmac

http://code.google.com/p/fluidic/

http://otoy.com/

http://www.gameenginegems.com/

We're excited to announce a new addition to the Palm® webOS™ development platform: the webOS Plug-in Development Kit (PDK) lets developers extend their webOS applications by writing plug-ins in C or C++. The webOS PDK makes it easy for developers to leverage existing code and exposes new capabilities — including high-performance 3D graphics.

http://code.google.com/p/gyp/source/checkout

Sunday, 7 March 2010

GPU computing toys!

Posted on 09:12 by Unknown

Hi I would like to release some lame but hopefully useful tools:
https://dl.dropbox.com/u/1416327/cld3d.rar

First, OCL-D3D interop headers and spec for Nvidia and AMD, and a tool for checking current status:
the headers are in h
and cover d3d9,10,11 for NV and d3d9,10 for AMD..
#include the header for each d3d version and call initcld3d() in your code and voila, you have the
d3d stuff..
if you #define INCAMD the AMD functions are included too and you can avoid the AMD headers..

with these I have compiled four exes named cl_xx_interop which check d3d 9,9Ex,10 and 11..
they check extension reporting, try to create a shared context in several ways, then associate a d3d object and textures with ocl and acquire and release it prior to use..

Also the cl_d3d10_interop build shows the image formats available to OpenCL images, see next post..

Testing OCL-D3D11 interop
Checking D3D interop extensions support for platform: NVIDIA Corporation
nv D3D 9 interop extension: Found.
nv D3D 10 interop extension: Found.
nv D3D 11 interop extension: Found.

Using device: GeForce GTX 275
Enabling texture interop checks: image support is supported.
clGetDeviceIDsFromD3D11NV pointer: Found
and it works! (returns d3d associated ocl device)
clCreateFromD3D11BufferNV pointer: Found
clCreateFromD3D11Texture2DNV pointer: Found
clCreateFromD3D11Texture3DNV pointer: Found
clEnqueueAcquireD3D11ObjectsNV pointer: Found
clEnqueueReleaseD3D11ObjectsNV pointer: Found
Testing context creation with
no dev (clCreateContextFromType): OK.
dev info (getdeviceids): OK.
dev info (clGetDeviceIDsFromD3DNV CL_PREFERRED_DEVICES_FOR_D3D9_NV): OK.
Testing clCreateFromD3D11BufferNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.
Testing clCreateFromD3D11Texture2DNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.
Testing clCreateFromD3D11Texture3DNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.

It also contains optd3d, which displays the four optional d3d11 features (cap bits):

On my GTX 200 series card it displays:

multithreaded comand lists: 0
multithreaded Concurrent Creates: 1
Double precision: 0
Compute Shader: 1

On an ATI 5850 it displays:

multithreaded comand lists: 0
multithreaded Concurrent Creates: 1
Double precision: 1
Compute Shader: 1

Anyway, double precision is not working with loops..

This shows multithreaded command lists are still not supported by ATI (is this supposed to be an implementation issue or a hardware limitation?)..

Same as Nvidia and the upcoming Fermi..

I include a CLinfo, not mine, for checking CL info..

report.bat creates a report.txt with the output of all these executables..

I also include 2dbench for checking the GDI perf issues in Windows 7.. AMD will fix them in Catalyst 10.4..

There is also a highly efficient matmul for CUDA and AMD cards, and peakflops for AMD cards..

%
% compute C = A*B, A:mxk, B:kxn, C:mxn
%
% cubin file = ../method1/decuda_ldsb32_cudasm.cubin
% kernel function = method1_variant_sgemmNN
% use device: GeForce GTX 275
% m=n=k gpu_time (ms) flops (Gflops/s)
   32 0.044 1.391
   128 0.120 32.451
   224 0.194 107.870
   320 0.302 201.802
   416 0.445 301.033
   512 0.619 403.979
   608 1.277 327.914
   704 1.582 410.719
   800 2.618 364.210
   896 3.135 427.439
   992 4.401 413.123
   1088 6.014 398.868
   1184 6.981 442.860
   1280 8.751 446.365
   1376 10.911 444.746
   1472 13.403 443.262
   1568 16.377 438.470
   1664 18.901 454.051
   1760 22.437 452.594
   1856 25.820 461.218
   1952 31.233 443.566
   2048 33.317 480.229
   2144 39.834 460.841
   2240 44.989 465.337
   2336 51.643 459.765
   2432 56.514 474.095
   2528 64.183 468.859
   2624 72.540 463.923
   2720 79.686 470.387
   2816 85.826 484.626
   2912 96.003 479.094
   3008 108.801 465.942
   3104 121.579 458.181
   3200 126.446 482.699
   3296 138.522 481.473
   3392 153.544 473.440
   3488 168.797 468.268
   3584 177.873 482.085
   3680 193.298 480.227
   3776 212.160 472.675
   3872 229.596 470.947
   3968 246.403 472.280
   4064 260.086 480.699

Same run with the clock at 1620:

% m=n=k gpu_time (ms) flops (Gflops/s)
   32 0.040 1.516
   128 0.108 36.044
   224 0.173 120.900
   320 0.265 229.925
   416 0.393 341.338
   512 0.535 467.090
   608 1.107 378.021
   704 1.371 474.163
   800 2.270 420.030
   896 2.751 486.983
   992 3.804 477.992
   1088 5.205 460.925
   1184 6.003 514.983
   1280 7.609 513.393
   1376 9.396 516.463
   1472 11.555 514.134
   1568 14.145 507.666
   1664 16.427 522.442
   1760 19.387 523.784
   1856 22.182 536.854
   1952 26.860 515.777
   2048 28.642 558.623
   2144 34.530 531.627
   2240 39.585 528.868
   2336 44.440 534.292
   2432 49.141 545.226
   2528 55.274 544.429
   2624 63.241 532.134
   2720 68.451 547.592
   2816 74.160 560.865
   2912 82.945 554.516
   3008 94.150 538.449
   3104 104.581 532.653
   3200 108.907 560.436
   3296 119.277 559.158
   3392 131.982 550.785
   3488 146.003 541.376
   3584 154.088 556.502
   3680 166.307 558.166
   3776 184.523 543.469
   3872 198.692 544.196
   3968 214.158 543.390
   4064 223.720 558.838

It's a cubin, so it will not work on Fermi.
5850 at stock clocks:

flopspeak.exe
Device 0
target 8
localRAM 1024 MB
uncachedRemoteRAM 2047 MB
cachedRemoteRAM 2047 MB
engineClock 725 MHz
memoryClock 1000 MHz
wavefrontSize 64
numberOfSIMD 18
doublePrecision 1
localDataShare 1
globalDataShare 1
globalGPR 1
computeShader 1
memExport 1
pitch_alignment 256
surface_alignment 4096
Device 0: execution time 7913.45 ms, achieved 2041.80 gflops
Overclocked to 950 MHz:

flopspeak.exe

engineClock 950 MHz
memoryClock 1000 MHz

Device 0: execution time 6039.35 ms, achieved 2675.40 gflops

matmul.exe 2048 2048 100

Device 0: execution time 1415.08 ms, achieved 1214.06 gflops
Overclocked to 950 MHz:
Device 0: execution time 1114.06 ms, achieved 1542.09 gflops
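The Gflops figures above can be reproduced from the timings with the usual 2*n^3 operation count for an n x n matrix multiply. This is a sketch inferred from the printed numbers, not taken from either tool's source: the CUDA table seems to divide by 2^30, while the AMD matmul line (2048, 100 iterations) matches a division by 10^9. The function names are my own.

```c
#include <assert.h>

/* Floating-point operations for `iters` products of two n x n matrices:
   one multiply and one add per inner-product term. */
static double matmul_ops(double n, double iters) {
    assert(n > 0 && iters > 0);
    return 2.0 * n * n * n * iters;
}

/* Rate in "giga" operations per second, where `giga` is 1e9 or 2^30
   depending on the convention the benchmark uses. */
static double matmul_rate(double n, double iters, double time_ms, double giga) {
    assert(time_ms > 0 && giga > 0);
    return matmul_ops(n, iters) / (time_ms * 1e-3) / giga;
}
```

With these, matmul_rate(2048, 1, 33.317, 1073741824.0) comes out at about 480.2 (first GTX 275 table), and matmul_rate(2048, 100, 1415.08, 1e9) at about 1214.06 (5850 at stock clocks), so the two tools are not directly comparable without converting units.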

UPDATE 1:
Nvidia and ATI working together!
opencl.dll from ati sdk 2.01

Found 2 platform(s).
platform[01104BA0]: profile: FULL_PROFILE
platform[01104BA0]: version: OpenCL 1.0 CUDA 3.0.1
platform[01104BA0]: name: NVIDIA CUDA
platform[01104BA0]: vendor: NVIDIA Corporation
platform[01104BA0]: extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
platform[01104BA0]: Found 1 device(s).
   device[01104C08]: NAME: GeForce GTX 275
   device[01104C08]: VENDOR: NVIDIA Corporation
   device[01104C08]: PROFILE: FULL_PROFILE
   device[01104C08]: VERSION: OpenCL 1.0 CUDA
   device[01104C08]: EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
   device[01104C08]: DRIVER_VERSION: 196.75

   device[01104C08]: Type: GPU
   device[01104C08]: EXECUTION_CAPABILITIES: Kernel
   device[01104C08]: GLOBAL_MEM_CACHE_TYPE: None (0)
   device[01104C08]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
   device[01104C08]: SINGLE_FP_CONFIG: 0x3e
   device[01104C08]: QUEUE_PROPERTIES: 0x3

   device[01104C08]: VENDOR_ID: 4318
   device[01104C08]: MAX_COMPUTE_UNITS: 30
   device[01104C08]: MAX_WORK_ITEM_DIMENSIONS: 3
   device[01104C08]: MAX_WORK_GROUP_SIZE: 512
   device[01104C08]: PREFERRED_VECTOR_WIDTH_CHAR: 1
   device[01104C08]: PREFERRED_VECTOR_WIDTH_SHORT: 1
   device[01104C08]: PREFERRED_VECTOR_WIDTH_INT: 1
   device[01104C08]: PREFERRED_VECTOR_WIDTH_LONG: 1
   device[01104C08]: PREFERRED_VECTOR_WIDTH_FLOAT: 1
   device[01104C08]: PREFERRED_VECTOR_WIDTH_DOUBLE: 1
   device[01104C08]: MAX_CLOCK_FREQUENCY: 1404
   device[01104C08]: ADDRESS_BITS: 32
   device[01104C08]: MAX_MEM_ALLOC_SIZE: 229998592
   device[01104C08]: IMAGE_SUPPORT: 1
   device[01104C08]: MAX_READ_IMAGE_ARGS: 128
   device[01104C08]: MAX_WRITE_IMAGE_ARGS: 8
   device[01104C08]: IMAGE2D_MAX_WIDTH: 8192
   device[01104C08]: IMAGE2D_MAX_HEIGHT: 8192
   device[01104C08]: IMAGE3D_MAX_WIDTH: 2048
   device[01104C08]: IMAGE3D_MAX_HEIGHT: 2048
   device[01104C08]: IMAGE3D_MAX_DEPTH: 2048
   device[01104C08]: MAX_SAMPLERS: 16
   device[01104C08]: MAX_PARAMETER_SIZE: 4352
   device[01104C08]: MEM_BASE_ADDR_ALIGN: 256
   device[01104C08]: MIN_DATA_TYPE_ALIGN_SIZE: 16
   device[01104C08]: GLOBAL_MEM_CACHELINE_SIZE: 0
   device[01104C08]: GLOBAL_MEM_CACHE_SIZE: 0
   device[01104C08]: GLOBAL_MEM_SIZE: 919994368
   device[01104C08]: MAX_CONSTANT_BUFFER_SIZE: 65536
   device[01104C08]: MAX_CONSTANT_ARGS: 9
   device[01104C08]: LOCAL_MEM_SIZE: 16384
   device[01104C08]: ERROR_CORRECTION_SUPPORT: 0
   device[01104C08]: PROFILING_TIMER_RESOLUTION: 1000
   device[01104C08]: ENDIAN_LITTLE: 1
   device[01104C08]: AVAILABLE: 1
   device[01104C08]: COMPILER_AVAILABLE: 1
platform[0313A434]: profile: FULL_PROFILE
platform[0313A434]: version: OpenCL 1.0 ATI-Stream-v2.0.1
platform[0313A434]: name: ATI Stream
platform[0313A434]: vendor: Advanced Micro Devices, Inc.
platform[0313A434]: extensions: cl_khr_icd
platform[0313A434]: Found 2 device(s).
   device[0338CA70]: NAME: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
   device[0338CA70]: VENDOR: GenuineIntel
   device[0338CA70]: PROFILE: FULL_PROFILE
   device[0338CA70]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
   device[0338CA70]: EXTENSIONS: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
   device[0338CA70]: DRIVER_VERSION: 1.0

   device[0338CA70]: Type: CPU
   device[0338CA70]: EXECUTION_CAPABILITIES: Kernel
   device[0338CA70]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
   device[0338CA70]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
   device[0338CA70]: SINGLE_FP_CONFIG: 0x7
   device[0338CA70]: QUEUE_PROPERTIES: 0x2

   device[0338CA70]: VENDOR_ID: 4098
   device[0338CA70]: MAX_COMPUTE_UNITS: 8
   device[0338CA70]: MAX_WORK_ITEM_DIMENSIONS: 3
   device[0338CA70]: MAX_WORK_GROUP_SIZE: 1024
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_CHAR: 16
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_SHORT: 8
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_INT: 4
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_LONG: 2
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
   device[0338CA70]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
   device[0338CA70]: MAX_CLOCK_FREQUENCY: 2698
   device[0338CA70]: ADDRESS_BITS: 32
   device[0338CA70]: MAX_MEM_ALLOC_SIZE: 536870912
   device[0338CA70]: IMAGE_SUPPORT: 0
   device[0338CA70]: MAX_READ_IMAGE_ARGS: 0
   device[0338CA70]: MAX_WRITE_IMAGE_ARGS: 0
   device[0338CA70]: IMAGE2D_MAX_WIDTH: 0
   device[0338CA70]: IMAGE2D_MAX_HEIGHT: 0
   device[0338CA70]: IMAGE3D_MAX_WIDTH: 0
   device[0338CA70]: IMAGE3D_MAX_HEIGHT: 0
   device[0338CA70]: IMAGE3D_MAX_DEPTH: 0
   device[0338CA70]: MAX_SAMPLERS: 0
   device[0338CA70]: MAX_PARAMETER_SIZE: 4096
   device[0338CA70]: MEM_BASE_ADDR_ALIGN: 32768
   device[0338CA70]: MIN_DATA_TYPE_ALIGN_SIZE: 128
   device[0338CA70]: GLOBAL_MEM_CACHELINE_SIZE: 64
   device[0338CA70]: GLOBAL_MEM_CACHE_SIZE: 65536
   device[0338CA70]: GLOBAL_MEM_SIZE: 1073741824
   device[0338CA70]: MAX_CONSTANT_BUFFER_SIZE: 65536
   device[0338CA70]: MAX_CONSTANT_ARGS: 8
   device[0338CA70]: LOCAL_MEM_SIZE: 32768
   device[0338CA70]: ERROR_CORRECTION_SUPPORT: 0
   device[0338CA70]: PROFILING_TIMER_RESOLUTION: 1
   device[0338CA70]: ENDIAN_LITTLE: 1
   device[0338CA70]: AVAILABLE: 1
   device[0338CA70]: COMPILER_AVAILABLE: 1
   device[04A30050]: NAME: Cypress
   device[04A30050]: VENDOR: Advanced Micro Devices, Inc.
   device[04A30050]: PROFILE: FULL_PROFILE
   device[04A30050]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
   device[04A30050]: EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
   device[04A30050]: DRIVER_VERSION: CAL 1.4.556

   device[04A30050]: Type: GPU
   device[04A30050]: EXECUTION_CAPABILITIES: Kernel
   device[04A30050]: GLOBAL_MEM_CACHE_TYPE: None (0)
   device[04A30050]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
   device[04A30050]: SINGLE_FP_CONFIG: 0x6
   device[04A30050]: QUEUE_PROPERTIES: 0x2

   device[04A30050]: VENDOR_ID: 4098
   device[04A30050]: MAX_COMPUTE_UNITS: 18
   device[04A30050]: MAX_WORK_ITEM_DIMENSIONS: 3
   device[04A30050]: MAX_WORK_GROUP_SIZE: 256
   device[04A30050]: PREFERRED_VECTOR_WIDTH_CHAR: 16
   device[04A30050]: PREFERRED_VECTOR_WIDTH_SHORT: 8
   device[04A30050]: PREFERRED_VECTOR_WIDTH_INT: 4
   device[04A30050]: PREFERRED_VECTOR_WIDTH_LONG: 2
   device[04A30050]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
   device[04A30050]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
   device[04A30050]: MAX_CLOCK_FREQUENCY: 725
   device[04A30050]: ADDRESS_BITS: 32
   device[04A30050]: MAX_MEM_ALLOC_SIZE: 268435456
   device[04A30050]: IMAGE_SUPPORT: 0
   device[04A30050]: MAX_READ_IMAGE_ARGS: 0
   device[04A30050]: MAX_WRITE_IMAGE_ARGS: 0
   device[04A30050]: IMAGE2D_MAX_WIDTH: 0
   device[04A30050]: IMAGE2D_MAX_HEIGHT: 0
   device[04A30050]: IMAGE3D_MAX_WIDTH: 0
   device[04A30050]: IMAGE3D_MAX_HEIGHT: 0
   device[04A30050]: IMAGE3D_MAX_DEPTH: 0
   device[04A30050]: MAX_SAMPLERS: 0
   device[04A30050]: MAX_PARAMETER_SIZE: 1024
   device[04A30050]: MEM_BASE_ADDR_ALIGN: 4096
   device[04A30050]: MIN_DATA_TYPE_ALIGN_SIZE: 128
   device[04A30050]: GLOBAL_MEM_CACHELINE_SIZE: 0
   device[04A30050]: GLOBAL_MEM_CACHE_SIZE: 0
   device[04A30050]: GLOBAL_MEM_SIZE: 268435456
   device[04A30050]: MAX_CONSTANT_BUFFER_SIZE: 65536
   device[04A30050]: MAX_CONSTANT_ARGS: 8
   device[04A30050]: LOCAL_MEM_SIZE: 32768
   device[04A30050]: ERROR_CORRECTION_SUPPORT: 0
   device[04A30050]: PROFILING_TIMER_RESOLUTION: 1
   device[04A30050]: ENDIAN_LITTLE: 1
   device[04A30050]: AVAILABLE: 1
   device[04A30050]: COMPILER_AVAILABLE: 1
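The SINGLE_FP_CONFIG values in the dump are bitfields. A small decoder, sketched below, makes the three devices easier to compare; the bit positions are the cl_device_fp_config values from cl.h as I remember them for OpenCL 1.0, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the names of the bits set in a cl_device_fp_config value into out,
   separated by spaces. Bit order assumed from cl.h (OpenCL 1.0):
   bit 0 DENORM, 1 INF_NAN, 2 ROUND_TO_NEAREST, 3 ROUND_TO_ZERO,
   4 ROUND_TO_INF, 5 FMA. */
static void describe_fp_config(unsigned config, char *out, size_t out_size) {
    static const char *names[] = {"DENORM", "INF_NAN", "ROUND_TO_NEAREST",
                                  "ROUND_TO_ZERO", "ROUND_TO_INF", "FMA"};
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (int bit = 0; bit < 6; ++bit) {
        if (config & (1u << bit)) {
            size_t used = strlen(out);
            /* Prefix a space for every name after the first one. */
            snprintf(out + used, out_size - used, "%s%s",
                     used ? " " : "", names[bit]);
        }
    }
}
```

Under that assumption, 0x3e (GTX 275) reads as everything except denormals, 0x7 (the CPU device) as denormals, INF/NaN and round-to-nearest, and 0x6 (Cypress) as INF/NaN and round-to-nearest only.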
UPDATE 2:
DX formats included in optd3d

GPGPU Image support!

Posted on 08:44 by Unknown

1. D3D
In the docs there is a table, "Hardware Support for Direct3D 11 Formats":

Format(DXGI_FORMAT_*)	# Bits	Format Target
Format(DXGI_FORMAT_*)	# Bits	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38
UNKNOWN	0	X																				X										X
R32G32B32A32_TYPELESS	128					X	X	X	X							X																X							X
R32G32B32A32_FLOAT	128	X	X		X	X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	o	o	X	X		X
R32G32B32A32_UINT	128	X	X		X	X	X	X	X	X						X		X					X	X								X	X	o	o		X		X
R32G32B32A32_SINT	128	X	X		X	X	X	X	X	X						X		X					X	X								X	X	o	o		X		X
R32G32B32_TYPELESS	96					X	X	X	X							X																X							X
R32G32B32_FLOAT	96	X	X		X	X	X	X	X	X	o			o		X	o	o	o¹													X	X	X	o	X	X		X
R32G32B32_UINT	96	X	X		X	X	X	X	X	X						X		o														X	X	X	o		X		X
R32G32B32_SINT	96	X	X		X	X	X	X	X	X						X		o														X	X	X	o		X		X
R16G16B16A16_TYPELESS	64					X	X	X	X							X																X							X
R16G16B16A16_FLOAT	64	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X	X	X
R16G16B16A16_UNORM	64	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16G16B16A16_UINT	64	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R16G16B16A16_SNORM	64	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16G16B16A16_SINT	64	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R32G32_TYPELESS	64					X	X	X	X							X																X							X
R32G32_FLOAT	64	X	X		X	X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R32G32_UINT	64	X	X		X	X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R32G32_SINT	64	X	X		X	X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R32G8X24_TYPELESS	64					X	X		X							X																X							X
D32_FLOAT_S8X24_UINT	64					X	X		X							X				X												X	X	X	o				X
R32_FLOAT_X8X24_TYPELESS	64					X	X		X	X	X	X		X	X	X																X					X		X
X32_TYPELESS_G8X24_UINT	64					X	X		X	X						X																X					X		X
R10G10B10A2_TYPELESS	32					X	X	X	X							X																X							X
R10G10B10A2_UNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X	X	X
R10G10B10A2_UINT	32	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R10G10B10_XR_BIAS_A2_UNORM	32						X																									X						X	X
R11G11B10_FLOAT	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X
R8G8B8A8_TYPELESS	32					X	X	X	X							X																X							X
R8G8B8A8_UNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X	X	X
R8G8B8A8_UNORM_SRGB	32					X	X	X	X	X	X			X		X	X	X	X													X	X	X	o	X	X	X	X
R8G8B8A8_UINT	32	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R8G8B8A8_SNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R8G8B8A8_SINT	32	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R16G16_TYPELESS	32					X	X	X	X							X																X							X
R16G16_FLOAT	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16G16_UNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16G16_UINT	32	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R16G16_SNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16G16_SINT	32	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R32_TYPELESS	32					X	X	X	X							X					X											X							X
D32_FLOAT	32					X	X		X							X				X												X	X	X	o				X
R32_FLOAT	32	X	X		X	X	X	X	X	X	X	X		X	X	X	X	X	X				X	X	X				X			X	X	X	o	X	X		X
R32_UINT	32	X	X	X	X	X	X	X	X	X						X		X					X	X	X	X	X	X	X	X	X	X	X	X	o		X		X
R32_SINT	32	X	X		X	X	X	X	X	X						X		X					X	X	X	X	X	X	X	X	X	X	X	X	o		X		X
R24G8_TYPELESS	32					X	X		X							X																X							X
D24_UNORM_S8_UINT	32					X	X		X							X				X												X	X	X	o				X
R24_UNORM_X8_TYPELESS	32					X	X		X	X	X	X		X	X	X																X					X		X
X24_TYPELESS_G8_UINT	32					X	X		X	X						X																X					X		X
R8G8_TYPELESS	16					X	X	X	X							X																X							X
R8G8_UNORM	16	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R8G8_UINT	16	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R8G8_SNORM	16	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R8G8_SINT	16	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R16_TYPELESS	16					X	X	X	X							X																X							X
R16_FLOAT	16	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
D16_UNORM	16					X	X		X							X				X												X	X	X	o				X
R16_UNORM	16	X	X			X	X	X	X	X	X	X		X	X	X	X	X	X				X	X								X	X	X	o	X	X		X
R16_UINT	16	X	X	X		X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R16_SNORM	16	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R16_SINT	16	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R8_TYPELESS	8					X	X	X	X							X																X							X
R8_UNORM	8	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R8_UINT	8	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
R8_SNORM	8	X	X			X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X		X
R8_SINT	8	X	X			X	X	X	X	X						X		X					X	X								X	X	X	o		X		X
A8_UNORM	8					X	X	X	X	X	X			X		X	X	X	X				X	X								X	X	X	o	X	X
R9G9B9E5_SHAREDEXP	32					X	X	X	X	X	X			X		X																X
R8G8_B8G8_UNORM	16					X	X	X	X	X	X			X		X																X
G8R8_G8B8_UNORM	16					X	X	X	X	X	X			X		X																X
BC1_TYPELESS	4						X	X	X							X																X							X
BC1_UNORM	4						X	X	X	X	X			X		X																X							X
BC1_UNORM_SRGB	4						X	X	X	X	X			X		X																X							X
BC2_TYPELESS	8						X	X	X							X																X							X
BC2_UNORM	8						X	X	X	X	X			X		X																X							X
BC2_UNORM_SRGB	8						X	X	X	X	X			X		X																X							X
BC3_TYPELESS	8						X	X	X							X																X							X
BC3_UNORM	8						X	X	X	X	X			X		X																X							X
BC3_UNORM_SRGB	8						X	X	X	X	X			X		X																X							X
BC4_TYPELESS	4						X	X	X							X																X							X
BC4_UNORM	4						X	X	X	X	X			X		X																X							X
BC4_SNORM	4						X	X	X	X	X			X		X																X							X
BC5_TYPELESS	8						X	X	X							X																X							X
BC5_UNORM	8						X	X	X	X	X			X		X																X							X
BC5_SNORM	8						X	X	X	X	X			X		X																X							X
B8G8R8A8_TYPELESS	32					X	X	X	X							X																X							X
B8G8R8A8_UNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X													X	X	X	o	X	X	X	X
B8G8R8A8_UNORM_SRGB	32					X	X	X	X	X	X			X		X	X	X	X													X	X	X	o	X	X	X	X
B8G8R8X8_TYPELESS	32					X	X	X	X							X																X							X
B8G8R8X8_UNORM	32	X	X			X	X	X	X	X	X			X		X	X	X	X													X	X	X	o	X	X		X
B8G8R8X8_UNORM_SRGB	32					X	X	X	X	X	X			X		X	X	X	X													X	X	X	o	X	X		X
BC6H_TYPELESS	8						X	X	X							X																X							X
BC6H_UF16	8						X	X	X	X	X			X		X																X							X
BC6H_SF16	8						X	X	X	X	X			X		X																X							X
BC7_TYPELESS	8						X	X	X							X																X							X
BC7_UNORM	8						X	X	X	X	X			X		X																X							X
BC7_UNORM_SRGB	8						X	X	X	X	X			X		X																X							X

Key to the numbered target columns:
1 Buffer
2 Input Assembler Vertex Buffer
3 Input Assembler Index Buffer
4 Stream Output Buffer
5 Texture1D
6 Texture2D
7 Texture3D
8 TextureCube
9 Shader ld
10 Shader sample (any filter)
11 Shader sample_c (comparison filter)
12 Shader sample (mono 1-bit filter)
13 Shader gather4
14 Shader gather4_c
15 Mipmap
16 Mipmap Auto-Generation
17 RenderTarget
18 Blendable RenderTarget
19 Depth/Stencil Target
20 Raw UAV and SRV
21 Structured UAV and SRV
22 Typed UAV
23 UAV Typed Store
24 UAV Typed Load
25 UAV Atomic Add
26 UAV Atomic Bitwise Ops
27 UAV Atomic Cmp Store or Cmp Exch
28 UAV Atomic Exchange
29 UAV Atomic Signed Min or Max
30 UAV Atomic Unsigned Min or Max
31 CPU Lockable
32 4x Multisample RenderTarget
33 8x Multisample RenderTarget
34 Other Multisample Count RT
35 Multisample Resolve
36 Multisample Load
37 Display Scan-Out
38 Cast Within Bit Layout

The API for querying supported formats is ID3D11Device::CheckFormatSupport..

It would be good to write a program that checks which formats AMD and Nvidia support..

In CUDA 3.0 you have:

1, 2 or 4 components:
*Signed or unsigned 8-, 16- or 32-bit integers (18 formats)
*16-bit floats (currently only supported through the driver API) or 32-bit floats (6 formats)
24 texture formats in total

For CUDA-GL interop (from forums):

works for FP textures:
XXXX = R, RG, RGB or RGBA
YY = 16 or 32

i.e. 8 FP formats

works for integer textures:
XXXX = R, RG, RGB or RGBA
YY = 8, 16 or 32
ZZ = I or UI

i.e. 24 integer formats
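To double-check those counts, the placeholders can simply be expanded. The sketch below builds the internal-format names for both families (GL_ + channels + bits + F, and GL_ + channels + bits + I/UI); the function names are hypothetical, and it only generates strings, it does not test what the driver accepts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char *const channel_names[] = {"R", "RG", "RGB", "RGBA"};

/* Floating-point internal formats: GL_<channels><16|32>F.
   Writes up to `max` names into out and returns how many were written. */
static int fp_format_names(char out[][16], int max) {
    static const char *const bits[] = {"16", "32"};
    int n = 0;
    assert(out != NULL && max >= 0);
    for (int c = 0; c < 4; ++c)
        for (int b = 0; b < 2; ++b)
            if (n < max)
                snprintf(out[n++], 16, "GL_%s%sF", channel_names[c], bits[b]);
    return n;
}

/* Integer internal formats: GL_<channels><8|16|32><I|UI>. */
static int int_format_names(char out[][16], int max) {
    static const char *const bits[] = {"8", "16", "32"};
    static const char *const sign[] = {"I", "UI"};
    int n = 0;
    assert(out != NULL && max >= 0);
    for (int c = 0; c < 4; ++c)
        for (int b = 0; b < 3; ++b)
            for (int s = 0; s < 2; ++s)
                if (n < max)
                    snprintf(out[n++], 16, "GL_%s%s%s",
                             channel_names[c], bits[b], sign[s]);
    return n;
}
```

Called with buffers of 8 and 24 entries, the two functions fill them exactly, from GL_R16F through GL_RGBA32F and from GL_R8I through GL_RGBA32UI.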

Depth renderbuffers don't work; I don't know whether color renderbuffers work, but I assume they do, at least for CUDA 3.0 final..

use:
glGenTextures(1,&tex);
glBindTexture(GL_TEXTURE_2D , tex);
glTexImage2D(GL_TEXTURE_2D , 0 , GL_XXXXYYF , width , height , 0 , GL_RGBA , GL_FLOAT , 0);
glTexParameteri(GL_TEXTURE_2D , GL_TEXTURE_MIN_FILTER , GL_NEAREST);
cudaGraphicsGLRegisterImage (&resource , tex , GL_TEXTURE_2D , cudaGraphicsMapFlagsNone);
for integer textures change the glTexImage2D call to:
glTexImage2D(GL_TEXTURE_2D , 0 , GL_XXXXYYZZ , width , height , 0 , GL_RGBA_INTEGER , GL_UNSIGNED_BYTE , 0);
notes:
Note that it is important to set the minification filter to GL_NEAREST.
In conclusion, it looks like the cudaGraphicsGL interface is working for most formats, excluding normalized internal formats such as the commonly used GL_RGBA8 format.

CUDA-GL interop allows RGB textures, although going by the docs CUDA itself does not seem to!

OCL DX interop for Nvidia:

------------------------------------------------------------------
   DXGI Format cl_channel_order cl_channel_type
   ------------------------------ ---------------- ---------------
   DXGI_FORMAT_R32G32B32A32_FLOAT CL_RGBA CL_FLOAT
   DXGI_FORMAT_R32G32B32A32_UINT CL_RGBA CL_UNSIGNED_INT32
   DXGI_FORMAT_R32G32B32A32_SINT CL_RGBA CL_SIGNED_INT32
   DXGI_FORMAT_R16G16B16A16_FLOAT CL_RGBA CL_HALF_FLOAT
   DXGI_FORMAT_R16G16B16A16_UNORM CL_RGBA CL_UNORM_INT16
   DXGI_FORMAT_R16G16B16A16_UINT CL_RGBA CL_UNSIGNED_INT16
   DXGI_FORMAT_R16G16B16A16_SNORM CL_RGBA CL_SNORM_INT16
   DXGI_FORMAT_R16G16B16A16_SINT CL_RGBA CL_SIGNED_INT16
   DXGI_FORMAT_R8G8B8A8_UNORM CL_RGBA CL_UNORM_INT8
   DXGI_FORMAT_R8G8B8A8_UINT CL_RGBA CL_UNSIGNED_INT8
   DXGI_FORMAT_R8G8B8A8_SNORM CL_RGBA CL_SNORM_INT8
   DXGI_FORMAT_R8G8B8A8_SINT CL_RGBA CL_SIGNED_INT8
   DXGI_FORMAT_R32G32_FLOAT CL_RG CL_FLOAT
   DXGI_FORMAT_R32G32_UINT CL_RG CL_UNSIGNED_INT32
   DXGI_FORMAT_R32G32_SINT CL_RG CL_SIGNED_INT32
   DXGI_FORMAT_R16G16_FLOAT CL_RG CL_HALF_FLOAT
   DXGI_FORMAT_R16G16_UNORM CL_RG CL_UNORM_INT16
   DXGI_FORMAT_R16G16_UINT CL_RG CL_UNSIGNED_INT16
   DXGI_FORMAT_R16G16_SNORM CL_RG CL_SNORM_INT16
   DXGI_FORMAT_R16G16_SINT CL_RG CL_SIGNED_INT16
   DXGI_FORMAT_R8G8_UNORM CL_RG CL_UNORM_INT8
   DXGI_FORMAT_R8G8_UINT CL_RG CL_UNSIGNED_INT8
   DXGI_FORMAT_R8G8_SNORM CL_RG CL_SNORM_INT8
   DXGI_FORMAT_R8G8_SINT CL_RG CL_SIGNED_INT8
   DXGI_FORMAT_R32_FLOAT CL_R CL_FLOAT
   DXGI_FORMAT_R32_UINT CL_R CL_UNSIGNED_INT32
   DXGI_FORMAT_R32_SINT CL_R CL_SIGNED_INT32
   DXGI_FORMAT_R16_FLOAT CL_R CL_HALF_FLOAT
   DXGI_FORMAT_R16_UNORM CL_R CL_UNORM_INT16
   DXGI_FORMAT_R16_UINT CL_R CL_UNSIGNED_INT16
   DXGI_FORMAT_R16_SNORM CL_R CL_SNORM_INT16
   DXGI_FORMAT_R16_SINT CL_R CL_SIGNED_INT16
   DXGI_FORMAT_R8_UNORM CL_R CL_UNORM_INT8
   DXGI_FORMAT_R8_UINT CL_R CL_UNSIGNED_INT8
   DXGI_FORMAT_R8_SNORM CL_R CL_SNORM_INT8
   DXGI_FORMAT_R8_SINT CL_R CL_SIGNED_INT8

For the image formats OpenCL supports, see my program. It uses clGetSupportedImageFormats like this:

void getimageinfo(cl_context context, cl_mem_flags m, cl_mem_object_type te)
{
    /* Needs <stdio.h>, <stdlib.h> and the OpenCL header (CL/cl.h). */
    static const cl_channel_order orders[] = {CL_R, CL_A, CL_INTENSITY, CL_LUMINANCE, CL_RG, CL_RA, CL_RGB, CL_RGBA, CL_ARGB, CL_BGRA};
    static const char *order_names[] = {"CL_R", "CL_A", "CL_INTENSITY", "CL_LUMINANCE", "CL_RG", "CL_RA", "CL_RGB", "CL_RGBA", "CL_ARGB", "CL_BGRA"};
    static const cl_channel_type types[] = {
        CL_SNORM_INT8, CL_SNORM_INT16, CL_UNORM_INT8, CL_UNORM_INT16, CL_UNORM_SHORT_565, CL_UNORM_SHORT_555, CL_UNORM_INT_101010, CL_SIGNED_INT8,
        CL_SIGNED_INT16, CL_SIGNED_INT32, CL_UNSIGNED_INT8, CL_UNSIGNED_INT16, CL_UNSIGNED_INT32, CL_HALF_FLOAT, CL_FLOAT};
    static const char *type_names[] = {"CL_SNORM_INT8", "CL_SNORM_INT16", "CL_UNORM_INT8", "CL_UNORM_INT16", "CL_UNORM_SHORT_565", "CL_UNORM_SHORT_555", "CL_UNORM_INT_101010",
        "CL_SIGNED_INT8", "CL_SIGNED_INT16", "CL_SIGNED_INT32", "CL_UNSIGNED_INT8", "CL_UNSIGNED_INT16", "CL_UNSIGNED_INT32", "CL_HALF_FLOAT", "CL_FLOAT"};
    const cl_uint num_orders = sizeof(orders) / sizeof(orders[0]);
    const cl_uint num_types = sizeof(types) / sizeof(types[0]);
    cl_uint num_entries = 0, i, j;
    cl_image_format *image_formats;

    /* First call: ask only how many formats are supported. */
    cl_int status = clGetSupportedImageFormats(context, m, te, 0, NULL, &num_entries);
    if (status != CL_SUCCESS || num_entries == 0)
        return;

    image_formats = (cl_image_format *)malloc(num_entries * sizeof(cl_image_format));
    if (image_formats == NULL)
        return;

    /* Second call: fetch the list itself. */
    status = clGetSupportedImageFormats(context, m, te, num_entries, image_formats, NULL);
    if (status == CL_SUCCESS)
    {
        for (i = 0; i < num_entries; i++)
        {
            /* Map the enum values back to printable names. */
            const char *o = "unknown order", *t = "unknown type";
            for (j = 0; j < num_orders; j++)
                if (image_formats[i].image_channel_order == orders[j])
                    o = order_names[j];
            for (j = 0; j < num_types; j++)
                if (image_formats[i].image_channel_data_type == types[j])
                    t = type_names[j];
            printf("Format %u: %s, %s\n", i, o, t);
        }
    }
    free(image_formats);
}

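AMD and Nvidia each return the same list for every combination of arguments: cl_mem_flags set to read-only or write-only, and cl_mem_object_type set to 2D or 3D. Perhaps a write-only 3D query should report 0 formats, since writing to 3D images from a kernel needs the cl_khr_3d_image_writes extension?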

Nvidia:

Format 0: CL_R, CL_FLOAT
Format 1: CL_R, CL_HALF_FLOAT
Format 2: CL_R, CL_UNORM_INT8
Format 3: CL_R, CL_UNORM_INT16
Format 4: CL_R, CL_SNORM_INT16
Format 5: CL_R, CL_SIGNED_INT8
Format 6: CL_R, CL_SIGNED_INT16
Format 7: CL_R, CL_SIGNED_INT32
Format 8: CL_R, CL_UNSIGNED_INT8
Format 9: CL_R, CL_UNSIGNED_INT16
Format 10: CL_R, CL_UNSIGNED_INT32
Format 11: CL_A, CL_FLOAT
Format 12: CL_A, CL_HALF_FLOAT
Format 13: CL_A, CL_UNORM_INT8
Format 14: CL_A, CL_UNORM_INT16
Format 15: CL_A, CL_SNORM_INT16
Format 16: CL_A, CL_SIGNED_INT8
Format 17: CL_A, CL_SIGNED_INT16
Format 18: CL_A, CL_SIGNED_INT32
Format 19: CL_A, CL_UNSIGNED_INT8
Format 20: CL_A, CL_UNSIGNED_INT16
Format 21: CL_A, CL_UNSIGNED_INT32
Format 22: CL_RG, CL_FLOAT
Format 23: CL_RG, CL_HALF_FLOAT
Format 24: CL_RG, CL_UNORM_INT8
Format 25: CL_RG, CL_UNORM_INT16
Format 26: CL_RG, CL_SNORM_INT16
Format 27: CL_RG, CL_SIGNED_INT8
Format 28: CL_RG, CL_SIGNED_INT16
Format 29: CL_RG, CL_SIGNED_INT32
Format 30: CL_RG, CL_UNSIGNED_INT8
Format 31: CL_RG, CL_UNSIGNED_INT16
Format 32: CL_RG, CL_UNSIGNED_INT32
Format 33: CL_RA, CL_FLOAT
Format 34: CL_RA, CL_HALF_FLOAT
Format 35: CL_RA, CL_UNORM_INT8
Format 36: CL_RA, CL_UNORM_INT16
Format 37: CL_RA, CL_SNORM_INT16
Format 38: CL_RA, CL_SIGNED_INT8
Format 39: CL_RA, CL_SIGNED_INT16
Format 40: CL_RA, CL_SIGNED_INT32
Format 41: CL_RA, CL_UNSIGNED_INT8
Format 42: CL_RA, CL_UNSIGNED_INT16
Format 43: CL_RA, CL_UNSIGNED_INT32
Format 44: CL_RGBA, CL_FLOAT
Format 45: CL_RGBA, CL_HALF_FLOAT
Format 46: CL_RGBA, CL_UNORM_INT8
Format 47: CL_RGBA, CL_UNORM_INT16
Format 48: CL_RGBA, CL_SNORM_INT16
Format 49: CL_RGBA, CL_SIGNED_INT8
Format 50: CL_RGBA, CL_SIGNED_INT16
Format 51: CL_RGBA, CL_SIGNED_INT32
Format 52: CL_RGBA, CL_UNSIGNED_INT8
Format 53: CL_RGBA, CL_UNSIGNED_INT16
Format 54: CL_RGBA, CL_UNSIGNED_INT32
Format 55: CL_BGRA, CL_UNORM_INT8
Format 56: CL_BGRA, CL_SIGNED_INT8
Format 57: CL_BGRA, CL_UNSIGNED_INT8
Format 58: CL_ARGB, CL_UNORM_INT8
Format 59: CL_ARGB, CL_SIGNED_INT8
Format 60: CL_ARGB, CL_UNSIGNED_INT8
Format 61: CL_INTENSITY, CL_FLOAT
Format 62: CL_INTENSITY, CL_HALF_FLOAT
Format 63: CL_INTENSITY, CL_UNORM_INT8
Format 64: CL_INTENSITY, CL_UNORM_INT16
Format 65: CL_INTENSITY, CL_SNORM_INT16
Format 66: CL_LUMINANCE, CL_FLOAT
Format 67: CL_LUMINANCE, CL_HALF_FLOAT
Format 68: CL_LUMINANCE, CL_UNORM_INT8
Format 69: CL_LUMINANCE, CL_UNORM_INT16
Format 70: CL_LUMINANCE, CL_SNORM_INT16

AMD:

Format 0: CL_RGBA, CL_UNORM_INT8
Format 1: CL_RGBA, CL_UNORM_INT16
Format 2: CL_RGBA, CL_SIGNED_INT8
Format 3: CL_RGBA, CL_SIGNED_INT16
Format 4: CL_RGBA, CL_SIGNED_INT32
Format 5: CL_RGBA, CL_UNSIGNED_INT8
Format 6: CL_RGBA, CL_UNSIGNED_INT16
Format 7: CL_RGBA, CL_UNSIGNED_INT32
Format 8: CL_RGBA, CL_HALF_FLOAT
Format 9: CL_RGBA, CL_FLOAT
Format 10: CL_BGRA, CL_UNORM_INT8
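
Every one of the 11 AMD formats also shows up in the Nvidia listing, so the AMD list is the common subset of the two outputs. A quick way to check a format against it could look like this. It is a rough sketch: the formats are stored as text so it builds without the OpenCL headers, and the names fmt, amd_formats and in_amd_list are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One (channel order, channel data type) pair, kept as text. */
struct fmt { const char *order; const char *type; };

/* The 11 formats from the AMD listing above. */
static const struct fmt amd_formats[] = {
    {"CL_RGBA", "CL_UNORM_INT8"},
    {"CL_RGBA", "CL_UNORM_INT16"},
    {"CL_RGBA", "CL_SIGNED_INT8"},
    {"CL_RGBA", "CL_SIGNED_INT16"},
    {"CL_RGBA", "CL_SIGNED_INT32"},
    {"CL_RGBA", "CL_UNSIGNED_INT8"},
    {"CL_RGBA", "CL_UNSIGNED_INT16"},
    {"CL_RGBA", "CL_UNSIGNED_INT32"},
    {"CL_RGBA", "CL_HALF_FLOAT"},
    {"CL_RGBA", "CL_FLOAT"},
    {"CL_BGRA", "CL_UNORM_INT8"},
};

/* Returns 1 if the pair appears in amd_formats, 0 otherwise. */
int in_amd_list(const char *order, const char *type)
{
    size_t i;
    assert(order != NULL && type != NULL);
    for (i = 0; i < sizeof(amd_formats) / sizeof(amd_formats[0]); i++)
        if (strcmp(amd_formats[i].order, order) == 0 &&
            strcmp(amd_formats[i].type, type) == 0)
            return 1;
    return 0;
}
```

So in_amd_list("CL_RGBA", "CL_FLOAT") is 1, while in_amd_list("CL_R", "CL_FLOAT") is 0 even though Nvidia reports that format. The lists are only what these two drivers printed at the time, so query the real device before relying on them.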

OCL-GL interop: I don't know which formats are supported.
For Nvidia it is either the 71 formats above, the CUDA-GL supported formats, or the GL equivalents of the CUDA interop formats. I suspect the CL_RGB ones are supported.
For AMD it is either the CL image formats above or the CAL DX interop ones. I suspect the RGB formats.

AMD CAL:

CAL exposes textures, and CAL DX interop would be good to explore.
