Wishes for OpenCL 1.1 and more! ~ GPU computing: Stay up to date on OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Make the DirectCompute 5.0 hardware features core:
(posted at http://www.khronos.org/message_boards/viewtopic.php?f=41&t=2160)
*Atomics to global and local mem (the int32 base and extended extensions).
-> Once that is supported, it would be good to add:
*Append/consume buffers (see AMD stuff): a global queue/stack accessible with no hazards.
*Byte addressable support.
*Half support (cl_khr_fp16).
*Require that local mem is not of type global (as on 4xxx cards, due to their LDS write restrictions).
*Expanded DirectCompute 5.0 integer support (bit count, bit reverse, etc.).
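
To make the append/consume wish a bit more concrete, here is a minimal sketch in plain C11 (not OpenCL C) of how such a buffer can be built on top of one atomic counter. The names append_buffer, buffer_append, buffer_consume and QUEUE_CAPACITY are made up for this illustration and are not part of any API. It also assumes that appends and consumes happen in separate phases, never at the same time.

```c
#include <stdatomic.h>

#define QUEUE_CAPACITY 8  /* hypothetical fixed capacity */

/* Hypothetical append/consume buffer: storage plus one atomic counter. */
typedef struct {
    int items[QUEUE_CAPACITY];
    atomic_int count;  /* number of valid items currently stored */
} append_buffer;

void buffer_init(append_buffer *b) {
    atomic_init(&b->count, 0);
}

/* Append: an atomic increment reserves a unique slot for each caller,
   so two callers never write the same element.
   Returns 1 on success, 0 if the buffer is full. */
int buffer_append(append_buffer *b, int value) {
    int slot = atomic_fetch_add(&b->count, 1);
    if (slot >= QUEUE_CAPACITY) {
        atomic_fetch_sub(&b->count, 1);  /* undo the reservation */
        return 0;
    }
    b->items[slot] = value;
    return 1;
}

/* Consume: an atomic decrement hands each caller a distinct slot,
   most recently appended first (stack order).
   Returns 1 on success, 0 if the buffer is empty. */
int buffer_consume(append_buffer *b, int *out) {
    int slot = atomic_fetch_sub(&b->count, 1) - 1;
    if (slot < 0) {
        atomic_fetch_add(&b->count, 1);  /* undo the decrement */
        return 0;
    }
    *out = b->items[slot];
    return 1;
}
```

On a GPU the counter would live in global memory and every work-item would call the append or consume function; the atomic is what removes the hazard.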

Doubles (cl_khr_fp64) are an optional feature of compute shaders, as the 57xx cards prove, so no luck there.

Also, if they are not currently required, require:

*Image support for FULL profile.
*OpenGL interop for GPU devices.

Add as extensions or promote to core, depending on whether the support is AMD/Nvidia specific or multivendor:

* Multivendor:
*Add support for accessing system mem from GPU kernels:
that's currently supported on both Nvidia and AMD devices and exposed in CUDA 2.2 and up and in CAL:
so-called pinned system mem (in CUDA 2.2 for GT200 devices) and host mem export (AMD CAL).
*Implement DirectX interop (AMD ships a header).
*Getting info on integer support: whether there are native 24-bit int muls (CUDA devices before Fermi and AMD 5xxx (every ALU)) or int32 muls (Fermi, AMD 4xxx and 5xxx (only the 5th ALU)).
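
Why the 24-bit vs. int32 distinction matters can be shown with a small plain C sketch. It emulates what a 24-bit multiplier computes for unsigned operands, under the assumption that only the low 24 bits of each operand are used and the low 32 bits of the product are kept. The function names are hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Emulated 24-bit multiply (assumption: the hardware ignores the top
   8 bits of each operand and keeps the low 32 bits of the product). */
uint32_t mul24_emulated(uint32_t a, uint32_t b) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

/* Plain 32-bit multiply, for comparison. */
uint32_t mul32(uint32_t a, uint32_t b) {
    return a * b;
}
```

Both give the same result while the operands fit in 24 bits (1000 * 3000, for instance), but they diverge once an operand uses bit 24 or above. Knowing which multiplier is native lets a program pick the fast path when its values are known to be small.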

AMD proposed ones (some are said to be hardware features in the 5xxx press kit, some have 4xxx hardware support):

*Global Data Share and Wave sync support (GDS, etc.).
*Native SAD hardware support.
*Expose registers shared per SIMD (shared registers available in compute shaders in CAL, which allow doing reductions in a fixed number of steps, say 2 or 3, vs. log N).
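
The log N figure mentioned for reductions presumably refers to the usual pairwise tree reduction. Here is a minimal sequential sketch of it in plain C; tree_reduce_sum is a hypothetical name, and the input length is assumed to be a power of two and at least 1. It counts the steps so the log2(N) behaviour is visible.

```c
#include <stddef.h>

/* Sums data[0..n-1] in place with a pairwise tree reduction.
   Assumes n is a power of two and n >= 1.
   Stores the number of reduction steps in *steps and returns the sum. */
int tree_reduce_sum(int *data, size_t n, int *steps) {
    *steps = 0;
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* On a GPU every i would be a separate work-item,
           with a barrier between consecutive steps. */
        for (size_t i = 0; i < stride; ++i) {
            data[i] += data[i + stride];
        }
        ++*steps;
    }
    return data[0];
}
```

For 8 elements this takes 3 steps (strides 4, 2, 1). Registers shared across a SIMD would, per the wish above, cut that to a fixed small number of steps regardless of N.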

Nvidia ones:

*Improve the memory API to support the CUDA 2.2 mem improvements: expand support for creating "shared pinned buffers" (in CUDA parlance), i.e. buffers of host mem that are pinned and usable from multiple GPUs as pinned mem (using DMA),
and also shared pinned system mem.

*Expose image objects with partial simultaneous read/write support under strict limitations: exposing the current RWTexture abilities of Direct3D 11 and also those of the NV_texture_barrier OpenGL extension, which allows reading from an already bound FBO texture,
i.e. reading a texel before writing to that same texel.

*Expose interop with CUDA:
Code interop: support for interchanging PTX kernel code between CUDA functions and OpenCL functions with an identical name and arguments (signature), and using it when creating the program from binaries (clCreateProgramWithBinary).
Mem interop: ability to use mem buffers allocated from CUDA in OpenCL or vice versa.
This should directly allow supporting the proposed "shared pinned buffers".

*Fermi support. Provide new extensions supporting these features:

*Expose function pointer and stack support, which provides true function calls and recursion.
*Expose Fermi support for executing host code inside kernels.
*Expose Fermi support for allocating mem in kernels (malloc/free functions).
*Expose the C++ language in kernels (?)
*Expose expanded information on ECC support: say ECC protected registers and mem (local/global), and an ECC protected path between the GPU and the GDDR chips. Also, if possible, ECC code info: error detection capability (Fermi can detect 3 bits and has 1 bit recovery support for every xx bits).
*Perhaps add some exception support (assuming not full C++ support as in CUDA 3.0) for handling, or being notified of, unrecoverable errors (and where they occurred: in mem chips or registers) in kernel code. If that is not possible in kernel code, at least finish the kernel and return this info to the host via some mechanism.
*Perhaps add some info on where atomics are implemented, to know whether we can expect high performance or not (say, whether they are handled in L2/L3 caches (Fermi) or in memory controllers or compute units (ALUs) (pre-Fermi)).

Also, NVIDIA implements some features that require no extension to the OpenCL API, as the API model already allows them; it should be possible to query device info to find out whether they are available, along with other device support info:

For example, using multiple command_queues and event support on hardware that supports it:
*Concurrent mem/kernel exec: CUDA 1.1 devices (G9x, GT200, Fermi) and AMD (?)
*Concurrent kernel execution: Fermi (also AMD on 5xxx)
*Concurrent H2D and D2H: using Fermi twin DMA engines.

*Predication support (I have doubts?): equivalent to CMOV, avoiding use of the branching hardware. Basically, handling conditional code by executing both paths rather than branching (?)
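
As a rough illustration of the CMOV-style idea, here is a plain C sketch of a branchless select: both candidate values are computed up front and a mask picks one, so no jump depends on the condition. The function names are hypothetical, and whether a given compiler or GPU actually emits predicated instructions for this is not something the sketch can guarantee.

```c
#include <stdint.h>

/* Returns if_true when cond is nonzero, otherwise if_false,
   without branching on cond. */
int32_t select_branchless(int cond, int32_t if_true, int32_t if_false) {
    /* mask is all ones when cond is nonzero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (int32_t)(((uint32_t)if_true & mask) | ((uint32_t)if_false & ~mask));
}

/* Example use: clamp negative values to zero without an if statement. */
int32_t clamp_to_zero(int32_t x) {
    return select_branchless(x < 0, 0, x);
}
```

The cost is that both sides are always evaluated, which only pays off when the two paths are short.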


Saturday, 31 October 2009
