GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Monday, 4 March 2013

My wishes for OCL 2.0

Posted on 14:28 by Unknown
Hi,
I think it's time for publishing my OpenCL 2 requests so they maybe get considered for inclusion into it:
I'm not requesting what it hopefully will be in it like C++ extensions etc..
Depending on wheter they plan for support on existing GPUs or not will determine if some of these can be included. Anyway getting a cl_khr or cl_ext extension would be good..
but just before it a good remainder of thing that still are to being implemented..

*starting to see cl_ext_device_fission implemented on GPUs that should be doable on AMD 7xxx GPUs but on NV GK110 still not? even better seems new AMD Sea Islands should support partitioning in up to 8 sub GPUs..
*Implementing new OCL 1.2 extensions like graphic ones MSAA and depth access..

For OpenCL 2 would be good to have:
*Atomic counters (cl_ext_atomic_counters_32) in core.. they provide an order of magnitude improvement vs global atomics at least on old D3D11 HW (Fermi, AMD 5xxx series) and are foundation of HW accelerated queues.
*Kernels can send interrupts to CPU and/or initiate host system calls.. that seems coming for a while I think even Fermi whitepaper suggested that but still no avaiable.. AMD SI support SEND_MSG in ISA as Lottes suggest in his blog so AMD should be able too..
*warp/wavefront vote functions: this functions are in NV HW since GTX 2xx (2008) useful for example in currently most better dynamic mem allocator for GPUs see "Fast Dynamic Memory Allocator for Massively Parallel Architectures" they said:
"The used hardware must provide a voting function for an effi cient implementation" thus seems and OpenCL port will need exposure of that..
*Dynamic parallelism: well that should be expected also now that GK110 is shipping and also seems SI could support some limited form of it as shown in a ADFS session..
*Named barriers: Well this is shipping in CUDA since Fermi days and can be used for warp specialization like in CUDADMA project that can bring better memory bandwith explotation in some apps and also as shown in HPP study can bring support for "true function composability" i.e. GPU functions that use barriers can call other GPU functions that use barriers without breaking expected usage see HPP paper by Gaster et al.
*Crossvendor MultiGPU like CUDA P2P functionality: i.e. memory from one GPU addressable by other GPU directly from kernel without previous copy (also present in cl_amd_bus_addressable in AMD OCL)
*Exposing some common intra warp/wavefront ops? (like existing NV shuffle.. makes sense more like median, min/max could be beneficial for platforms like Xeon Phi but not on GPUs)
*Expose some cross vendor multimedia extension ISAs? i.e. some common cl_amd_media_ops/cl_amd_media_ops2 and ptx SIMD instructions.. this can be good jointly with interop with video encoders and encoders for accelerated video processing and even NV uses in their fast raytracing kernels..
*Finalize to bring parity vs exisiting compute exposure in graphics APIs like OGL 4.3/D3D 11 compute shaders: like said atomic counters where one thing..
->other being new gather4 instuctions..
->DispatchComputeIndirect: i.e. ability to launch kernel with size of workgroup total size being fetched from GPU mem.. it's more efficient for variable work kernels that depend on work generated by a previous kernel.. in this case we avoid a CPU trip but note that could be done with new Dynamic Parallelism so perhaps doesn't need to be exposed..
->Promote into core MSAA and depth extensions
->MipMap support like in CUDA 5
->compressed tex formats support
->a cross vendor extension for bindless support (assuming will get broad support in coming years)
->cross vendor ext for sparse texture/buffer support..


To finalize also exposing advanced control of ld/st operations such as cache modifiers and even using texture path (in GK110)..

Finally seems future GPUs could support unified register/local mem mem so explicit size control for better optimization could be good, also seems local mem could be allocated dynamicaly inside a kernel via extension to barrier function argument for better use of it so an extension to barrier operator could be good and also a scalar processor is present on recent archs so altough could be intended for executing common scalar code in kernel (extracted by a compiler) could also be exposed for direct programmability..

Coming not shortly(?) for me with atomic counter bringing possibly very fast queues and exposing all graphics functionality in kernels in OpenCL like said above primary targets to expose are rasterizer, z-buffer and rop functinality.. 
*the most interesting for me is exposing Z-buffer.. GPUDet papers shows an usage of it..
*exposing rasterizer what could be?: exposing perhaps via a generalized dynamic parallelism a funtion that takes a buffer or "geometry" to rasterize and some kernel that would be called in some specified grid size (tiles 8x8?) via dynamic parallelism.. all in all somewhat crazy seems..

More thoughts?
Email ThisBlogThis!Share to XShare to FacebookShare to Pinterest
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Porting CUDA to OpenCL!
    Well so you want to port CUDA code to OpenCL: you are in AMD GPU competition of porting Cuda codes to opencl (see previous post) or you are ...
  • Megapost!
    Today fools{ *GTX 485 is 512 cores 3gbytes gddr5 and 850/1750 shaders.. *ati 5990 has 4 gpus in board.. *bulldozer benchmarks }end fools.. A...
  • About ATI and Nvidia drivers (OCL included)!
    Hi I have been investigating AMD and Nvidia drivers.. for 10.3 there are 3d hooks support for 120hz monitors but is d3d9 d3d10 or d3d11 enab...
  • things found in CUDA forums
    Also some CUDA news: Mandelbulb stereo angalyph -> have to port to 3D Vision http://forums.nvidia.com/index.php?showtopic=150985&st=2...
  • opencl/opengl linux interop! seen in opencl cuda 3.0 sdk samples
    Following my OpenCL/OpenGL Window interop work: now has come to Linux  for Nvidia GPU computing registered developers via 195.17 driver! Als...
  • State of the blog..
    Sorry for the delay guys of posting code of Apple OpenCL demos port.. the blog has been with no updated for more than 2 weeks in this rapid ...
  • Optix and OpenCL SDKs with Visual Studio 2010
    Optix 1.0 ========= install cg download Cmake 2.80 cmake says error dumpbin not found and it is cuda doesn't work with vc2010 so copy pt...
  • CUDA 3.0 forums stuff!
    1.Getting CUBIN instead of ELF If you need the older text format, you can disable ELF cubins in nvcc.profile by changing "CUBINS_ARE_EL...
  • News from the web!
    Some things learned in AMD forums: 1.Why 3xxx no OpenCL: Compute shader mode is a hardware feature that did not exist in the HD38XX line of ...
  • Shaders: measuring perf, source translation and parsing different languages!
    Hi, I hope to be pretty exhaustive of options for parsing and translating between graphics and compute shaders ( some open source) For DX sh...

Blog Archive

  • ▼  2013 (5)
    • ►  September (1)
    • ▼  March (3)
      • 2013: a good year for new API revisions and launches?
      • Nvidia GTC thoughts: ARM,roadmap,demos..
      • My wishes for OCL 2.0
    • ►  February (1)
  • ►  2012 (1)
    • ►  December (1)
  • ►  2010 (46)
    • ►  July (4)
    • ►  May (1)
    • ►  April (3)
    • ►  March (9)
    • ►  February (15)
    • ►  January (14)
  • ►  2009 (125)
    • ►  December (51)
    • ►  November (53)
    • ►  October (21)
Powered by Blogger.

About Me

Unknown
View my complete profile