2013 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Wednesday, 25 September 2013

What I expect from GCN2.0 and APU13 conference..

Posted on 06:27 by Unknown

Hi,

Today is the big day for AMD GPUs this year, and it seems we are getting a big-die chip (big because it is still made on 28nm) to compete with Titan. Maybe it is only more ALUs and bandwidth plus some new ISA instructions, but since it is being named GCN 2.0 I expect at least the following not so commonly mentioned things. (Note that all of this also applies to Maxwell GPUs and, to some extent, to the upcoming Intel Gen8 and Gen9 GPUs. It's a shame there was no news on those at IDF this year; am I the only one who missed it?)

Docs: I expect a Volcanic Islands ISA guide, like the Bonaire ISA doc that was released earlier this year.

User software: a competitor to NV ShadowPlay

Dev software: As was said, I hope the new HSA SDK comes this year, at least in time for the APU13 conference. OCL 2.0 beta builds seem premature, but I hope the AMD OCL driver finally exposes SPIR plus the new 1.2 graphics extensions like the MSAA and depth ones, and perhaps even the CL 2.0 mipmap one before any CL 2.0 betas. I would also like to see HW GPU debugging of OCL kernels added to their software.

I expect some Media SDK release that includes VCE usage plus fast framebuffer capture, as was said at last year's HSA conference. So basically something similar to the Grid SDK; let's see if they make it public without NDAs.

GPU:

obvious(?): nested parallelism (aka CUDA dynamic parallelism), SVM (also coming to Maxwell)

not so obvious(? well, AMD put some of this in its roadmaps for 2014, so let's see):

HW GPU debugging: NV has had this since pre-Fermi days! Note that Linux support currently requires the latest sm_35 devices, which take advantage of the next feature:

HW context switching, so we avoid GPU resets for long shaders. We could get some GPU process explorer where we can kill processes, but perhaps this is better included as a GPU tab in Win9(?). It seems WDDM 1.2 lets the GPU driver report its context switch granularity, and I expect GCN 2.0 and Maxwell to expose granularity down to the HW ISA instruction (well

texture support: support the ASTC format

video decode: add 4K HEVC decode and expose it via a new DXVA 2.0 extension, and also via XvBA, VDPAU and VAAPI now that the radeon OSS drivers support video decode.

Note that in this case I also want VDPAU support from NV.

VCE: up to 4K encode (H.264 only, not HEVC, I expect), so that with the upcoming Nvidia ShadowPlay-like support for AMD GPUs in RadeonPro infinity we can capture

Mac world:

Note also that Mac OS currently lacks HW GPU debugging on NV GPUs in Nsight Eclipse for Mac. Let's see how it evolves, but I hope that Mac OS 11 or 10.10 adds support for all these new features: basically HEVC HW decode on the GPUs that support it, and I hope it finally exposes HW encode APIs publicly, as there is currently no Quick Sync or NVENC for Mac. I also hope for an ASTC OGL extension and some Xcode support for HW GPU debugging of OCL kernels on the new GPUs; that would be awesome.

As said, also add a GPU tab to the process explorer, with GPU process kill etc.

The related APU13 sessions are posted. I don't know if that is all of them; here I'm listing the ones I found interesting, almost all from AMD employees:

(Note that there also seems to be no OGL session this year, though one would be good, since we are still missing some new GPU features that will be implemented somewhat later; perhaps optimized sparse textures and AMD_sparse_buffer?)

The HSA System Architecture Requirements: An overview

AMD Neural Networks Library

HSA Compilers Technology

WT-4070 unity webcl and webgl dx11 level?

ANGLE and cross-platform WebGL support es 3.0 status

AMD Video Compression Engine: The Route towards Low-Latency Cloud Gaming Solutions

Optimizing FFMPEG and Handbrake Using OpenCL™ and Other AMD HW Capabilities

An Introduction to SPIR for OpenCL™ Application Developers and Compiler Developers

Accelerating and Evaluating OpenCL™ Graph Applications

Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms

Automated CUDA-to-OpenCL™ Translation with CU2CL: What’s Next?

Optimizing Raytracing on GCN with AMD Development Tools

Designing a game audio engine for HSA

A Crash Course on the AMD GCN Architecture

Direct Compute in Gaming

Accelerate Crypto Applications with AMD Platform Security Processor

AMD Content Solution Enablement Strategy

LibreOffice Calc Spreadsheet Formulae Optimizations using OpenCL

I expect from these sessions:

Optimizing Raytracing on GCN with AMD Development Tools: some library released to compete with Optix

AMD Neural Networks Library: more libraries to play with


Wednesday, 20 March 2013

2013: a good year for new API revisions and launches?

Posted on 11:05 by Unknown

Hi, I was thinking this could well be a year with a lot of new specs published related to the GPU world.
Just released:
*OpenMP 4.0 RC2, with accelerator device targets added
*OpenACC 2.0, which brings GK110 CUDA 5.0 features to the directives world
I expect OpenACC 2.0 support in the PGI implementation in June, which will also give support for MIC and AMD via OpenCL (so Intel HD5000 too in the Windows version?)
and perhaps also CAPS by summer.
Regarding OpenMP 4, I see the Intel Fortran update already has SIMD features, and perhaps there will be some beta by summer for Intel CPUs (and Xeon Phi?), with maybe production quality at SC13 in both the Intel 2014 compilers and some PGI beta.
Not so related to the HPC world, let's see:
*WebCL final spec should be coming soon; let's see how fast they progress. I would like to see optional 1.2 support with all the new SA2012 extensions for MSAA and depth sharing. What troubles me is the OpenCL extension mechanism for getting function addresses.
*WebGL 2.0 should be announced, exposing OGL ES 3.0 to the WebGL world. I think ANGLE support is lagging a year behind. I would still want a version with a "desktop profile", with core support and version number setting (like 4.2 core) and optional use of extensions if wanted.
*OpenCL 2.0 for Siggraph?
*DX 12? For the Windows Blue alpha this year at PDC (Oct-Nov 2013). I hope it is a major spec revision with new DirectCompute features (recursion, pointers, etc.).
*C++ AMP 2.0? A new C++ AMP with function calls, recursion, separate compilation and pointer support would be good.
Khronos is also expected to bring this year:
*StreamInput, for unified Leap Motion, Oculus and Kinect support among other things
*An OpenCV-like API
Also next year, with Maxwell expected to support DX12:
*OGL 5
*CUDA 6
and for mobile platforms we expect at least OGL ES 4.0.


Nvidia GTC thoughts: ARM,roadmap,demos..

Posted on 06:24 by Unknown

The last day's announcement of full OGL 4.3 support on Tegra products on ARM is huge.
I assume Nvidia will also bring all its supported extensions, such as the bindless extensions, VBUM (pointer support in GLSL), direct state access etc., and not only core OGL 4.3.
Related to that, Nvidia exposes some features that are not exposed in DX11, like support for writes to image2DMS objects, which would be like RWTexture2DMS if that existed in DX11 (note only RWTexture2D and Texture2DMS exist).
Nvidia showed this on Ubuntu, and one question is whether they will (or can?) bring all that goodness to the Android world. I suspect that, if not, they will at least bring it via a lot of extensions to the OGL ES driver, similar to what they are doing for Tegra 4 this year.
Also note that they showed the 319 driver series, and I hope this will bring EGL support to NV drivers, for Linux at least (related to the Mir efforts on Ubuntu), and not only EGL with OGL ES support like AMD currently has, but the full OGL profile in EGL, and as said for Windows too. Note I'm also expecting Optimus in the Linux world soon, so that could be the series.
Note that Nvidia showed a demo running Optix on ARM (their blog says it was a one-day port), and also CUDA 5.0 demos, and finally IslandGL. This is a port of one of the Fermi DX11 demos, Island, which is available for download as a demo from Nvidia and features heavy tessellation usage. With that it becomes clear why NV has renewed interest in bringing out an OGL 4.0 SDK. I hope almost all the techniques in the NV DX11 SDK get ported, stochastic transparency to name one.
With that, I recap the tooling support needed so we experience a seamless transition from the PC world to the Tegra world:
*Cg for ARM: bring Cg to the ARM world, or at least the offline GLSL 4.3 compiler that cgc has.
*OGL 4.0 SDK for ARM: as said, a new OGL SDK is in the works, and an ARM version would be good.
*OpenCL for ARM: Nvidia doesn't talk about it, but I think market pressure will force them to port it to ARM too. Note there is still no OCL 1.2 with the new extensions from the end of 2012.
*TXAA: with OGL 4.3 and AAA games becoming the norm in the Android and iOS markets, it is a matter of time until one wants the TXAA that Tegra 5 will have exposed in OGL.
*Optix: well, it was shown, so it is ported.
*PhysX and APEX: PhysX is already shipping for Android. It will be interesting to see if they enable GPU support once Tegra 5 ships with a CUDA GPU (I suspect yes; I'm expecting 128/192 shaders, which G80 also had, and they supported that). More interesting will be to see if APEX (right now Windows only) gets ported, as that brings turbulence and GPU rigid bodies with fracturing; the former is being used a lot lately in F2P games.

*Also note that this will bring full DX11 to Windows on ARM products, which is impressive. I'm not certain whether it is DX11.1 or not; the GPU shipping in Kayla could be GK208, a third-generation Kepler according to the press, so that feature could be in.

Lastly, and slightly unrelated, I would like NV to release its impressive demos of the last two or three years as executables so we can test them in house, like:
GTC 2010
Lighthouse
GTC 2012
Fracture raytraced demo
Optix water demo
GTC 2013
Wave works (perfect storm demo)
Face works

Some minor things to say about the roadmap:
*Maxwell PC GPU: well, they said it will have unified virtual addressing, and that Tegra 6 will have a Denver CPU and a Maxwell GPU. But will Maxwell desktop GPUs have a CPU? And in that case a Denver (i.e. some form of ARM64) one? I suspect yes, as otherwise it makes little sense to talk about unified virtual addressing. But why don't they say so directly for the Maxwell PC product? It also seems that if it's a Q1/Q2 2014 product it may be a bit premature.
They also forgot to mention context switching and preemption on the GPU, which I'm expecting too.
*Volta GPU: well, I expected it to be called Einstein, since after all that codename was mentioned by Dally. Could that be a later product, or is it an Echelon codename? Anyway, it would also add very efficient interconnects (perhaps 3-6X bytes/s/watt improvements).

It also seems a new DX is in the works after all, with the spec possibly published this year. The question is whether it will be in the first Maxwell products or not, i.e. in GM10x or GM11x products, as early 2014 seems a little soon. Perhaps the first Maxwell could arrive in H2 2014 after all, and that could bring a good release with:
*DX11.2 or DX12 support and equivalent OGL support (5.x?)
*Hierarchical two-level warp dispatcher
*Unified register/shared mem/L1 pool
*Scalar ALU next to the vector ALUs, like the AMD 7xxx series
*CPU with UVA access
*Context switching and GPU preemption


Monday, 4 March 2013

My wishes for OCL 2.0

Posted on 14:28 by Unknown

Hi,
I think it's time to publish my OpenCL 2 requests, so maybe they get considered for inclusion:
I'm not requesting what will hopefully be in it anyway, like C++ extensions etc.
Whether they plan to support existing GPUs or not will determine if some of these can be included. Anyway, getting a cl_khr or cl_ext extension would be good.
But just before that, a good reminder of things that are still to be implemented:

*Starting to see cl_ext_device_fission implemented on GPUs: that should be doable on AMD 7xxx GPUs, but on NV GK110 still not? Even better, it seems the new AMD Sea Islands should support partitioning into up to 8 sub-GPUs.
*Implementing the new OCL 1.2 extensions, like the graphics ones for MSAA and depth access.

For OpenCL 2 it would be good to have:
*Atomic counters (cl_ext_atomic_counters_32) in core. They provide an order of magnitude improvement vs global atomics, at least on old D3D11 HW (Fermi, AMD 5xxx series), and are the foundation of HW accelerated queues.
*Kernels that can send interrupts to the CPU and/or initiate host system calls. That seems to have been coming for a while (I think even the Fermi whitepaper suggested it) but is still not available. AMD SI supports SEND_MSG in its ISA, as Lottes suggests in his blog, so AMD should be able to do it too.
*Warp/wavefront vote functions: these functions have been in NV HW since GTX 2xx (2008). They are useful, for example, in what is currently the best dynamic memory allocator for GPUs; see "Fast Dynamic Memory Allocator for Massively Parallel Architectures", which says:
"The used hardware must provide a voting function for an efficient implementation", so it seems an OpenCL port will need that exposed.
*Dynamic parallelism: well, that should be expected too now that GK110 is shipping, and it also seems SI could support some limited form of it, as shown in an AFDS session.
*Named barriers: well, this has been shipping in CUDA since the Fermi days and can be used for warp specialization, as in the CUDADMA project, which can bring better memory bandwidth exploitation in some apps. As shown in the HPP study, it can also bring support for "true function composability", i.e. GPU functions that use barriers can call other GPU functions that use barriers without breaking the expected usage; see the HPP paper by Gaster et al.
*Cross-vendor multi-GPU support like CUDA P2P functionality, i.e. memory from one GPU directly addressable by another GPU from a kernel without a previous copy (also present as cl_amd_bus_addressable in AMD OCL)
*Exposing some common intra warp/wavefront ops? (like the existing NV shuffle; ops more like median or min/max could be beneficial for platforms like Xeon Phi but not on GPUs)
*Expose some cross-vendor multimedia ISA extensions? i.e. some common ground between cl_amd_media_ops/cl_amd_media_ops2 and the PTX SIMD instructions. This can be good jointly with interop with video decoders and encoders for accelerated video processing, and NV even uses them in their fast raytracing kernels.
*Finish bringing parity with the existing compute exposure in graphics APIs like OGL 4.3/D3D 11 compute shaders. As said, atomic counters were one thing,
->another being the new gather4 instructions
->DispatchComputeIndirect: i.e. the ability to launch a kernel with its total size (number of workgroups) fetched from GPU memory. It's more efficient for variable-work kernels that depend on work generated by a previous kernel, since we avoid a CPU round trip. Note that this could be done with the new dynamic parallelism, so perhaps it doesn't need to be exposed.
->Promote the MSAA and depth extensions into core
->Mipmap support like in CUDA 5
->compressed texture format support
->a cross-vendor extension for bindless support (assuming it will get broad support in the coming years)
->a cross-vendor extension for sparse texture/buffer support
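
To make a few of the items above concrete, here is a minimal CPU-side sketch in plain C++ of how a warp vote, an atomic counter and an indirect dispatch size fit together. It is only an emulation under stated assumptions: `kWarpSize`, `ballot`, `warp_append` and `run_demo` are hypothetical names invented for this illustration, a "warp" is just a loop over 32 lanes, and none of this is OpenCL, CUDA or any vendor API. The idea shown is that a vote lets a whole warp reserve queue slots with a single atomic add, and that the final counter value is exactly the number a following indirect dispatch would read from GPU memory.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Assumption: a 32-lane warp (NVIDIA-style); a GCN wavefront would be 64 lanes.
constexpr int kWarpSize = 32;

// Portable population count (number of set bits).
inline int popcount32(std::uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// CPU stand-in for a warp vote (ballot): bit i is set when lane i's predicate is true.
inline std::uint32_t ballot(const bool (&pred)[kWarpSize]) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane)
        if (pred[lane]) mask |= (1u << lane);
    return mask;
}

// One emulated warp appends the values of its active lanes to a shared queue.
// A single atomic add is issued per warp instead of one per lane.
inline void warp_append(const std::uint32_t (&value)[kWarpSize],
                        const bool (&wants_slot)[kWarpSize],
                        std::atomic<std::uint32_t>& counter,
                        std::vector<std::uint32_t>& queue) {
    const std::uint32_t mask = ballot(wants_slot);
    if (mask == 0) return;  // no lane needs a slot
    // On a GPU only one leader lane would perform this add.
    const std::uint32_t base =
        counter.fetch_add(static_cast<std::uint32_t>(popcount32(mask)));
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (!wants_slot[lane]) continue;
        // Active lanes with a lower index decide this lane's offset.
        const std::uint32_t below = mask & ((1u << lane) - 1u);
        queue[base + static_cast<std::uint32_t>(popcount32(below))] = value[lane];
    }
}

// Producer pass over four emulated warps. The returned count is what an
// indirect dispatch would use as the size of the consumer pass.
// The caller must size `queue` for the worst case (here 4 * kWarpSize).
inline std::uint32_t run_demo(std::vector<std::uint32_t>& queue) {
    std::atomic<std::uint32_t> counter{0};
    for (std::uint32_t warp = 0; warp < 4; ++warp) {
        std::uint32_t value[kWarpSize];
        bool wants_slot[kWarpSize];
        for (int lane = 0; lane < kWarpSize; ++lane) {
            value[lane] = warp * kWarpSize + static_cast<std::uint32_t>(lane);
            wants_slot[lane] = (value[lane] % 3 == 0);  // hypothetical filter
        }
        warp_append(value, wants_slot, counter, queue);
    }
    return counter.load();
}
```

Calling `run_demo` on a queue of 128 entries compacts the multiples of 3 below 128 into the front of the queue and returns how many there are. On real hardware the host would never read that count back; the next kernel launch would take its size straight from the counter, which is the CPU round trip that DispatchComputeIndirect avoids.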

To finalize, also exposing advanced control of ld/st operations, such as cache modifiers and even using the texture path (in GK110).

Finally, it seems future GPUs could support unified register/local memory, so explicit size control for better optimization could be good. It also seems local memory could be allocated dynamically inside a kernel via an extension to the barrier function's arguments for better use of it, so an extension to the barrier operator could be good. And a scalar processor is present on recent archs, so although it may be intended for executing common scalar code in a kernel (extracted by a compiler), it could also be exposed for direct programmability.

Coming not so shortly(?): for me, with atomic counters possibly bringing very fast queues, the next step is exposing all graphics functionality to kernels in OpenCL. As said above, the primary targets to expose are rasterizer, Z-buffer and ROP functionality.
*The most interesting for me is exposing the Z-buffer; the GPUDet paper shows a usage of it.
*Exposing the rasterizer, what could that be? Perhaps exposing, via a generalized dynamic parallelism, a function that takes a buffer or "geometry" to rasterize and some kernel that would be called over some specified grid size (8x8 tiles?) via dynamic parallelism. All in all it seems somewhat crazy.

More thoughts?


Thursday, 28 February 2013

What I'm expecting from GTC..

Posted on 17:37 by Unknown

Well, really I think I'm expecting too much, although in the form of a lot of minor improvements in its software products (so I'm not expecting info on new architectures (Maxwell) etc.). Anyway, I have compiled a list of things so I can check later whether NV is doing its work or not :-)
Of course it will still be good if all these pieces come into place over, say, H1 2013.

*nvfx: a new effects system, open source, with cross-vendor support etc. It was announced at Siggraph and has an empty github site, and there is also a talk at GTC, so there is no better place and moment to upload it to github. This system also uses a more efficient OGL state management extension called NV_state_object, better aligned with DX10-11 state management via objects, so it also seems more like DSA management.
*The only consumer HW info may be on the GK114/6 archs, which may bring some new things. Note that even Titan has no DX11.1 profile support, so I'm hoping for it before Maxwell; say GK114 and similar new 680 replacements must have it, so there is some minor arch enhancement on the graphics side.
There can also be one more thing, see the next point. To put it briefly: dynamic parallelism everywhere, and from anywhere to anywhere (ANYWHERE={CUDA,OGL}).
*New OGL extensions: NV_state_object (DX11-like state objects) and some kind of dynamic parallelism for graphics APIs.
This is interesting as there is a patent on it, and it's about exposing dynamic parallelism in the graphics world, which implies OGL in the near future, i.e. graphics shaders can create new draw calls and put them on the dispatch manager queue.
Also, for completeness, what's holding NV back from exposing launching graphics work from compute kernels and dispatching compute kernels from graphics shaders? Note that NV_state_object seems much needed in two cases (the CUDA->OGL and OGL->OGL dispatch draw call cases), as some state environment is needed there: CPU APIs are not useful since it's GPU work, and the default OGL state may not be useful.

Also, please upload documentation on NV_GPU_shader5_memory_extended, shipping in the 313 drivers, although I suspect it is for exposing cache modifiers on load/store operations that are already supported in CUDA, like load non-cached, load cached, etc.
Note my previous post asking NV to expose all compute functionality (ISA richness in this case) to OGL compute shaders, at least via the PTX ISA instructions now lacking and also via some asm() function, which is already reserved in GLSL and usable in OCL kernels on NV and even AMD! (This is new to me; I found it last month, and you can use AMDIL, although I haven't been able to get exposing the clock cycle counter to work yet.)
*Grid SDK: well, I'm interested in the frame capture APIs, not the cloud stuff. Related, I see OGL support for NVENC is being implemented, so some update for NVENC would be good.
*OpenGL SDK: well, one seems overdue (exposing advanced use cases of OGL 4.x features), and a tessellation sample was released early this year. A deferred+ sample would be good.
*Cg 3.2: I want GLSL 4.3 support integrated into Cg for some things I'm working on, and Cg 3.1 is almost one year old. I also think whether support for Cg compute shaders is implemented or not (as said, in the Cg language/runtime) will say much about whether Cg is dead or not. Also, what about bindless textures in Cg?
*CUDA 5.1: in late October I suggested to the NV team bringing CUDA level with OGL compute shaders, so support for compressed textures, depth textures, MSAA textures (even depth ones). Note some of these are in the OCL 1.2 extensions released at SA 2012. And also expose functionality similar to all the remaining new OCL 1.2 extensions, where support is available in HW or easy to do in the runtime, like terminate kernel, the out-of-bounds stuff, memory initialization etc.
One thing that I forgot at the time:
Expose atomic counters (they are now shipping in the OGL compute world) in CUDA and OCL (like AMD does in OCL). These are equivalent to atomadd(ptr,1) but an order of magnitude faster than global atomics, at least on Fermi (I don't know about Kepler), and they are the foundation of "hardware accelerated queues", no? I remember how, when NV readied the OGL 4.2 beta drivers, atomic counters were slow, and then after a month or so they got a tremendous speedup and earned a special instruction exposed in the NV OGL assembly language.
*CUDA compiler SDK: it seems to be going final, and I think it will be brought up to date with CUDA 5.1 or 6.0, whatever they end up naming the new CUDA release (I hope it also gets up-to-date LLVM/Clang integration, so 3.2 and/or 3.3).
*cuda.lang: well, I have wanted to play with this for a long time. The motivation is to bring more of an offline compilation model to CUDA, like OpenCL, and basically avoid needing VS installed on Windows for runtime compilation of CUDA kernels. It could be useful for dynamic compilation of Optix shaders (like OpenRL) and also for Nvidia research software like CUDAraster, VoxelPipe, etc.
*Shipping all the PhysX stuff from last GTC and GDC into production:
->apex 1.3 (bring realtime fracture support done entirely on gpu to existing RGB support)
->physx 3.3 (rigid bodies on gpu and perhaps even fracture like APEX)
I hope that at least by GDC, which is later, we will get all of these in beta form.
One annoying thing, for me at least, is that PhysX GPU interop with graphics APIs isn't available (although it is in APEX). What annoys me is that APEX is PhysX under the hood, so please also expose the GPU buffers with the simulation results of GPU modules like cloth, fluid, and soon rigid bodies.
*Optix 3.1 preview: bring some GK110 perf improvements. It seems current Optix doesn't exercise all its potential, judging from the perf numbers on the NV forums (barely better than GTX 680?).
*CUDA roadmap NDA discussion: well, it was a surprise anyway to see NV invite me to an NDA discussion of the future roadmap at GTC (I hope saying so isn't under NDA :-)). I can't attend, but I hope they will be talking about how to expose unified CPU/GPU in CUDA and potentially a new ISA, sm_40?
*volume render solution
*OCL 1.2 in drivers: well, with the OCL 2.0 spec perhaps coming at Siggraph, is it time to implement OCL 1.2 in NV drivers? In time for the new CUDA release?
*Nsight 3.0 final and 3.1 preview: after GLSL native debugging I want (really more than I need right now, but I will need it soon anyway) VS2012 support, OGL 4.3 support with compute shaders, and, my biggest desire, a unified host and device debugging experience like the one that ships in Nsight Eclipse Edition.
For Eclipse Edition I hope they add single-GPU debugging with software preemption, much like its older brother, and also OGL debugging. With that, GPU debugging is basically perfect for me on Windows and Linux, and all that remains is true GPU software preemption.

