GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Wednesday, 25 September 2013

What I expect from GCN2.0 and APU13 conference..

Posted on 06:27 by Unknown

Hi,

Today is the big day for AMD GPUs this year, and it seems we are getting a big-die chip (because it is still made on 28nm) to compete with Titan. It may well be only more ALUs and bandwidth plus some new ISA instructions, but since it is being named GCN 2.0 I expect at least the following not-so-commonly-mentioned things. (Note that all of this applies to Maxwell GPUs too, and to some extent to the upcoming Intel Gen8 and Gen9 GPUs. A shame there was no news about those at IDF this year; am I the only one who missed it?)

Docs: I expect a Volcanic Islands ISA guide, like the Bonaire ISA doc that was released earlier this year.

User software: NV ShadowPlay competition

Dev software: As was said, I hope the new HSA SDK comes this year, at least for the APU13 conference. OCL 2.0 beta builds seem premature, but I hope the AMD OCL driver finally exposes SPIR plus the new OCL 1.2 graphics extensions such as the MSAA and depth ones, and perhaps even the CL 2.0 mipmap one before the CL 2.0 betas. I would also like to see HW GPU debugging of OCL kernels added to their software.

I expect some Media SDK release that includes VCE usage plus fast framebuffer capture, as was said at last year's HSA conference, so basically something similar to the Grid SDK. Let's see if they make it public without NDAs.

GPU:

Obvious(?): nested parallelism (aka CUDA dynamic parallelism), SVM (also coming to Maxwell)

Not so obvious(? well, AMD listed some of this for 2014 in its roadmaps, let's see):

HW GPU debugging: like NV has had since pre-Fermi days! Note that Linux support currently requires the latest sm_35 devices, which take advantage of the next feature:

HW context switching, so we avoid GPU resets for long shaders. We could perhaps get some GPU process explorer where we can kill processes, but this seems better included as a GPU tab in Win9(?). It seems WDDM 1.2 lets the GPU driver report its context switch granularity; I expect GCN 2.0 and Maxwell to expose granularity up to a single HW ISA instruction (well

Texture support: support the ASTC format

Video decode: add 4K HEVC decode and expose it via a new DXVA 2.0 extension, and also to XvBA, VDPAU and VAAPI now that the radeon OSS drivers support video decode.

Note that in this case I also want VDPAU support from NV.

VCE: up to 4K encode (H.264 only, not HEVC, I expect) so that with the upcoming Nvidia ShadowPlay-like support for AMD GPUs in RadeonPro infinity we can capture

Mac world:

Note also that Mac OS currently lacks HW GPU debugging on NV GPUs in Nsight Eclipse for Mac. Let's see how it evolves, but I hope that Mac OS 11 or 10.10 adds support for all these new features: basically HEVC HW decode on GPUs that support it. I also hope it finally exposes public HW encode APIs, as there is currently neither Quick Sync nor NVENC for Mac. An ASTC OGL extension and some Xcode support for GPU HW debugging of OCL kernels on new GPUs would be awesome too.

As said, also expose a GPU tab in a process explorer with GPU process kill, etc.

The related APU13 sessions are posted. I don't know if that is all of them; here I'm listing the ones I found interesting, almost all from AMD workers:

(Note there seems to be no OGL session this year either, but it would be good if that meant we are missing some new GPU features that will be implemented somewhat later, perhaps optimized sparse textures and AMD_sparse_buffer?)

The HSA System Architecture Requirements: An overview

AMD Neural Networks Library

HSA Compilers Technology

WT-4070 unity webcl and webgl dx11 level?

ANGLE and cross-platform WebGL support es 3.0 status

AMD Video Compression Engine: The Route towards Low-Latency Cloud Gaming Solutions

Optimizing FFMPEG and Handbrake Using OpenCL™ and Other AMD HW Capabilities

An Introduction to SPIR for OpenCL™ Application Developers and Compiler Developers

Accelerating and Evaluating OpenCL™ Graph Applications

Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms

Automated CUDA-to-OpenCL™ Translation with CU2CL: What’s Next?

Optimizing Raytracing on GCN with AMD Development Tools

Designing a game audio engine for HSA

A Crash Course on the AMD GCN Architecture

Direct Compute in Gaming

Accelerate Crypto Applications with AMD Platform Security Processor

AMD Content Solution Enablement Strategy

LibreOffice Calc Spreadsheet Formulae Optimizations using OpenCL

I expect from these sessions:

Optimizing Raytracing on GCN with AMD Development Tools: some library released to compete with OptiX

AMD Neural Networks Library: more libraries to play with


Wednesday, 20 March 2013

2013: a good year for new API revisions and launches?

Posted on 11:05 by Unknown

Hi, I was thinking this could be a year with a lot of newly published specs related to the GPU world.
Just released:
*OpenMP 4.0rc2, with accelerator device targets added
*OpenACC 2.0, which brings GK110 CUDA 5.0 features to the directives world.
I expect OpenACC 2.0 support in the PGI implementation in June, which will also add support for MIC and AMD via OpenCL (so Intel HD5000 too in the Windows version?)
and perhaps CAPS too by summer.
Regarding OpenMP 4, I already see an Intel Fortran update with SIMD features, and perhaps there will be some beta by summer for Intel CPUs (and Xeon Phi?), maybe with production quality at SC13 in both the Intel 2014 compilers and some PGI beta.
Not so related to the HPC world, let's see:
*The final WebCL spec should be coming soon; let's see how fast they progress. I would like to see optional 1.2 support with all the new SA2012 exts for MSAA and depth sharing. What troubles me is the OpenCL extension mechanism for getting function addresses.
*WebGL 2.0 should be announced, exposing OGL ES 3.0 to the WebGL world. I think ANGLE support is lagging by a year. I would still want a version with a "desktop profile" with core support and version number selection (like 4.2 core) and optional extension usage if wanted.
*OpenCL 2.0 for Siggraph?
*DX 12? For the Windows Blue alpha this year at PDC (Oct-Nov 2012). I hope it is a major spec revision with new DirectCompute features (recursion, pointers, etc.).
*C++ AMP 2.0? A new C++ AMP with function calls, recursion, separate compilation and pointer support would be good.
Khronos is also expected to bring this year:
*StreamInput, for unified Leap Motion, Oculus and Kinect support among other things
*An OpenCV-like API
Also next year, with Maxwell expected to support DX12:
*OGL 5
*CUDA 6
and for mobile platforms we expect at least OGL ES 4.0.


Nvidia GTC thoughts: ARM,roadmap,demos..

Posted on 06:24 by Unknown

The last day's announcement of full OGL 4.3 support on Tegra products on ARM is huge.
I assume Nvidia will also bring all of its supported extensions, such as the bindless exts, VBUM (pointer support in GLSL), direct state access, etc., and not only core OGL 4.3.
Related to this, Nvidia exposes some features not exposed in DX11, like support for writes to image2DMS objects, which would be like RWTexture2DMS if that existed in DX11 (note that only RWTexture2D and Texture2DMS exist).
Nvidia showed it on Ubuntu, and one question is whether they will (or can?) bring all that goodness to the Android world. I suspect that, if not, they will at least bring it via a lot of extensions to the OGL ES driver, similar to what they are doing for Tegra 4 this year.
Also note they showed the 319 driver series, and I hope this will bring EGL support in NV drivers, at least for Linux (related to the Mir efforts on Ubuntu), and not only EGL with OGL ES support like AMD has currently, but the full OGL profile in EGL, and as said for Windows too. Note I'm also expecting Optimus in the Linux world soon, so that could be the series.
Note Nvidia showed a demo running OptiX on ARM (their blog says it was a one-day port), CUDA 5.0 demos, and finally IslandGL. This is a port of Island, one of the Fermi DX11 demos, which is available to download as a demo from Nvidia and features heavy tessellation usage. With that it becomes clear why NV has a renewed interest in bringing out an OGL 4.0 SDK. I hope almost all techniques in the NV DX11 SDK get ported, stochastic transparency to name one.
With that, I recap the tooling support needed so that we experience a seamless transition from the PC world to the Tegra world:
*Cg for ARM: bring Cg to the ARM world, or at least the offline GLSL 4.3 compiler that cgc has.
*OGL 4.0 SDK for ARM: as said, a new OGL SDK is in the works and an ARM version would be good.
*OpenCL for ARM: Nvidia doesn't talk about it, but I think market pressure will force them to port it to ARM too. Note there is still no OCL 1.2 with the new exts from end of 2012.
*TXAA: with OGL 4.3 and AAA games becoming the norm in the Android and iOS markets, it is a matter of time until one wants TXAA, which Tegra 5 will have, exposed in OGL.
*OptiX: well, it was shown, so it is ported.
*PhysX and APEX: PhysX is already shipping for Android. It will be interesting to see if they enable GPU support once Tegra 5 ships with a CUDA GPU (I suspect yes, expecting 128/192 shaders, which G80 also had and which they supported). More interesting will be to see if APEX (right now Windows only) gets ported, as that brings turbulence and GPU rigid bodies with fracturing; the former is being used a lot lately in F2P games.

*Also note that this will bring full DX11 to Windows on ARM products, which is impressive. I'm not certain whether it is DX11.1 or not; the GPU shipping in Kayla could be GK208, a 3rd-gen Kepler according to the press, so that feature could be in.

Lastly, and slightly unrelated, I would like NV to release its impressive demos of the last two or three years as executables so we can test them in house, like:
GTC 2010
Lighthouse
GTC 2012
Fracture raytraced demo
Optix water demo
GTC 2013
Wave works (perfect storm demo)
Face works

Some minor things to say about the roadmap:
*Maxwell PC GPU: they said it will have unified virtual addressing, and that Tegra 6 will have a Denver CPU and a Maxwell GPU. But will Maxwell desktop GPUs have a CPU? And in that case a Denver (i.e. some form of ARM64) one? I suspect yes, as otherwise it makes little sense to talk about unified virtual addressing. But why not say so directly for the Maxwell PC product? Also, if it's a Q1/Q2 2014 product, it may be a bit premature.
They also forgot to mention context switching and preemption on the GPU, which I'm expecting too.
*Volta GPU: I expected it to be called Einstein; after all, that codename was mentioned by Dally. Could that be a later product, or is it an Echelon codename? Anyway, that would also add very efficient interconnects (perhaps 3-6X bytes/s/watt improvements).

After all, it also seems a new DX is in the works, with the spec possibly published this year. The question is whether it will be supported on the first Maxwell products or not, i.e. on GM10x or GM11x products, as early 2014 seems a little soon. Perhaps the first Maxwell could arrive in H2 2014 after all, and that could bring a good release with:
*DX11.2 or DX12 support and equivalent OGL support (5.x?)
*Hierarchical two-level warp dispatcher
*Unified register/shared mem/L1 pool
*Scalar ALU next to the vector ALUs, like the AMD 7xxx series
*CPU with UVA access
*Context switching and GPU preemption


Monday, 4 March 2013

My wishes for OCL 2.0

Posted on 14:28 by Unknown

Hi,
I think it's time to publish my OpenCL 2 requests so that they may be considered for inclusion.
I'm not requesting what will hopefully be in it anyway, like C++ extensions, etc.
Whether they plan to support existing GPUs or not will determine if some of these can be included. Anyway, getting a cl_khr or cl_ext extension would be good.
But first, a good reminder of things that are still to be implemented:

*I would like to start seeing cl_ext_device_fission implemented on GPUs. That should be doable on AMD 7xxx GPUs, but on NV GK110 still not? Even better, it seems the new AMD Sea Islands should support partitioning into up to 8 sub-GPUs.
*Implementing the new OCL 1.2 extensions, like the graphics ones for MSAA and depth access.

For OpenCL 2 it would be good to have:
*Atomic counters (cl_ext_atomic_counters_32) in core. They provide an order of magnitude improvement vs global atomics, at least on old D3D11 HW (Fermi, AMD 5xxx series), and are the foundation of HW-accelerated queues.
*Kernels that can send interrupts to the CPU and/or initiate host system calls. That seems to have been coming for a while; I think even the Fermi whitepaper suggested it, but it is still not available. AMD SI supports SEND_MSG in its ISA, as Lottes suggests in his blog, so AMD should be able to do it too.
*Warp/wavefront vote functions: these functions have been in NV HW since GTX 2xx (2008). They are useful, for example, in what is currently the best dynamic memory allocator for GPUs; see "Fast Dynamic Memory Allocator for Massively Parallel Architectures", which says:
"The used hardware must provide a voting function for an efficient implementation", so it seems an OpenCL port will need that exposed.
*Dynamic parallelism: that should be expected now that GK110 is shipping, and it also seems SI could support some limited form of it, as shown in an AFDS session.
*Named barriers: these have been shipping in CUDA since Fermi days and can be used for warp specialization, as in the CudaDMA project, which can bring better memory bandwidth exploitation in some apps. As shown in the HPP study, they can also bring support for "true function composability", i.e. GPU functions that use barriers can call other GPU functions that use barriers without breaking expected usage; see the HPP paper by Gaster et al.
*Cross-vendor multi-GPU support like the CUDA P2P functionality: i.e. memory from one GPU directly addressable by another GPU from a kernel without a previous copy (also present as cl_amd_bus_addressable in AMD OCL)
*Exposing some common intra-warp/wavefront ops? (like the existing NV shuffle; it makes more sense that ones like median and min/max could be beneficial for platforms like Xeon Phi but not on GPUs)
*Exposing some cross-vendor multimedia ISA extensions? i.e. some common ground between cl_amd_media_ops/cl_amd_media_ops2 and the PTX SIMD instructions. This can be good jointly with interop with video decoders and encoders for accelerated video processing, and NV even uses them in their fast raytracing kernels.
*Finally, reaching parity with the existing compute exposure in graphics APIs such as OGL 4.3/D3D 11 compute shaders. As said, atomic counters were one thing,
->another being the new gather4 instructions.
->DispatchComputeIndirect: i.e. the ability to launch a kernel with the total work size being fetched from GPU memory. It's more efficient for variable-work kernels that depend on work generated by a previous kernel, since we avoid a CPU round trip. Note that this could be done with the new dynamic parallelism, so perhaps it doesn't need to be exposed.
->Promote the MSAA and depth extensions into core
->Mipmap support like in CUDA 5
->Compressed texture format support
->A cross-vendor extension for bindless support (assuming it will get broad support in coming years)
->A cross-vendor extension for sparse texture/buffer support.
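
To make the vote-function request in the list above more concrete, here is a minimal host-side C++ sketch of what ballot/any/all compute over one warp. It is only a reference model, not GPU code: the 32-lane width and the names `WarpPredicates`, `warp_ballot`, `warp_any`, `warp_all` and `warp_vote_count` are assumptions made for illustration, not any vendor's API.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // assumed warp width (hypothetical constant)

// One predicate value per lane of a simulated warp.
using WarpPredicates = std::array<bool, kWarpSize>;

// ballot: bit i of the result is set when lane i's predicate is true.
std::uint32_t warp_ballot(const WarpPredicates& pred) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (pred[lane]) mask |= (1u << lane);
    }
    return mask;
}

// any: true when at least one lane votes true.
bool warp_any(const WarpPredicates& pred) { return warp_ballot(pred) != 0u; }

// all: true when every lane votes true.
bool warp_all(const WarpPredicates& pred) { return warp_ballot(pred) == 0xFFFFFFFFu; }

// Number of lanes voting true, e.g. how many slots a warp needs to reserve
// with a single atomic operation instead of one per thread.
int warp_vote_count(const WarpPredicates& pred) {
    const int n = static_cast<int>(std::bitset<32>(warp_ballot(pred)).count());
    assert(n >= 0 && n <= kWarpSize);
    return n;
}
```

An allocator can use the vote count to let one lane reserve memory for the whole warp, which is presumably why the quoted paper treats the voting function as a hardware requirement.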

To finish, also expose advanced control of ld/st operations, such as cache modifiers and even using the texture path (in GK110).

Finally, it seems future GPUs could support unified register/local memory, so explicit size control for better optimization would be good. It also seems local memory could be allocated dynamically inside a kernel via an extra argument to the barrier function, so an extension to the barrier operator would be good. And since a scalar processor is present on recent archs, although it may be intended for executing common scalar code in a kernel (extracted by the compiler), it could also be exposed for direct programmability.

Not coming shortly(?), but for me, with atomic counters possibly bringing very fast queues, and with all graphics functionality exposed in OpenCL kernels as said above, the primary targets to expose are the rasterizer, Z-buffer and ROP functionality:
*The most interesting for me is exposing the Z-buffer. The GPUDet paper shows a usage of it.
*Exposing the rasterizer, what could that be? Perhaps, via a generalized dynamic parallelism, a function that takes a buffer of "geometry" to rasterize and some kernel that would be called over some specified grid size (8x8 tiles?). All in all it seems somewhat crazy.

More thoughts?


Thursday, 28 February 2013

What I'm expecting from GTC..

Posted on 17:37 by Unknown

Well, I really think I'm expecting too much, although in the form of a lot of minor improvements to its software products (so I'm not expecting info on new architectures (Maxwell), etc.). Anyway, I have compiled a list of things so I can check later whether NV is doing its work or not :-)
Of course, it will be good even if all these pieces fall into place over, say, H1 2013.

*nvfx: a new effects system, open source, with cross-vendor support, etc. It was announced at Siggraph and has an empty GitHub site, and there is also a talk at GTC, so there is no better place and moment to upload it to GitHub. This system also uses a more efficient OGL state management ext called NV_state_object, better aligned with DX10-11 state management via objects, so it also seems closer to DSA-style management.
*The only consumer HW info may be about the GK114/6 archs, which may bring some new things. Note that even Titan has no DX11.1 profile support, so I am hoping for support before Maxwell, say in GK114 and similar new 680 replacements, which would mean some minor arch enhancement on the graphics side.
There could also be one more thing, see the next point. To put it briefly: dynamic parallelism everywhere, and from anywhere to anywhere (ANYWHERE={CUDA,OGL}).
*New OGL exts: NV_state_object (DX11-like state objects) and some kind of dynamic parallelism for graphics APIs.
This is interesting, as there is a patent on it and it is about exposing dynamic parallelism in the graphics world, which implies OGL in the near future, i.e. graphics shaders can create new draw calls and put them on the dispatch manager queue.
Also, for completeness, what is holding NV back from exposing launching graphics work from compute kernels and dispatching compute kernels from graphics shaders? Note that NV_state_object seems much needed in two cases (the CUDA->OGL and OGL->OGL dispatch draw call cases), as some state environment is needed there: CPU APIs are not useful since it is GPU work, and the default OGL state may not be useful.

Also, please upload documentation on NV_GPU_shader5_memory_extended, shipping in the 313 drivers, although I suspect it is for exposing cache modifiers on load/store operations that are already supported in CUDA, like load non-cached, load cached, etc.
Note my previous post asking NV to expose all compute functionality (ISA richness in this case) to OGL compute shaders, at least via the currently lacking PTX ISA instructions and also via some asm() function (which is already reserved in GLSL and usable in OCL kernels on NV and even AMD! This is new to me; I found out last month that you can use AMDIL, although I haven't been able to get exposing the clock cycle counter to work yet).
*Grid SDK: I'm interested in the frame capture APIs, not the cloud stuff. Relatedly, I see OGL support for NVENC is being implemented, so some update for NVENC would be good.
*OpenGL SDK: one seems overdue (exposing advanced use cases of OGL 4.x features), and a tessellation sample was released early this year. A deferred+ sample would be good.
*Cg 3.2: I want GLSL 4.3 support integrated into Cg for some things I'm working on, and Cg 3.1 is almost one year old. I also think whether support for Cg compute shaders is implemented or not (as said, in the Cg language/runtime) will say much about whether Cg is dead or not. Also, what about bindless textures in Cg?
*CUDA 5.1: in late October I suggested to the NV team bringing CUDA to parity with OGL compute shaders, so support for compressed textures, depth textures, MSAA textures (even depth ones). Note some of these are in the OCL 1.2 exts released at SA 2012. Also expose functionality similar to all the remaining new OCL 1.2 exts where support is available in HW or easy to do in the runtime, like terminate kernel, out-of-bounds handling, memory initialization, etc.
One thing that I forgot at the time:
Expose atomic counters (now shipping in the OGL compute world) in CUDA and OCL (like AMD does in OCL). These are equivalent to atomicAdd(ptr,1) but an order of magnitude faster than global atomics, at least on Fermi (I don't know about Kepler), and they are the foundation of "hardware accelerated queues", no? I remember that when NV readied the OGL 4.2 beta drivers, atomic counters were slow, and then after a month or so they got a tremendous speedup and earned a special instruction exposed in the NV OGL assembly language.
*CUDA compiler SDK: it seems to be going final, and I think it will be brought up to date with CUDA 5.1 or 6.0, whatever they end up naming the new CUDA release (I hope it also gets an up-to-date LLVM/Clang integration, so 3.2 and/or 3.3).
*cuda.lang: I have wanted to play with this for a long time. The motivation is to bring a more OpenCL-like compilation model to CUDA and basically avoid needing VS installed on Windows for runtime compilation of CUDA kernels. It could be useful for dynamic compilation of OptiX shaders (like OpenRL) and also for Nvidia research software like CUDAraster, VoxelPipe, etc.
*Shipping all the PhysX stuff from the last GTC and GDC into production:
->APEX 1.3 (brings realtime fracture support done entirely on the GPU to the existing RGB support)
->PhysX 3.3 (rigid bodies on the GPU and perhaps even fracture like APEX)
I hope that at least by GDC, which is later, we will get all of these in beta form.
One annoying thing, for me at least, is that PhysX GPU interop with graphics APIs isn't available (although it is in APEX). What annoys me is that APEX is PhysX under the hood, so please also expose the GPU buffers holding the simulation results of GPU modules like cloth, fluid, and soon rigid bodies.
*OptiX 3.1 preview: bring some GK110 perf improvements. It seems the current OptiX doesn't exercise its full potential, judging from perf numbers on the NV forums (barely better than a GTX 680?).
*CUDA roadmap NDA discussion: it was a surprise to see NV invite me to an NDA discussion of the future roadmap at GTC (I hope saying so isn't under NDA :-)). I can't attend, but I hope they will be talking about how to expose unified CPU/GPU in CUDA and potentially a new ISA, sm_40?
*Volume rendering solution
*OCL 1.2 in drivers: with the OCL 2.0 spec perhaps coming at Siggraph, isn't it time to implement OCL 1.2 in the NV drivers? In time for the new CUDA release?
*Nsight 3.0 final and 3.1 preview: after native GLSL debugging I want (really more than I need right now, but I will need it soon anyway) VS2012 support, OGL 4.3 support with compute shaders, and, my biggest desire, a unified host and device debugging experience like the one that ships in Nsight Eclipse Edition.
For Eclipse Edition I hope they add single-GPU debugging with software preemption, much like its older brother, and also OGL debugging. With that, GPU debugging is basically perfect for me on Windows and Linux, and all that remains is true GPU software preemption.
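
As a rough illustration of why atomic counters are called the foundation of queues in the list above, here is a minimal host-side C++ sketch of an append-only queue where a single atomic increment hands out slots. `AppendQueue` and its members are hypothetical names, and `std::atomic` only stands in for the GPU counter; it says nothing about how fast the hardware path is.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical append-only work queue: the counter hands out unique slots.
struct AppendQueue {
    std::vector<std::uint32_t> slots;
    std::atomic<std::uint32_t> count{0};

    explicit AppendQueue(std::size_t capacity) : slots(capacity) {
        assert(capacity > 0);
    }

    // Reserve one slot with an atomic increment (the atomicAdd(ptr, 1)
    // pattern), then write the item into it. Returns false when full.
    bool push(std::uint32_t item) {
        const std::uint32_t idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx >= slots.size()) return false;
        slots[idx] = item;
        return true;
    }

    // Number of items stored; the raw counter may overshoot the capacity
    // after failed pushes, so it is clamped.
    std::size_t size() const {
        return std::min<std::size_t>(count.load(), slots.size());
    }
};
```

Every producer gets a distinct index from the increment, so no locking is needed around the write itself; the only contended operation is the counter.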


Monday, 10 December 2012

Exposing advanced PTX ISA in OGL compute..

Posted on 18:29 by Unknown

Well, this is an extended answer to Timothy's post at http://timothylottes.blogspot.com.es/2012/12/storing-objects-on-gpu.html

Hi Tim,
well, there is a lot of compute stuff currently not present in OGL compute shaders. You have really opened Pandora's box here.
Well, seriously:
first, please excuse the length of this comment, but I hope it motivates the Nvidia OGL team to implement a "simple" way of exposing "all" current ISA functionality available at the PTX level.
Also, before you get tired and bored of reading, I would like to ask you to confirm how you measure the graphics->compute->graphics context switch: is it simply something like Draw(); DispatchCompute(); Draw(); with an empty compute kernel, as you say? Also, should something like DrawIndirect(); DispatchIndirect(); DrawIndirect(); lower the context switching time?
As you say, some native instructions aren't present, and neither are some special registers (in PTX parlance, chapter 9 of the PTX manual) or some CUDA runtime-level functionality.
What I'm going to show now are use cases where low-level PTX ISA instructions and special registers are presently used in very low-level optimized GPGPU codes, most of them originating from Nvidia research guys.
After all those examples my conclusion is, IMHO: as Nvidia currently supports assembly-level shaders (NV_gpu_shader5, NV_gpu_compute_shader5), and even CUDA doesn't make the effort to expose everything from the PTX world to "high level" CUDA C kernels, NV should follow the CUDA approach and do two things:
1. Expose all PTX instructions and special registers that are currently not exposed to NV assembly shaders, maybe as GL_NV_compute_program5_extended or something like that.
The idea here is that it is overkill to do something like the NV_shader_atomic_float extension just to expose atomic floats (although this extension does more, and defines GLSL functions in addition to the ISA ADD.F32 variant in shader assembly:
float atomicAdd(float *address, float data);
but this limitation can also be avoided by adding only one more extension).
2. Expose, as motivated below, an asm() function in OGL shaders (like NV_gpu_inline_asm) that works exactly like the one currently supported in CUDA kernels (and even OpenCL! since 4.x(1?) drivers).
   Note that this is how OpenCL is able to exploit the ISA enhancements of new GPU cores as much as CUDA does.
   For OpenCL see the inlinePTX kernel code -> asm("mov.u32 %0, %%laneid;" : "=r"(laneid));
With that, as said, the NV_shader_atomic_float extension would not be needed, using:
#pragma extension NV_gpu_inline_asm
float atomicAdd(float *address, float data)
{
    float v; asm("ATOM.ADD.F32 %0, %1, %2;" : "=f"(v) : "l"(address), "f"(data)); return v;
}
Some low-level optimized CUDA codes that come to my mind are from NV research, such as cudaraster and raytracing, and also http://code.google.com/p/cudadma/. These are examples showing that the possibility of making optimized ports of these codes is tied to being able to access these low-level instructions, as you found with MGPUsort.

*CudaDMA:
This code uses generalized named barriers, which also allow not locking between different barriers (SM20 and up), and I think counting threads that have passed and are waiting at a barrier (__syncthreads_count()). See bar in the PTX manual, bar.arrive and the example PTX code.
The project home now also shows GK110 instruction usage ->
" The new instances will make use of the ldg intrinsic for supporting many more outstanding loads in flight by issuing them through the texture cache"
*cudaraster/raytracing/voxelpipe NVR projects (please note these are especially interesting, as they relate to forward-looking graphics/compute integration, and an implementation in OGL compute should allow tighter integration and possibly bigger improvements to graphics, assuming hardware queues between stages someday get exposed in OpenGL):
We find code like this:
__declspec(__device__) __inline U32   getLaneMaskLt           (void)                  { U32 r; asm("mov.u32 %0, %lanemask_lt;" : "=r"(r)); return r; }
->the latest version, optimized for Kepler, also uses the new vector SIMD instructions (see the HPG 2012 poster)
They also seem to define prmt:
__declspec(__device__) __inline U32   prmt     (U32 a, U32 b, U32 c)   { U32 v; asm("prmt.b32 %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
__declspec(__device__) __inline U32   add_cc                  (U32 a, U32 b)          { U32 v; asm("add.cc.u32 %0, %1, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
add_cc is add with carry for bignum computations. One thing I'm sure of is that addc isn't exposed in the variant where the carry is stored in the special carry flag bit (CC.CF), but I'm not sure whether graphics shaders currently expose it (I think yes, but the carry must be saved as an integer and then added, so that sequence for a 128-bit addition isn't as efficient):
add.cc.u32 x1,y1,z1; // extended-precision addition of
addc.cc.u32 x2,y2,z2; // two 128-bit values
addc.cc.u32 x3,y3,z3;
addc.u32 x4,y4,z4;
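
For readers less used to PTX, the four-instruction chain above can be modelled on the host as follows. This is a minimal C++ sketch, assuming 128-bit values stored as four 32-bit limbs with the least significant limb first; `U128`, `add_with_carry` and `add128` are illustrative names, and the explicit `carry` variable plays the role of the CC.CF flag.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 128-bit value as four 32-bit limbs, least significant limb first (assumed layout).
using U128 = std::array<std::uint32_t, 4>;

// Models add.cc / addc.cc: adds a, b and the incoming carry, and writes the
// outgoing carry back (the role played by the CC.CF flag in PTX).
std::uint32_t add_with_carry(std::uint32_t a, std::uint32_t b, std::uint32_t& carry) {
    assert(carry <= 1u);
    const std::uint64_t wide =
        static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b) + carry;
    carry = static_cast<std::uint32_t>(wide >> 32);
    return static_cast<std::uint32_t>(wide);
}

// x = y + z over 128 bits; the carry out of the top limb is dropped,
// matching the final addc.u32 in the sequence above.
U128 add128(const U128& y, const U128& z) {
    U128 x{};
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = add_with_carry(y[i], z[i], carry);
    }
    return x;
}
```

Without a hardware carry flag, each limb needs the widening or compare-and-add work shown in `add_with_carry`, which is the inefficiency the paragraph above refers to.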

Another interesting feature used in some papers is bypassing or forcing cached loads per instruction, etc. (8.7.6.1 Cache Operators). Found in cudaraster too:
__declspec(__device__) __inline uint4 cachedLoad              (const uint4* p)        { uint4 v; asm("ld.global.ca.v4.u32 {%0, %1, %2, %3}, [%4];" : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w) : "l"(p)); return v; }
__declspec(__device__) __inline void cachedStore             (U32* p, U32 v)         { asm("st.global.wb.u32 [%0], %1;" :: "l"(p), "r"(v)); }
__declspec(__device__) __inline U32   uncachedLoad            (const U32* p)          { U32 v; asm("ld.global.cg.u32 %0, [%1];" : "=r"(v) : "l"(p)); return v; }
__declspec(__device__) __inline void uncachedStore           (uint4* p, uint4 v)     { asm("st.global.cg.v4.u32 [%0], {%1, %2, %3, %4};" :: "l"(p), "r"(v.x), "r"(v.y), "r"(v.z), "r"(v.w)); }
Finally, some codes use the %clock register and its %clock64 variant for true kernel cycle counting.
We only need to look at the PTX manual, as said. For me the interesting missing instructions are (sorry if something is already present):

mul24 (especially useful for thread indices), sad (for video operations), bfind, popc and popcll (like clock and others, here we have a 64-bit instruction, which allows 2x faster processing).
The last one was also shown to be of good use in computer vision (object recognition): http://nvidia.fullviewmedia.com/gtc2010/0923-c-2209.html (sadly no slides, only video). There are also the new Kepler shfl instructions, which NV said improved the perf of the galaxy code shown at GTC,
and special registers:
%lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt
%clock, %clock64
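
As a sketch of what the lane mask registers contain and why codes like cudaraster read them, here is a small host-side C++ model. It assumes a 32-lane warp; the function names simply mirror the register names, and `rank_in_ballot` is a hypothetical helper showing the common popc(ballot & lanemask_lt) idiom used for stream compaction.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Host-side models of the lane mask registers for a 32-lane warp (assumed width).
std::uint32_t lanemask_eq(unsigned lane) { assert(lane < 32u); return 1u << lane; }
std::uint32_t lanemask_lt(unsigned lane) { assert(lane < 32u); return (1u << lane) - 1u; }
std::uint32_t lanemask_le(unsigned lane) { return lanemask_lt(lane) | lanemask_eq(lane); }
std::uint32_t lanemask_ge(unsigned lane) { return ~lanemask_lt(lane); }
std::uint32_t lanemask_gt(unsigned lane) { return ~lanemask_le(lane); }

// Rank of a lane among the lanes that voted true: count the set ballot bits
// strictly below it. This gives each voting lane a dense output index.
unsigned rank_in_ballot(std::uint32_t ballot, unsigned lane) {
    return static_cast<unsigned>(std::bitset<32>(ballot & lanemask_lt(lane)).count());
}
```

On the GPU the mask is a single register read and the count is a single popc, which is why having both exposed matters for this kind of code.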
I'm not even asking about more general kernel features such as dynamic malloc, printf support, true function calls in kernels, etc.
Really, function calls and recursion are a wonderful example of how simple it is to add support for things by adding only assembly language extensions and using asm blocks, without changing the GLSL parser:
asm(
".func foo { call foo; .. ret; } … call foo;")
nor about advanced CUDA runtime compute features such as:
*tuning L1 cache size vs shared memory size (16/48, 32/32, 48/16 currently),
*how to expose concurrent kernels (really useful in sm_35) to OpenGL without a stream model
*host pinned memory (using CPU memory from a kernel on the fly without transfers) -> should enable out-of-core compute algorithms
(that's also interesting, as AMD has an extension for it, GL_AMD_pinned_memory, which works even on discrete GPUs; searching a while ago I found NV_sysmem_buffer from NV, but after hacking code to use it, it seems non-functional)
*P2P functionality for accessing memory of other GPUs from kernels, etc.
(note this is another interesting thing, as AMD currently exposes it in the OGL world as GL_AMD_bus_addressable_memory, though it seems FirePro only, so it would be possible once GL 4.3 drivers ship; they also enable it in the OCL world this month with CL_AMD_bus_....). They call it DirectGMA.
Also lacking vs CUDA/OpenCL, though it should be addressed by the OpenGL ARB as it isn't NV specific, is allowing the work group size to be set programmatically, e.g. by not requiring the local_size_x variables to be constant (a new extension like GL_NV_uniform_local_size):

uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;

or better, a new function
DispatchComputeSetLocalSize(globx,globy,globz,locx ,locy,locz)
equivalent to what CUDA supports
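
Since glDispatchCompute takes a number of work groups rather than a global size, a call shaped like the proposed DispatchComputeSetLocalSize would need to turn (global, local) sizes into group counts. Here is a minimal C++ sketch of that arithmetic; `Dim3` and `groups_for` are hypothetical names, and rounding up (so that a partial last group is still launched) is an assumption about how such a call would behave.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sizes along x, y, z (hypothetical helper type).
using Dim3 = std::array<std::uint32_t, 3>;

// Number of work groups needed to cover `global` work items with groups of
// `local` items per axis, rounded up without risking overflow.
Dim3 groups_for(const Dim3& global, const Dim3& local) {
    Dim3 groups{};
    for (std::size_t i = 0; i < groups.size(); ++i) {
        assert(local[i] > 0u);
        groups[i] = global[i] / local[i] + (global[i] % local[i] != 0u ? 1u : 0u);
    }
    return groups;
}
```

With rounding up, the kernel itself still has to skip work items whose index falls beyond the global size, as CUDA kernels usually do.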

And finally, something also needed for serious compute, similar to its usage in HPC programs, is allowing unroll hints so the compiler unrolls selected loops:
#pragma unroll
--END of answer--

That shows that in some aspects the OGL compute support exposed in 4.3 is really very simple, somewhat like CUDA 1.0, and even worse in some aspects, especially in regard to exposing the ISA of current NV compute cores.
For example, some CUDA codes use %clock to measure the cycle count of their kernels. That was supposedly shipping in CUDA 1.0, although using PTX code. Today we can use asm() blocks to define a clock function in CUDA C code and use it much like in the CPU world, where asm() blocks are used around the TSC counter.
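
The pattern is the same on either side: read a counter, run the work, read the counter again and subtract. As a portable CPU-side analogue, here is a minimal C++ sketch that uses std::chrono::steady_clock as a stand-in for the TSC or %clock (so it reports nanoseconds, not cycles); `elapsed_ns` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Read the clock, run the work, read the clock again and return the
// difference in nanoseconds. steady_clock is monotonic, so the result
// cannot be negative.
template <typename Work>
std::int64_t elapsed_ns(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const std::int64_t ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    assert(ns >= 0);
    return ns;
}
```

Inside a kernel the equivalent reads would bracket the code of interest per thread, which is what the %clock register makes possible.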


Saturday, 10 July 2010

Some news!

Posted on 11:16 by Unknown

News:
*GPU Computing Gems 1 (or GPU Gems 4) source code is already available at gpucomputing.net:
The book is due in November.
Right now:

Title
A Programmable Graphics Pipeline in CUDA for Order Independent Transparency	1 new	07-10-2010
High Performance Iterated Function Systems	0 new	07-02-2010
CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm	0 new	07-01-2010
Connected Component Labeling in CUDA - demo+code	0 new	06-30-2010
A Practical Guide to Massively Parallel Monte Carlo Simulations: The Ising Model	0 new	06-30-2010
Parallel LDPC Decoding using CUDA	0 new	06-30-2010
Path Regeneration for Random Walks	0 new	06-30-2010
GPU Gems 4: Deformable Volumetric Registration using B-splines Source Code	0 new	06-30-2010
Monte Carlo Photon Transport on the GPU	0 new	06-30-2010
Lattice-Boltzmann Lighting Models - Source Code	0 new	06-30-2010
RNA folding GPU	0 new	06-30-2010
Haar Classifiers for Object Detection with CUDA: Pixel-parallel processing kernel	0 new	06-29-2010
Multiclass Support Vector Machine	0 new	06-29-2010
Parallelization of the x264 encoder using OpenCL	0 new	06-21-2010
Cone-Beam CT image reconstruction using the Katsevich Algorithm	0 new	06-21-2010
Line forward projection on CUDA	0 new	06-11-2010

It seems MareNostrum is getting a rack of Fermis, perhaps with IBM Power7.

So would Nvidia now have to publish a CUDA driver for the PowerPC architecture?

Or they could use PathScale with a fully open-source computing stack,
available here as a branch from nouveau:

http://github.com/pathscale/pscnv/commits/master

It seems an Nvidia TCC-supporting driver for Fermi, version 197.81, is on the IBM web site.

Catalyst 10.8 beta seems to be available; 10.7 is coming on 21/7.

PhysX 3.0 is coming with CPU improvements:
*auto threading
*SSE enabled by default
Mafia has new runtimes: NVIDIA PhysX driver 10.04.02_9.10.0522.
Mueller has posted a paper on the Fermi launch demo, which uses water height fields plus particles.
Two other interesting papers from Nvidia Research are:

HLBVH: Hierarchical LBVH Construction for Real-Time Ray Tracing
PantaRay: Fast Ray-traced Occlusion Caching of Massive Scenes

A Hwu-based course from Stanford:
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Two interesting conference programs are available:

PACT
has an Intel GPU paper, demystifying ..
and also Revisiting Sorting for GPGPU Stream Architectures,
which achieves nearly 500 Mkeys/s on GT200.

There is also a workshop on GPUs:
http://informatik.technikum-wien.at/gpusca/
but the web site doesn't work.

The Nineteenth International Conference on
Parallel Architectures and Compilation Techniques (PACT)
Vienna, Austria, September 11-15, 2010

Interesting papers:
Scalable Thread Scheduling and Global Power Management for Heterogeneous Many-Core Architectures
Dynamically Managed Multithreaded Reconfigurable Architectures for Chip Multiprocessors
WAYPOINT: Scaling Coherence to Thousand-core Architectures
Scalable Hardware Support for Conditional Parallelization
Less is More: Trading off Work-Efficiency for Scalability in Irregular Programs
Revisiting Sorting for GPGPU Stream Architectures
D. Merrill, A. Grimshaw
An Integer Programming Framework for Optimizing Shared Memory Use on GPUs
W. Ma, G. Agrawal
DMATiler: Revisiting Loop Tiling for Direct Memory Access
A Software-SVM-based Transactional Memory for Multicore Accelerator Architectures with Local Memory
Automatic Vector Instruction Selection for Dynamic Compilation
An OpenCL Framework for Heterogeneous Multicores with Local Memory

SC10

I would like to review these papers:
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
Parallel Fast Gauss Transform
Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers
The Multi-Scale Heart Simulation on Massively Parallel Computers
Using 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
Exploiting 162-Nanosecond End-to-End Communication Latency on Anton
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Scalable Graph Exploration on Multicore Processors
The 48-core SCC processor: the programmer’s view
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture
Reducing Multicore Bandwidth Requirements for Combinatorial Multigrid
Diagnosis, Tuning and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Scaling Hierarchical N-Body Simulations on GPU Clusters
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

