Exposing advanced PTX ISA in OGL compute.. ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

This is an extended answer to Timothy Lottes' post at http://timothylottes.blogspot.com.es/2012/12/storing-objects-on-gpu.html

Hi Tim,
a lot of compute functionality is currently missing from OGL compute shaders; you have really opened Pandora's box here.
First, please excuse the length of this comment; I hope it motivates the Nvidia OGL team to implement a "simple" way of exposing "all" of the current ISA functionality that is available at the PTX level.
Also, before you get tired and bored of reading, I would like to ask you to confirm that the graphics->compute->graphics context switch you measured is simply something like Draw(); DispatchCompute(); Draw(); with an empty compute kernel, as you say. Would something like DrawIndirect(); DispatchIndirect(); DrawIndirect(); lower the context switching time?
As you say, some native instructions are missing, as are some special registers (in PTX parlance; see chapter 9 of the PTX manual) and some CUDA runtime level functionality.
What I am going to show now are use cases of low-level PTX ISA instructions and special registers that are used today in very low-level optimized GPGPU codes, most of them originating from the Nvidia research guys.
After all those examples my conclusion is, IMHO: since Nvidia currently supports assembly level shaders (NV_gpu_shader5 NV_gpu_compute_shader5), and since even CUDA does not make the
effort to expose everything from the PTX world to "high level" CUDA C kernels, NV should follow the CUDA approach and do two things:
1. expose all PTX instructions and special registers that are not currently exposed to NV assembly shaders, maybe as GL_NV_compute_program5_extended or something like that.
the idea here is that it is overkill to do something similar to the NV_shader_atomic_float extension just to expose atomic floats (although that extension does more and defines
GLSL functions in addition to the ISA ADD.F32 variant in shader assembly:
float atomicAdd(float *address, float data);
), but this limitation can also be avoided by adding only one more extension.
2. expose, as motivated below, an asm() function in OGL shaders (something like NV_gpu_inline_asm) that works exactly like the one currently supported in CUDA kernels (and even in OpenCL, since the 4.x(1?) drivers).
   Note that this is how OpenCL is able to exploit the ISA enhancements of new GPU cores as fully as CUDA does.
   For OpenCL see the inlinePTX kernel code -> asm("mov.u32 %0, %%laneid;" : "=r"(laneid));
With that, as said, the NV_shader_atomic_float extension would not be needed; one could write:
#pragma extension NV_gpu_inline_asm
// hypothetical extension: constraints follow CUDA inline asm ("f" = float, "l" = 64-bit address)
float atomicAdd(float *address, float data)
{
    float v;
    asm("ATOM.ADD.F32 %0, %1, %2;" : "=f"(v) : "l"(address), "f"(data));
    return v;
}
Some low-level optimized CUDA codes that come to mind are from NV research, such as cudaraster and the raytracing work, and also http://code.google.com/p/cudadma/. These examples show
that the possibility of making optimized ports of such codes depends on being able to access these low-level instructions, as you found with MGPUsort.

*Cuda dma:
this code uses generalized named barriers, which also let threads avoid blocking on one another across different barriers (SM20 and up), and, I think, counting of the threads that have passed and are waiting at
a barrier (syncthreads_count()). See bar in the PTX manual, bar.arrive and the example PTX code.
The project home page now also shows GK110 instruction usage:
"The new instances will make use of the ldg intrinsic for supporting many more outstanding loads in flight by issuing them through the texture cache"
*cudaraster/raytracing/voxelpipe NVR projects (please note that these are especially interesting because they relate to forward-looking graphics/compute integration; an implementation in OGL compute should allow tighter integration
and possibly further improvements to graphics, assuming hardware queues between stages someday get exposed in OpenGL):
here we find code like this:
__declspec(__device__) __inline U32   getLaneMaskLt           (void)                  { U32 r; asm("mov.u32 %0, %lanemask_lt;" : "=r"(r)); return r; }
-> the latest version, optimized for Kepler, also uses the new vector SIMD instructions (see the hpg2012 poster)
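To make clear what that special register holds, here is a minimal host-side C++ sketch (a reference model, not GPU code). It assumes a 32-lane warp; laneMaskLt and compactedRank are hypothetical names, and the ballot argument stands for a mask with one bit per lane that voted true. A common use, as far as I know, is computing each lane's output slot during warp-level compaction:

```cpp
#include <bitset>
#include <cstdint>

// Model of %lanemask_lt: one bit set for every lane whose index is
// lower than laneId. Assumes laneId is in the range 0..31.
std::uint32_t laneMaskLt(unsigned laneId) {
    return (std::uint32_t(1) << laneId) - 1u;
}

// Rank of a lane among the lanes that voted true:
// popc(ballot & %lanemask_lt) counts the voting lanes below this one.
unsigned compactedRank(std::uint32_t ballot, unsigned laneId) {
    const std::bitset<32> lower(ballot & laneMaskLt(laneId));
    return static_cast<unsigned>(lower.count());
}
```

For example, with ballot = 0b101101 lane 3 gets rank 2, because lanes 0 and 2 are the only voting lanes below it.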
they also seem to define prmt:
__declspec(__device__) __inline U32   prmt     (U32 a, U32 b, U32 c)   { U32 v; asm("prmt.b32 %0, %1, %2, %3;" : "=r"(v) : "r"(a), "r"(b), "r"(c)); return v; }
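For reference, this is my understanding of what prmt.b32 does in its default mode, written as a host-side C++ sketch (prmtRef is a hypothetical name, and the special modes such as .f4e or .rc8 are not modelled). The eight bytes of {b, a} form a pool, and each 4-bit field of c picks one pool byte for the corresponding result byte; if bit 3 of the field is set, the sign bit of that byte is replicated instead:

```cpp
#include <cstdint>

// Reference model of "prmt.b32 d, a, b, c" in default (generic) mode.
std::uint32_t prmtRef(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    // Pool bytes 0..3 come from a, bytes 4..7 come from b.
    const std::uint64_t pool = (std::uint64_t(b) << 32) | a;
    std::uint32_t d = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t sel = (c >> (4 * i)) & 0xFu;
        std::uint32_t byte =
            static_cast<std::uint32_t>(pool >> (8 * (sel & 7u))) & 0xFFu;
        if (sel & 8u) {
            // Bit 3 set: replicate the sign bit of the selected byte.
            byte = (byte & 0x80u) ? 0xFFu : 0x00u;
        }
        d |= byte << (8 * i);
    }
    return d;
}
```

With a = 0x33221100 and b = 0x77665544, the selector 0x0123 reverses the bytes of a and gives 0x00112233.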
__declspec(__device__) __inline U32   add_cc                  (U32 a, U32 b)          { U32 v; asm("add.cc.u32 %0, %1, %2;" : "=r"(v) : "r"(a), "r"(b)); return v; }
add with carry, for bignum computations. One thing I am sure of is that addc is not exposed in the variant where the carry is stored in the special carry flag bit (CC.CF);
I am not sure whether graphics shaders currently expose it some other way (I think yes, but the carry must be saved as an integer and then added, so that sequence is not as efficient for a 128-bit addition as this one):
add.cc.u32 x1,y1,z1; // extended-precision addition of
addc.cc.u32 x2,y2,z2; // two 128-bit values
addc.cc.u32 x3,y3,z3;
addc.u32 x4,y4,z4;
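The carry chain above computes x = y + z on four 32-bit limbs. A minimal host-side C++ sketch of the same arithmetic, with the carry kept in an ordinary integer (the less efficient route mentioned above when no carry flag is exposed), could look like this. U128 and add128 are hypothetical names, and limb 0 is assumed to be the least significant:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Four 32-bit limbs, limb 0 least significant (assumption).
using U128 = std::array<std::uint32_t, 4>;

// 128-bit addition with an explicit carry variable. Each iteration
// corresponds to one add.cc / addc.cc / addc step of the PTX sequence.
U128 add128(const U128& y, const U128& z) {
    U128 x{};
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::uint64_t sum = std::uint64_t(y[i]) + z[i] + carry;
        x[i] = static_cast<std::uint32_t>(sum);         // low 32 bits
        carry = static_cast<std::uint32_t>(sum >> 32);  // 0 or 1
    }
    return x;  // the carry out of the top limb is dropped, as with addc.u32
}
```

The point of the PTX version is that the carry never leaves the flag bit, so the whole addition is four instructions.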

Another interesting feature used in some papers is bypassing or forcing caching per load/store instruction (see 8.7.6.1 Cache Operators in the PTX manual); this is also found in cudaraster:
__declspec(__device__) __inline uint4 cachedLoad              (const uint4* p)        { uint4 v; asm("ld.global.ca.v4.u32 {%0, %1, %2, %3}, [%4];" : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w) : "l"(p)); return v; }
__declspec(__device__) __inline void cachedStore             (U32* p, U32 v)         { asm("st.global.wb.u32 [%0], %1;" :: "l"(p), "r"(v)); }
__declspec(__device__) __inline U32   uncachedLoad            (const U32* p)          { U32 v; asm("ld.global.cg.u32 %0, [%1];" : "=r"(v) : "l"(p)); return v; }
__declspec(__device__) __inline void uncachedStore           (uint4* p, uint4 v)     { asm("st.global.cg.v4.u32 [%0], {%1, %2, %3, %4};" :: "l"(p), "r"(v.x), "r"(v.y), "r"(v.z), "r"(v.w)); }
Finally, some codes use the %clock register and its %clock64 variant for true kernel cycle counting.
We only need to look at the PTX manual; as said, for me the interesting missing instructions are (sorry if something is already present):

mul24 (especially useful for thread index computation), sad (for video operations), bfind, popc and popcll (as with clock and others, there is a 64-bit instruction here, which allows up to 2x faster operation).
The last one has also been shown to be useful in computer vision (object recognition): http://nvidia.fullviewmedia.com/gtc2010/0923-c-2209.html (sadly no slides, only video),
plus the new Kepler shfl instructions, which NV said improved the performance of the galaxy code shown at GTC,
and special registers:
%lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt
%clock, %clock64
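In case some of these are unfamiliar, here are host-side C++ reference models of four of the instructions listed above, following my reading of the PTX manual. The function names are hypothetical, only the unsigned 32-bit variants are shown, and the GPU versions are of course single instructions rather than loops:

```cpp
#include <cstdint>

// mul24.lo.u32: multiply the low 24 bits of each operand and keep the
// low 32 bits of the 48-bit product.
std::uint32_t mul24Lo(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t p = std::uint64_t(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
    return static_cast<std::uint32_t>(p);
}

// sad.u32: sum of absolute difference, d = c + |a - b|.
std::uint32_t sadU32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return c + (a > b ? a - b : b - a);
}

// popc.b32: number of set bits.
std::uint32_t popc(std::uint32_t x) {
    std::uint32_t n = 0;
    for (; x != 0; x &= x - 1) ++n;  // clears the lowest set bit each pass
    return n;
}

// bfind.u32: position of the most significant set bit,
// or 0xFFFFFFFF when no bit is set.
std::uint32_t bfind(std::uint32_t x) {
    if (x == 0) return 0xFFFFFFFFu;
    std::uint32_t pos = 0;
    while (x >>= 1) ++pos;
    return pos;
}
```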
I'm not even asking about more general kernel features such as dynamic malloc, printf support, true function calls in kernels, etc.
Function calls and recursion are actually a wonderful example of how simple it is to
add support for things by adding only assembly language extensions and using asm blocks, without changing the GLSL parser:
asm(
".func foo { call foo; .. ret; } … call foo;")
and not even about advanced CUDA runtime compute features such as:
*tuning L1 cache size vs shared memory size (16/48, 32/32, 48/16 currently),
*how to expose concurrent kernels (really useful on sm_35) to OpenGL without a stream model,
*host pinned memory (using CPU memory from a kernel on the fly, without transfers), which should enable out-of-core compute algorithms
(this is also interesting because AMD has an extension for it, GL_AMD_pinned_memory, which works even on discrete GPUs; searching a while ago
I found NV_sysmem_buffer from NV, but after hacking some code to use it, it seems non-functional),
*P2P functionality for accessing memory on other GPUs from kernels, etc.
(note that this is another interesting one, as AMD currently exposes it in the OGL world as GL_AMD_bus_addressable_memory, although it seems to be FirePro only, so it should
be possible once GL 4.3 drivers ship; they are also enabling it in the OCL world this month with CL_AMD_bus_....; they call it DirectGMA).
Also missing versus CUDA/OpenCL, though this should be addressed by the OpenGL ARB as it is not NV specific,
is a way to set the work group size programmatically, for example by no longer requiring the local_size_x values to be constant (via a new extension like GL_NV_uniform_local_size):

uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;

or, better, a new function
DispatchComputeSetLocalSize(globx, globy, globz, locx, locy, locz)
equivalent to what CUDA supports.
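For comparison, when the local size is chosen at run time the application has to derive the number of work groups itself. A minimal C++ sketch, assuming the glob* arguments above mean total invocation counts per dimension (OpenCL style) rather than group counts; Dim3 and groupCount are hypothetical names:

```cpp
#include <cstdint>

struct Dim3 {
    std::uint32_t x, y, z;
};

// Number of work groups needed to cover `global` invocations with groups
// of `local` invocations, rounding up in each dimension.
// Assumes every component of `local` is non-zero.
Dim3 groupCount(Dim3 global, Dim3 local) {
    auto ceilDiv = [](std::uint32_t n, std::uint32_t d) {
        return n / d + (n % d != 0 ? 1u : 0u);  // avoids overflow of n + d - 1
    };
    return Dim3{ceilDiv(global.x, local.x),
                ceilDiv(global.y, local.y),
                ceilDiv(global.z, local.z)};
}
```

So 1000 invocations with a local size of 256 need 4 groups, and the kernel has to bounds-check the last, partially filled group.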

and finally, something also needed for serious compute of the kind seen in HPC programs: unroll hints that tell the compiler to unroll selected loops
#pragma unroll
--END of answer--

This shows that in some respects the OGL compute support exposed in 4.3 is really very simple, somewhat like CUDA 1.0 and even worse in some respects, especially with regard to exposing the ISA of current NV compute cores.
For example, some CUDA codes use %clock to measure the cycle count of their kernels. That supposedly shipped in CUDA 1.0, although it required PTX code. Today we can use asm() blocks to define a clock function in CUDA C code and use it much as in the CPU world, where asm() blocks wrap the TSC counter.


Monday, 10 December 2012
