What I would want to know and get from vendors part II: Nvidia ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

1. Access to the WGL_DX_interop OpenGL extension documentation and headers: this extension has been shipping in NVIDIA drivers since late August and is very powerful, as it provides a fast path between OpenGL and DirectX (interop), so resources from one API can be seen by the other with no host interaction (I mean none of the transfers through the host that I currently need).. It was discussed at GTC 09 but the spec was not released..
Also, can you give an expected date for when it will be available to Windows Vista/7 users, since it's currently only available on XP?
You may wonder what that could add to the mix; at least two or three things:
*Access to GPU video decode (DXVA), feeding that to OpenCL with the lowest overhead..
Nvidia can tell us they have CUVID with OpenGL support, but assuming ATI someday supports a similar extension, we could have a cross-vendor code path via DXVA..
*Access to the current state-of-the-art efficient fluid rendering, which currently ships as a library in the PhysX Screen Saver source code..
It accepts only a Direct3D interface, so using it from OpenGL needs some interop if it is to be done efficiently..
*Access to Direct3D 11 functionality like tessellation from OpenGL.. exchanging tessellated geometry with OpenGL, all in GPU mem..

2. Access to the NVAPI NDA SDK: this could enable a killer feature.. I think that since the 195 Nvidia drivers, NVAPI has been able to report GPU load, memory bus load and video decoding unit load (think Blu-ray GPU decode).. (only GT200 and higher and 190xx)
This is used in GPU-z 0.38, so at least some developer has access to this functionality.. I think GPU-z uses NVAPI..
The public NVAPI doesn't expose this..
I think this API also allows access to 3D Vision internals.. (see below..)
I have tried to get access, but you need to be in the Nvidia Registered Developer program.. I have tried many times to sign up but got no response.. Membership presumably also gives access to the latest driver builds..

3. Access to the Nexus GPU debugger beta -> released
(GPGPU work would be a lot easier with a GPU debugger.. Nexus was scheduled for a beta release in October.. I signed up for the beta program but got no response, other than a message in late October saying we would get the beta build in two weeks..)

4. CUVENC lib headers and documentation, for GPU video encoding.. Alongside the CUVID library, Nvidia ships the CUVENC library in its standard drivers for accessing GPU hardware encoding, and it's used by a lot of commercial video encoders with CUDA support.. in fact all of them are using this library..
The problem is that I think it is only exposed to partners.. it's not public.. I think that with Windows 7 we now have the Windows MFT library for accessing GPU video encoding; I have to test it..

5. Access to documentation about the Fermi OpenGL extensions for Direct3D 11-like features: there is some info in the GTC presentation, but still no headers or anything else for working on "it" for real..

6. Access to the 3D Vision internal APIs, which is what the Avatar game is getting, i.e. ways of sending a frame to each eye, bypassing the Nvidia 3D driver..

more or less the same:

About Nvidia source code
========================

1. An OpenCL port of the DirectCompute Ocean demo source code? It was shown in the OpenCL tutorial at GTC09..
Since Nvidia ships the DirectCompute Ocean demo source code, I hope the Nvidia Ocean OpenCL demo is going to ship soon in the GPU Computing SDK..
Can someone confirm that and provide us with the code in the meantime?
I would love to learn the differences between DirectCompute and OpenCL from another perspective, i.e. seeing such complex code (it has high-perf FFTs in it) side by side, as
I want to make some common wrapper around DirectCompute and/or OpenCL and/or CUDA..

2. Physics demos using GPU compute APIs, either using GPU-enabled Bullet code as a base (the rigid body work by Harada) and/or using PhysX fluids, though efficient fluid rendering is
complex to code..
I have seen the Nvidia fluid demo (OpenGL) use this technique:

"Screen Space Fluid Rendering with Curvature Flow"
Wladimir J. van der Laan, Simon Green, Miguel Sainz
Some of the authors are Nvidia guys..

It also seems "PhysX Screen Saver" uses it (DirectX).
The code is available at http://files.thegamecreators.com/darkphysics/ScreenSaversource.zip
but the fluid rendering functionality is a DirectX-based compiled lib:
dxFluidRenderLib.lib
dxFluidRenderer.h
As I want multi-OS support, I would love either the source code of that library, so I can modify it for OpenGL usage, or compiled OpenGL-based libraries for
Win/Lin/Mac..

3. Massimiliano Fatica of Nvidia did a port of Linpack that uses both CPU and GPU, load balancing between them..
"Accelerating Linpack with CUDA on heterogeneous clusters"
In the CUDA forums it was said that it is distributed to universities.. can I get it?

About kernel binaries:
=======================
I think this is the most ridiculous question, but anyway: for CUDA and OpenCL we can store "compiled" kernels as PTX and launch kernels from that code..
I know that PTX is a virtual ISA, so it allows you to target multiple architectures. My question is whether the PTX generated today by nvcc or by the OpenCL built-in compiler
is mature enough, or whether, say a year from now, a new OpenCL built-in compiler or a new, say, CUDA 4.0 nvcc could generate PTX that in turn provides better performance..
I hope PTX generated now achieves the same performance as if we compiled the kernel to PTX next year..
i.e. that all optimizations can be extracted from the PTX code..
If not, I will have to supply kernel source files, at least for OpenCL, and compile on the fly..
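
The fallback I have in mind could look roughly like the sketch below: reuse a stored kernel only when it was produced for the same device by the same compiler version, otherwise rebuild from source. This is only a host-side policy sketch; the `CachedKernel` record, the `KernelOrigin` enum and `choose_kernel_origin` are names I made up for illustration and are not part of CUDA or OpenCL.

```cpp
#include <string>

// Hypothetical record stored next to a cached kernel (PTX text or binary blob).
struct CachedKernel {
    std::string device_name;       // device the kernel was built for
    std::string compiler_version;  // driver / nvcc version that produced it
    std::string binary;            // the cached PTX or binary itself
};

enum class KernelOrigin { CachedBinary, RebuiltFromSource };

// Reuse the cache only if it is non-empty and matches both the device and
// the compiler version; in every other case fall back to compiling the
// kernel source on the fly so a newer compiler gets a chance to optimize.
KernelOrigin choose_kernel_origin(const CachedKernel& cache,
                                  const std::string& device_name,
                                  const std::string& compiler_version) {
    const bool usable = !cache.binary.empty() &&
                        cache.device_name == device_name &&
                        cache.compiler_version == compiler_version;
    return usable ? KernelOrigin::CachedBinary
                  : KernelOrigin::RebuiltFromSource;
}
```

Keying on the compiler version is the conservative choice: it gives up the cache after every driver or toolkit update, which is exactly the case where newly generated PTX might be faster.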

Also, compiling CUDA 2.3 kernels we get PTX 1.4, OpenCL-generated PTX is v1.5, and in CUDA 3.0 beta (at least for the Fermi target) it seems we get PTX 2.0..
In the SDK we get the v1.4 doc; the current CUDA 3.0 SDK beta 1 provides no PTX 1.5 or PTX 2.0 info..
Can we get access to the documentation for these new PTX specs?..

About CUDA 3.0:
===============
It would be good to have a module that can report the issue rate and latency of specific instructions, similar to GPUbench
http://graphics.stanford.edu/projects/gpubench/test_instrissue.html
Well, the problem lies in the fact that there are currently some PTX instructions that aren't visible from CUDA C..
This guy, for example, exposes the native addc instruction:
__addc / __uaddc: signed and unsigned addition-with-carry. Carry flag after addition is set automatically.
http://www.mpi-inf.mpg.de/~emeliyan/cuda-compiler/
You can find a paper where he motivates this effort with speedups in some integer-heavy scientific codes;
see "Efficient Multiplication of Polynomials on Graphics Hardware"..
He provides a diff against the CUDA Open64 sources (2.2, I think) and also new headers..
Can this support be added, so that we can perhaps measure the issue rate of these instructions?
If not, I could manually compile the patched sources for every architecture of our benchmark (Win, Lin, Mac) (x64 and x32), but I don't think I will..
I say that because some integer multiprecision libraries have a similar problem (it's impossible to access the add-with-carry op from C without
adding assembly code..)
Now a mix of some previous questions:
Is it possible to access native add-with-carry on Nvidia GPUs in OpenCL?
I think the answer is no, and I believe that could be fixed if there were interop between OpenCL and CUDA-generated PTX code.. With the addc-enabled CUDA
compiler I would compile an addc function and call that from OpenCL..
Anyway, having the PTX 1.5 spec documentation would also help in finding out how to patch OpenCL-generated PTX code to use it..
Yeah, I know that none of this is supported by the OpenCL spec.. but it's worth investigating anyway..
(I would also love to ask AMD engineers about enabling add-with-carry, if it exists in r8xx, via AMD IL generated code..)
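
To make the multiprecision problem concrete, here is a minimal sketch of what portable code has to do when it can't reach the hardware carry flag: it recovers the carry from unsigned wrap-around with extra comparisons on every word. The function name `add_multiword` and the little-endian word layout are my own assumptions for illustration; this is plain host C++, not the patched compiler's `__addc` / `__uaddc` intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two multiword unsigned integers stored least-significant word first.
// Both inputs are assumed to have the same number of 32-bit words.
// Writes the sum into `sum` and returns the final carry out (0 or 1).
std::uint32_t add_multiword(const std::vector<std::uint32_t>& a,
                            const std::vector<std::uint32_t>& b,
                            std::vector<std::uint32_t>& sum) {
    assert(a.size() == b.size());
    sum.resize(a.size());
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::uint32_t partial = a[i] + b[i];     // wraps modulo 2^32
        const std::uint32_t carry1 = partial < a[i];   // wrapped => carry out
        const std::uint32_t total = partial + carry;   // add incoming carry
        const std::uint32_t carry2 = total < partial;  // wrapped again?
        sum[i] = total;
        carry = carry1 | carry2;  // the two wraps can never both happen
    }
    return carry;
}
```

With a native add-with-carry the two comparisons and the OR per word would disappear, which is the saving the patched compiler is after.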

I see that CUDA 3.0 has surface instructions (cusurf)..
This is Fermi stuff, correct?
It seems that these instructions allow "true" writable textures (I mean without having to use the CUDA 2.2 "texture from pitch linear mem" functionality)..
and so give (x,y) addressing for writes (so it's equal in concept to DirectCompute's RWTexture2D?) and presumably format conversion on read/write(?)..

My only objection is that I can't find 3D surfaces in the headers, but I hope 3D surfaces are similarly supported in Fermi hardware, given RWTexture3D in D3D 11, so
I can expect to have 3D surface functions in CUDA 3.0 with Fermi (i.e. I want "true" writable 3D textures..).. I want to use that for 3D stencil codes..
For GPUs without this support I can use the 3dfd code from the Nvidia GPU Computing SDK, which I think is based on:
"3D finite difference computation on GPUs using CUDA"
by P. Micikevicius - 2009
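
For readers who haven't met 3D stencil codes, this is the access pattern I'm talking about, as a minimal CPU reference sketch: every interior cell reads its six axis neighbours plus itself and writes one value at (x,y,z). It is a simplified 7-point averaging stencil with made-up weights, not the SDK's 3dfd sample and not the scheme from the paper; `idx` and `stencil7` are names I chose for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattened index into an nx * ny * nz grid, with x varying fastest.
inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t nx, std::size_t ny) {
    return (z * ny + y) * nx + x;
}

// One sweep of a 7-point stencil: each interior cell becomes the average
// of itself and its six axis-aligned neighbours. Boundary cells are
// copied through unchanged.
std::vector<float> stencil7(const std::vector<float>& in,
                            std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(in.size() == nx * ny * nz);
    std::vector<float> out(in);  // boundary cells keep their input values
    for (std::size_t z = 1; z + 1 < nz; ++z) {
        for (std::size_t y = 1; y + 1 < ny; ++y) {
            for (std::size_t x = 1; x + 1 < nx; ++x) {
                const float sum =
                    in[idx(x, y, z, nx, ny)] +
                    in[idx(x - 1, y, z, nx, ny)] + in[idx(x + 1, y, z, nx, ny)] +
                    in[idx(x, y - 1, z, nx, ny)] + in[idx(x, y + 1, z, nx, ny)] +
                    in[idx(x, y, z - 1, nx, ny)] + in[idx(x, y, z + 1, nx, ny)];
                out[idx(x, y, z, nx, ny)] = sum / 7.0f;
            }
        }
    }
    return out;
}
```

On a GPU the seven reads are what a cached texture fetch would serve, and the single write per cell is what a writable 3D surface would take.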

About CUDA multicore:
I know that Nvidia is still working hard on it because:
1. http://llvm.org/devmtg/2009-10/Grover_PLANG.pdf
"PLANG: Translating NVIDIA PTX language to LLVM IR Machine"
2. I have seen some strings related to multicore-llvm in the CUDA 3.0 beta nvcc binary
It seems you have switched from the idea of:
"MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs"
to a hopefully better one, i.e. translating from PTX to LLVM and then using
LLVM's efficient backends for x86..
The question is whether that is going to be available soon enough.
This would allow me to compare perf in this mode against CUDA code ported to OpenCL and run on AMD's OpenCL CPU backend..
or against having to write efficient CPU codes..
I'm thinking of sort tests as examples:
Currently the fastest on GPU seems to be:
"Designing efficient sorting algorithms for manycore GPUs"
Nadathur Satish, Mark Harris, and Michael Garland
The code is already in CUDPP 1.1 and in a CUDA SDK sample..
The problem is that the most efficient one on CPU seems to be "Efficient implementation of sorting on multi-core SIMD CPU architecture",
and care has to be taken to write SIMD-enabled code..

DirectCompute questions:
=======================
As far as I know there are basically two models, 10_x and 11_0, i.e. DirectCompute 4.x and 5.0..
Well, the problem lies in the fact that CUDA has no restrictions on writing to shared mem, and the codes I plan on using presumably use atomics on global mem (CUDA GPUs except G80 support them.. some recent codes use them..(?)),
so this code isn't going to translate automatically to DirectCompute 4.x..
This is no problem for Fermi and AMD 5xxx GPUs, but as I think DirectCompute 4.x takes the "greatest common divisor" of CUDA cards and ATI 4xxx,
CUDA cards are at a great disadvantage.. so my question is whether Nvidia can and wants to fix this issue..
Think of something similar to what I remember reading: Nvidia enabled a D3D 10.1 feature in some driver for Far Cry 2 (related to multisample)?
I mean at least allowing kernels to be compiled to the cs_5_0 target on GT200 cards, for example..
I know some features of this target aren't available, as the shared mem size on GT200 cards, for example, is below what is required, but I mean that if a kernel stays within GT200 hardware restrictions
(for example shared mem usage below 16K) and requires only features and hardware resources available on CUDA cards, this could be enabled..
This could be an NDA feature(?), for example enabling a cs_5_0_gt200 target (is it possible?)
Also a similar hack to enable doubles on GT200 via DirectCompute..

OpenCL:
======
Well, to be frank, I can't find any issue worth mentioning in the 195 drivers except:

1. I'm not happy with the OpenCL Volume3D demo: on Windows XP it goes nearly as fast as the CUDA one.. In fact I get a sustained 60 fps in CUDA vs 40-60 fps in OpenCL
with an 8600 GTS.. But note that the same OpenCL Volume3D demo runs at a mediocre 14 fps on a high-end desktop with a GTX 275 on Win7..
while the CUDA demo runs at 60 fps.. all with recent 195.55 OpenCL drivers..
I think Linux OpenCL doesn't suffer either..
So it seems the CUDA 3D texture support is good whatever the OS, but OpenCL image support for 3D textures has perf issues on Vista/7 systems..
Can anyone confirm whether this is going to be fixed soon or is already fixed?
It doesn't seem right to blame WDDM, as CUDA seems not to be affected..
I say this because I would love to code some volumetric rendering, perhaps with 3D Vision as a built-in optional feature, and it seems that code would suffer with an OpenCL backend..

2. I'm waiting for cl_khr_3d_image_writes..
I think this is similar in concept to RWTexture3D, correct (i.e. (x,y,z) addressing etc..)?
But I think there is going to be hardware support for it only in Fermi and higher, correct?
Assuming that this is Fermi stuff, will it be available, say, in the Fermi launch drivers, or is it already supported in 195.62 if we have a Fermi, or is there no
specific time?
I think this allows high-perf implementations of 3D stencil codes on D3D 11 architectures, as the texture is written directly using coordinates and reads
get cached, and this is an advantage at least for architectures without a global cache (AMD 5xxx cards).
Of course I'm aware of alternative techniques: caching neighborhood values in shared mem and calculating the stencil from those values..

3. Could you say at least whether there is any way (hack) to access host mem from Nvidia GPUs in the OpenCL backend (pinned system mem in CUDA parlance)?
I have no problem even if it means playing with PTX code..
I ask because I want to run kernels over big problems, and
that would perhaps take long enough that a progress bar would be welcome.. I know of the watchdog timer issue for kernels running for more
than x seconds, and I think Nvidia also recommends dividing the kernel to solve this issue..
Yeah, I know Nvidia doesn't recommend that..
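
The "divide the kernel" workaround can be sketched on the host side as below: the index range is cut into slices, each slice is launched as its own short kernel, and progress is reported between launches. This is a plain C++ sketch with no real OpenCL or CUDA calls; `run_in_slices`, `launch_slice` and `report_progress` are hypothetical names, and `launch_slice` stands in for enqueueing one kernel over the given offset and count and waiting for it to finish.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Processes the index range [0, total) in slices of at most `chunk` items.
// After each slice, `report_progress` receives the completed fraction.
// Returns the number of slices that were launched.
std::size_t run_in_slices(
    std::size_t total, std::size_t chunk,
    const std::function<void(std::size_t, std::size_t)>& launch_slice,
    const std::function<void(double)>& report_progress) {
    assert(chunk > 0);
    std::size_t slices = 0;
    for (std::size_t offset = 0; offset < total; offset += chunk) {
        const std::size_t count = std::min(chunk, total - offset);
        launch_slice(offset, count);  // one short kernel per slice
        ++slices;
        report_progress(static_cast<double>(offset + count) /
                        static_cast<double>(total));
    }
    return slices;
}
```

The slice size would have to be tuned so that each launch stays well under the watchdog limit; the price is extra launch overhead, and progress only at slice granularity, which is why being able to write progress to host mem from inside one long kernel would be nicer.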

Better would be some roadmap for an extension supporting this feature (by the way, it's also supported by the hardware on AMD cards)

Mac stuff
=========

CUDA Mac:
Of course I plan to run OpenCL-OpenGL interop-enabled codes and CUDA-OpenGL ones, and of course the perf of the CUDA benchmark on the Mac will suffer, as the interop still goes through the host..
Can we expect this to be fixed sometime, say before April-May 2010?

OpenCL Mac:
Can someone confirm whether the double extension is going to be available, say in April-May 2010 (10.6.3-4?), on Nvidia GPUs (GT200, for example), as it is with the 195 Windows and Linux drivers?

OpenGL Mac:
Sorry if I'm ignorant in this matter..
but what's the reason Nvidia's Mac drivers are still OpenGL 2.1 drivers
(yeah, with some 3.x stuff)? I remember seeing an Nvision08 presentation by an Nvidia OpenGL guy saying all the OpenGL 3.0 stuff was coming to Mac at that
time..
Nvidia ships "custom" drivers for the GTX 285 Mac Edition on its driver download page, so why can't they ship custom drivers, if not with 3.x support, at least with all the ARB extensions equivalent to 3.0, 3.1 and 3.2, and possibly other Nvidia extensions?

OptiX and PhysX for Mac?
Assuming we want to port some simple GPU raytracing and GPU physics code, can we have it working on Mac? Neither the OptiX nor the PhysX libraries are available for
Mac currently..


Sunday, 13 December 2009
