2010 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 10 July 2010

Some news!

Posted on 11:16 by Unknown

News:
*Source code for GPU Computing Gems volume 1 (a.k.a. GPU Gems 4) is already available on gpucomputing.net.
The book is due in November.
Available right now:

Title
A Programmable Graphics Pipeline in CUDA for Order Independent Transparency	1 new	07-10-2010
High Performance Iterated Function Systems	0 new	07-02-2010
CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm	0 new	07-01-2010
Connected Component Labeling in CUDA - demo+code	0 new	06-30-2010
A Practical Guide to Massively Parallel Monte Carlo Simulations: The Ising Model	0 new	06-30-2010
Parallel LDPC Decoding using CUDA	0 new	06-30-2010
Path Regeneration for Random Walks	0 new	06-30-2010
GPU Gems 4: Deformable Volumetric Registration using B-splines Source Code	0 new	06-30-2010
Monte Carlo Photon Transport on the GPU	0 new	06-30-2010
Lattice-Boltzmann Lighting Models - Source Code	0 new	06-30-2010
RNA folding GPU	0 new	06-30-2010
Haar Classifiers for Object Detection with CUDA: Pixel-parallel processing kernel	0 new	06-29-2010
Multiclass Support Vector Machine	0 new	06-29-2010
Parallelization of the x264 encoder using OpenCL	0 new	06-21-2010
Cone-Beam CT image reconstruction using the Katsevich Algorithm	0 new	06-21-2010
Line forward projection on CUDA	0 new	06-11-2010

It seems MareNostrum is getting a rack of Fermis, perhaps paired with IBM Power7.

So would Nvidia now have to publish a CUDA driver for the PowerPC architecture?

Or they could use PathScale with its fully open source compute stack,
available here as a branch of nouveau:

http://github.com/pathscale/pscnv/commits/master

It seems an Nvidia TCC driver supporting Fermi, version 197.81, is on the IBM web site.

A Catalyst 10.8 beta seems to be available; 10.7 is coming on 21/7.

PhysX 3.0 is coming with CPU improvements:
*automatic threading
*SSE enabled by default
Mafia ships a new NVIDIA PhysX runtime driver: 10.04.02_9.10.0522.
Mueller has posted the paper on the Fermi launch demo, which uses water height fields plus particles.
Two other interesting papers from Nvidia Research are:

HLBVH: Hierarchical LBVH Construction for Real-Time Ray Tracing
PantaRay: Fast Ray-traced Occlusion Caching of Massive Scenes

A Stanford course based on Hwu's material:
http://code.google.com/p/stanford-cs193g-sp2010/wiki/ClassSchedule

Two interesting conference programs are available:

PACT
has an Intel GPU paper, demystifying ..
and also Revisiting Sorting for GPGPU Stream Architectures,
which achieves nearly 500 Mkeys/s on GT200.

There is also a workshop on GPUs,
http://informatik.technikum-wien.at/gpusca/
but the web site doesn't work.

The Nineteenth International Conference on
Parallel Architectures and Compilation Techniques (PACT)
Vienna, Austria, September 11-15, 2010

Interesting papers:
Scalable Thread Scheduling and Global Power Management for Heterogeneous Many-Core Architectures
Dynamically Managed Multithreaded Reconfigurable Architectures for Chip Multiprocessors
WAYPOINT: Scaling Coherence to Thousand-core Architectures
Scalable Hardware Support for Conditional Parallelization
Less is More: Trading off Work-Efficiency for Scalability in Irregular Programs
Revisiting Sorting for GPGPU Stream Architectures
D. Merrill, A. Grimshaw
An Integer Programming Framework for Optimizing Shared Memory Use on GPUs
W. Ma, G. Agrawal
DMATiler: Revisiting Loop Tiling for Direct Memory Access
A Software-SVM-based Transactional Memory for Multicore Accelerator Architectures with Local Memory
Automatic Vector Instruction Selection for Dynamic Compilation
An OpenCL Framework for Heterogeneous Multicores with Local Memory

SC10

I would like to review these papers:
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
Parallel Fast Gauss Transform
Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers
The Multi-Scale Heart Simulation on Massively Parallel Computers
Using 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
Exploiting 162-Nanosecond End-to-End Communication Latency on Anton
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories
Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Scalable Graph Exploration on Multicore Processors
The 48-core SCC processor: the programmer’s view
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture
Reducing Multicore Bandwidth Requirements for Combinatorial Multigrid
Diagnosis, Tuning and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Scaling Hierarchical N-Body Simulations on GPU Clusters
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

Posted in | No comments

Sunday, 4 July 2010

DirectCompute Double precision Mandelbrot demo and more..

Posted on 19:41 by Unknown

Following the first demo using double precision on GL 4.0, here is one on DirectCompute:
THIS DEMO NEEDS THE DX JUNE 2010 RUNTIMES,
so update if needed.

On AMD this test exposes an ATI DirectCompute DPFP bug: the rendering is incorrect.

Also note that I learned DirectCompute doesn't accept division on doubles, so I had to replace /2 with *0.5.
Nvidia Fermi works OK!
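A minimal sketch in plain C (not the HLSL from the demo) of why swapping /2 for *0.5 is safe: dividing by a power of two and multiplying by its exactly representable reciprocal give bit-identical IEEE 754 results. The function names are mine, for illustration only.

```c
/* Halve by division: the form the shader compiler rejected for doubles. */
double halve_div(double x) { return x / 2.0; }

/* Halve by multiplication: 0.5 is exactly representable, so the exact
   mathematical result is the same and so is its rounding. */
double halve_mul(double x) { return x * 0.5; }

/* Returns 1 when both forms agree on x. */
int halving_agrees(double x) { return halve_div(x) == halve_mul(x); }
```

The same trick is not exact for divisors such as 3.0, whose reciprocal is not representable, so x * (1.0 / 3.0) can differ from x / 3.0 in the last bit.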
DirectCompute Double precision Mandelbrot (includes source, based almost 100% on the Voxilla demo):

Use test.bat: the app starts at a deep zoom so it shows DP in action. If you exit with Esc it then shows the same rendering in SPFP. Note that you can zoom in and out with the mouse.

The bat file calls mandel.exe 0 for SP or mandel.exe 1 for DP.

Also note that I expected better performance on AMD than on Nvidia, but both run very slowly. Nvidia runs at its full speed (i.e. capped 8x vs Teslas), but AMD has performance issues, as it should run at least 3-4x faster than the Nvidia Fermi.
On AMD the vector mode also runs somewhat faster than the scalar shader, but not by much: it could run up to 4x faster if the compiler extracted no parallelism from the scalar code, yet the gain is smaller than in SP, where vector code outperforms scalar code by a larger margin. Fermi performance is unaffected by using vector code.

Correct behavior:

Double precision (on GTX 470)

Single precision

On an AMD 5850, DP renders as (I will post the image soon):

On a related note, I patched the Nvidia PhysX demo to work on AMD by changing the GLSL code that used non-standard Cg functions. It exhibits some OpenGL bugs.

Instructions:

Download the Nvidia PhysX demo here (select FLUIDS: TECHNOLOGY DEMO)

and use this executable to run it on AMD cards (extract it into the demo directory).
On an AMD card it shows artifacts, not in the rendering itself but on the desktop outside the program window.


A lot of things you probably don't know.. and worth it..

Posted on 12:06 by Unknown

*TCC support for GF100 products will be out next week. These drivers will also add support for running simultaneously with the normal graphics drivers (the ones that support OGL, DX, DXVA, etc.). I suspect the graphics and TCC drivers will have to be the same version, as both write DLLs into the Windows system directory.
I hope the inf trick still works so I can enable it on a Geforce Fermi, and that it works with Nsight too. Anyway, it is not critical, as the 25x drivers seem to add support for CUDA cards (even Geforces) without extending the desktop onto them, so kernel execution time need not be limited by TDR. Previously this required two Nvidia cards, one of which could be left without the desktop extended. If you used, say, an ATI card plus an Nvidia card without the desktop extended onto the Nvidia one, for example to use Nsight (which requires no extended desktop), it would fail because CUDA would not find a CUDA card.
*There is support for Fermi on Mac OS right now in the Nvidia 19.5.8f03 drivers, released a month ago but without any announcement; they include NVDAGF100HAL.kext.
Anyway, only OGL works, as neither CUDA nor OCL uses it.
I have to use the NVloader injector, which in any case doesn't work with Fermi in 64-bit kernel mode. Note that a GF 275 does work in 64-bit mode with this injector.
Note that I wanted to fix this, and all I found was a cuGetExportTable and something like MacCompatibiltyTID used by a checkcompatibility executable; perhaps patching that would make it work.
Someone in the Nvidia forums suggested the broken OCL could be fixed by creating an OGL context before searching for OCL devices (oclgetdevice), but this trick didn't work.
*Storing ELF binaries instead of CUBIN rules out the use of decuda; hopefully one very interesting solution is..

*Judging from the MAGMA webinar, there seems to be a big release planned for SC2010 with some big features; check the MAGMA presentation for what to expect.
*PhysX 3.0 is nearing launch, as the PhysX Visual Debugger release notes say it includes support for it.

Note that this brings concurrent kernel support on Fermi for improved performance in physics simulations. Hopefully it also includes the wrinkle meshes feature studied by Mueller.

Note also that the GPU AI notes say it will use function pointers once CUDA supports them, so expect a new release sometime that is optimized even more for Fermi.

It will probably be announced at Siggraph, even if it launches later.

I also hope to see APEX shipping for more than big AAA games, i.e. downloadable by everyone.

Lastly, I expect OptiX 2.0 and Cg 3.0 final at Siggraph. Let's also see whether OpenRL arrives in time with OpenCL support for GPUs, which would be interesting for ATI. Note also that Luxrender GPU 1.6 brings Stochastic Photon Mapping and uses OCL on ATI GPUs too.

*Nsight is also moving fast: from a beta in early June it is now at RC state. Launching at Siggraph?

*ATI doubles on DirectCompute are broken, although the feature flag is reported as supported.

Now we can test it with the June DX compiler; before that, the compiler was broken for doubles inside control flow (loops, if, etc.).
Compiling mostly works, but rendering shows issues compared with Fermi, which supports it nicely.
Download my code.. (coming soon..)
*The ATI GLSL driver is somewhat broken, at least for geometry shaders it seems. I fixed the Nvidia PhysX fluid demo to avoid Cg-specific code in its GLSL, plus another fix related to point rendering, and now it seems to work, but not without instabilities that show up as noise on screen, even outside the window it fills.
Download and test.. (coming soon..)
Also, the GLSL driver doesn't implement fetching integer textures with integer coordinates (texel2Dfetch( itex)).
*CUDA 3.1 ships with three interesting examples. One is oclTridiagonal, a fast tridiagonal solver, interesting for a cinematic DoF renderer as in Metro, using OCL/OGL.
Another is oclCopyComputeOverlap, which shows two things. First, concurrent copy and kernel execution is possible in OCL via command queues. Second, there is an issue in the 25x drivers that prevents full scaling: I think the expected result is 30% faster code, and I obtain 20% on 25x drivers, while on 197 drivers I obtain 30%.
Note that both the ATI platform and the Apple platform (even with Nvidia GPUs) exhibit no scaling, or even negative scaling (-15%).
The good news is that the issue is fixed in the 258.19 OCL 1.1 preview drivers, which report CUDA 3.2, so I get the 30% overlap back. Note that other 258 drivers don't fix it (they report the older CUDA 3.1 and OCL 1.0).
One more interesting thing: the dual DMA engine is supposed to work in OCL too, so the overlap would be 50%. It seems restricted to Tesla, but Nvidia has given fewer details about this than about the double precision capping on Geforce.
Luckily I have a trick for you: the 197.44 driver seems to support the dual DMA engine on Geforce Fermi too!
This is the OGL 4.0 driver, so all you lose compared with the current 256 drivers is the CUDA 3.1 features. On Linux, also use the OGL 4.0 driver from developer.nvidia.com and you get it too.
Note also that 197.75 etc. don't work; it only works with this one.
*So it seems the dual DMA engine is broken/disabled on Geforce Fermi for no reason other than an economic one.
*The CUDA simpleStreams sample seems to show broken streams on Fermi, but that is due to not sending enough work, a simple fix.
*The matmul by Lschien is one of the fastest for CUDA, but it currently fails on Fermi because it uses cubins obtained by modifying Tesla asm via decuda cudaasm. Thank god, the problem seems related to the volatile keyword not working correctly before CUDA 3.0. The author suggests a fix, assuming volatile now works, that uses CUDA variant 6. I have tested it and it works, so it's fixed: I obtain nearly 850 Gflops on a Fermi 470 at 1650 MHz.
*A lot of software has been updated to CUDA 3.x, even 3.1, right now: NPP 3.1, CULA 2.0, JACKET 1.4, OpenMM 2.0 in the Zephyr SVN, GROMACS 4.5 beta, GMAC, etc.

More news:

Nvidia has also released a lot of drivers on the 256 branch. Let's see the rough differences/progression:
197.44: first OGL 4.0 driver, and the only one supporting the dual DMA engine on Fermi boards that are not Tesla/Quadro; it also has no issues with single DMA
256: adds CUDA 3.1; currently all of them have issues with concurrent copy and kernel execution on Fermi, at least in OCL
257.15: Blu-ray 3D
257.19: Nsight June beta driver
257.21: WHQL (supports Nsight)
257.29: ION support, accelerated DXVA Flash on PCIe x1 devices
258.18: OCL 1.1 beta (reports CUDA 3.2!), fixes the oclCopyComputeOverlap issues (but single DMA on Fermi)
258.48: first to support Quadro Fermis

258.69: ships with 3D Vision Surround (Nvidia nTersect says YouTube 3D support is coming soon; I also hope they add windows DX 3D Vision support soon)

Some other striking news :-) are:
*OpenCurrent 1.1 ships with CUDA 3.0 and multi-GPU code.
Well, I have been testing it with CUDA 3.1, because I have Ubuntu 9.10 and GCC 4.4 works fine with CUDA 3.1 (so Ubuntu 10.04 is fine too). It has some issue related to CUDA now supporting true function calls; I think I must add static to a function, as the porting guide in the CUDA 3.1 release notes says. With CUDA 3.0, GCC 4.4 doesn't work, so I will have to check on Ubuntu 9.04 if I can't fix it.
*An OpenMP to CUDA compiler is available in Cetus 1.2.
*PGI 10.6 is available, with at least integer support in kernels and VS 2010 support.

I have tested GATLAS and it is good: at least 260 Gflops on a GTX 275. I tested it on Mac, so it works on at least Linux and Mac without much effort. The author says that with a 5870 and Stream 2.1 some image kernels achieve 1.3 Tflops, so similar to the CAL++ matmul, but in OpenCL! I have to test it, or modify the code(?), for double precision testing.

Some tricks and work to do:

RAW DATA:
I know it's lame, but at least you can emulate 3D image writes in CUDA with surfaces using PTX 3D tricks (will post later).
I have to put up a sample of CUVID on Mac.
simpleStreams in CUDA looks bad on Fermi; the forums say to increase the work to 500.
matmul: Chien says to add volatile and check (works!)
BSGP Fermi support: checking by mail with the author..
sparse matrix ATI code: test on Fermi..

See Fermi benchmarks:
Nvidia benchmarks in blog
OpenVIDIA benchmarks..
CULA blog
Jacket blog
some papers from the HPG 2010 presentations: Billeter (scattering) and McGuire (AOV)..
it also seems code for rasterization and colored stochastic shadow maps is coming soon..


Saturday, 3 July 2010

ATI Stream SDK roadmap

Posted on 06:40 by Unknown

I have found a roadmap of the ATI Stream SDK until the end of the year:

DISCLAIMER: It's on the Internet and I found it with some luck.. no NDA was broken

Let's talk about it..
Currently AMD OpenCL lacks or has problems with:
*OpenGL interop: image interop issues (for example, copying a buffer to an image doesn't work when the image is an acquired OpenGL texture)
*exposing images with other component counts (other than RGBA)
*DX interop
*exposing all graphics memory (currently 128-256 MB)
*Catalyst integration

Stream SDK 2.2 adds:
*OCL 1.1 (3-component vectors are part of it, and OCL 1.1 image support means multiple-component images (R, RG, RGB))
*DX10 interop (it seems only that, not DX9 or DX11 as Nvidia has)
*memory fences that don't generate unneeded barrier ISA instructions
*append buffers (and what about a GDS extension?)
*OCL 1.1 atomics seem to be nothing new? Offline compilation goes from preview to final, and DPFP adds FMA, as the other operations are supported already(?)
DPFP FMA should let peak-test kernels in benchmarks show high numbers, near 400-500 Gflop/s.
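As a back-of-the-envelope check of that 400-500 Gflop/s range, here is a small C sketch. The hardware figures in the comments (one DP FMA per 5-wide VLIW unit per clock, 320 units at 850 MHz for a 5870, 288 units at 725 MHz for a 5850) are my assumptions from public specs, not something stated in the roadmap.

```c
/* Peak GFLOP/s = units * clock (GHz) * flops issued per unit per clock.
   An FMA counts as 2 flops (one multiply plus one add). */
double peak_gflops(int units, double clock_ghz, int flops_per_unit_per_clock)
{
    return units * clock_ghz * flops_per_unit_per_clock;
}

/* Assumed figures, not from the roadmap: each 5-wide VLIW unit issues one
   double precision FMA per clock. */
double hd5870_dp_peak(void) { return peak_gflops(320, 0.850, 2); } /* 1600 / 5 units */
double hd5850_dp_peak(void) { return peak_gflops(288, 0.725, 2); } /* 1440 / 5 units */
```

Under those assumptions the two cards come out at about 544 and 418 Gflop/s, which is where the range above would come from.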

2.3 is a lot more interesting:
*In-process compilation of OpenCL kernels means no more shipping of separate LLVM compiler executables (llc, etc.), and hopefully means the compiler will be integrated into atiocl.dll so OpenCL can ship built into Catalyst 10.12.
*Library models
*C++ template support in kernels (I hope this means you can at least make kernel argument types depend on a template argument, for example to support double and float kernels with one source, similar to what CUDA supports)
*Adds DPFP trig routines (but it seems there is still no complete DPFP support, which is horrible given that Nvidia has been shipping it since October 2009 and AMD has been saying since the end of 2009 that support would come gradually)
The most interesting are the last three:
*FFT library: why not a BLAS library too? I suspect it is OCL based, as DirectCompute already has its own FFT library.
Is it also going to be part of ACML? Currently matmul in ACML-GPU is CAL based.
I expect it will be a binary-only library, and only for Win and Lin, so for Mac I hope we can somehow extract the OpenCL kernels, or create a wrapper around it and use Wine or something similar, to check that performance on Mac with AMD boards is correct.
*OpenPhysics: well, at least something to play with. I expect cloth, soft body and SPH particle support in OpenCL and/or DirectCompute. On the Bullet site there is a preliminary executable with a cloth demo, and an AMD employee discussing the state of soft body support (http://code.google.com/p/bullet/issues/detail?id=390#c3). It seems that since last week we also have DirectCompute and OpenCL code for both cloth and soft body in trunk.
Also, by September we will have DMM 2.0, as said at GDC, which brings some OpenCL love to this rigid body + fracture simulator.
*OpenDecode UVD: a CUVID/VDPAU-like library for AMD boards. Nvidia has put a lot of love into GPU video decoding and its interop with CUDA/OpenGL, with CUVID for Win and Mac and VDPAU for Linux.
Since the 256 drivers, VDPAU has efficient OpenGL and CUDA interop. CUVID has efficient CUDA interop by design, and fast OpenGL/DX interop on Windows. CUVID for Mac only seems good for feeding data to CUDA, as OpenGL interop on Mac is slow right now (and always has been).
I expect this to bring fast interop with OpenCL on Win and Lin, adding to DXVA's DX interop on Win and to AMD XvBA on Linux, whose VAAPI wrapper seems to provide fast OGL interop.
So Mac seems left out, but I hope the recent video acceleration API in 10.6.3 supports AMD 5xxx cards when they are released, and that VC-1 support is added alongside H.264. I think this API provides a fast path to OpenGL textures, so, since OpenCL/OpenGL interop is fast on Apple, it also provides OpenCL interop on that platform.
Another question is whether dual stream acceleration will be exposed and supported. On Nvidia I think DXVA, CUVID and VDPAU all expose it, with a GTX 470 at least.
Also related: Catalyst 10.7 has improved support for VLC 1.1.1 DXVA decoding on AMD cards, which I presume means a fast path for sending frames between GPU and CPU now works.
Remember also that last month Nvidia released an ION driver (257.29) improving DXVA performance on ION with PCIe x1, as Flash requires a GPU->CPU->GPU round trip.

What's left after OCL 1.1 and Stream SDK 2.3:
Well, I expect Global Data Share and shared register extensions, 3D image writes, truly complete DPFP support (cl_khr_fp64), complete BLAS and FFT libraries (like CUBLAS and CUFFT in CUDA), working pinned memory, an extension for host memory accessible from the GPU, gather4 instructions for image support in OpenCL, and working concurrent kernel execution and memory transfers (i.e. an overlap of >=20% in the oclCopyComputeOverlap example from CUDA 3.1).


Wednesday, 5 May 2010

About AMD OpenCL 2.1!

Posted on 11:57 by Unknown

AMD is progressing well, and we now have an OpenCL stack with a lot of features and optional extensions published, even AMD proprietary ones.
Regarding supported extensions:
*Image support: only on 5xxx GPUs. (I expected support on CPUs as well, as in Apple's CPU implementation? I don't expect it on 4xxx, but it should be possible, as CAL supports images/textures on 4xxx.)
Right now only RGBA formats: it supports only 10/11 formats, which are the mandatory ones (Nvidia has 7x). All are 4-channel RGBA, so some Nvidia examples won't work.
In 2.01 you can export or set GPU_IMAGES_SUPPORT to get it on 5xxx.
There is no support on the CPU either.
2.1 really does have 3D texture support (it didn't work with the 2.01 hack).
You can test the Nvidia OCL samples oclVolumeRender and oclsimpletexture3d if you change them to load the volume into a 4-channel texture.
Basically, in initCLvolume (or its equivalent in oclsimpletexture3d) change h_volume to use 4 channels:
        // Expand the 1-channel volume to RGBA; the data goes in the R channel.
        volume_format.image_channel_order = CL_RGBA;
        volume_format.image_channel_data_type = CL_UNORM_INT8;

        size_t voxels = volumeSize[0] * volumeSize[1] * volumeSize[2];
        // calloc zero-fills the unused G, B and A channels.
        uchar *h_volume2 = (uchar *)calloc(voxels, 4);
        for (size_t i = 0; i < voxels; i++)
            h_volume2[4 * i] = h_volume[i];
        // Row and slice pitch are in bytes, so both grow by a factor of 4.
        d_volumeArray = clCreateImage3D(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &volume_format,
                                        volumeSize[0], volumeSize[1], volumeSize[2],
                                        volumeSize[0] * 4, volumeSize[0] * volumeSize[1] * 4,
                                        h_volume2, &ciErrNum);
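The same expansion as a self-contained C helper, without the OpenCL calls, in case you want to check the layout on the CPU first. expand_r_to_rgba and its parameters are hypothetical names of mine, not part of the Nvidia sample.

```c
#include <stddef.h>
#include <stdlib.h>

/* Expand a tightly packed 1-channel 8-bit volume (w x h x d) into RGBA.
   The source value lands in R; G, B and A are zero.
   Writes the byte pitches an RGBA 3D image of this size would use.
   Returns NULL on allocation failure; the caller frees the result. */
unsigned char *expand_r_to_rgba(const unsigned char *src,
                                size_t w, size_t h, size_t d,
                                size_t *row_pitch, size_t *slice_pitch)
{
    size_t voxels = w * h * d;
    unsigned char *dst = calloc(voxels, 4);   /* zero-filled */
    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < voxels; i++)
        dst[4 * i] = src[i];

    *row_pitch = w * 4;         /* bytes per row */
    *slice_pitch = w * h * 4;   /* bytes per 2D slice */
    return dst;
}
```

The two pitches are the values passed as the row pitch and slice pitch arguments of clCreateImage3D in the snippet above.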

Also, a bug mentioned in the developer notes is that linear filtering doesn't work if the sampler is declared as a constant via
constant sampler_t volumeSampler = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP | CLK_FILTER_LINEAR;
(Also note that CUDA 3.0 final has bugs regarding linear filtering in the 3D texture samples, and that Nvidia and AMD OpenCL samples don't work on the other IHV's OCL because one needs constant or __const samplers and the other doesn't work with them; I don't remember which.)
So I had to comment out this sampler in the CL shader of the volume render sample (simpletex3d does the right thing) and set it by adding a kernel parameter:
__kernel void
d_render(__global uint *d_output,
         uint imageW, uint imageH,
         float density, float brightness,
         float transferOffset, float transferScale,
         __constant float* invViewMatrix
#ifdef IMAGE_SUPPORT
          ,__read_only image3d_t volume,
          __read_only image2d_t transferFunc,
        sampler_t volumeSampler

#endif
         )
Then you can add the following, taken from simpletex,
case 'f':
            linearFiltering = !linearFiltering;
            ciErrNum = clSetKernelArg(ckKernel, 10, sizeof(cl_sampler), linearFiltering ? &volumeSamplerLinear : &volumeSamplerNearest);
            shrLog("\nLinear Filtering Toggled %s...\n", linearFiltering ? "ON" : "OFF");
            oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
            break;
to the GL keyboard handler.
I also checked simultaneous image support and OpenGL interop, and it worked:
http://dl.dropbox.com/u/1416327/clinterop2.c
Leave USEGL undefined to check image support alone, or define it to check simultaneous image and OpenGL interop (clCreateFromGLTexture2D..).
Note that with GL interop the image comes out upside down and somewhat greener, but Nvidia OCL gives similar output, so I have to revise the code; for now it is working.
Lastly, what's lacking is 3D image write support. AMD's new simple image sample has a more or less disabled 3D texture write test, so by changing a few simple lines you can test it.
Currently I see that even the CL shader compiler has a 3D image write signature: if you change the code to pass a 3D image object while using int2 coordinates, it says it needs an int4 argument. After changing that, the error you get is "I can't find builtin function #xyz", so it seems almost everything is in place. Adding the #pragma to enable 3D image writes fails, saying the extension is not known, but anyway it seems the next version may have this support, with an implementation more advanced than Nvidia's?
GL interop:
The AMD SDK has a VBO example and it works. Using oclPostprocessGL as a PBO example also works.
Even changing the code in these two demos to create the VBO and PBO GL objects before CL context creation works, and that shouldn't work, as it is said to be a limitation.
So it seems the only current limitation is that the GL context must exist before the CL context, which is per spec, as context creation needs the GL context.

Also, as said before, GL interop with image support is working.
Byte addressable stores: they work, but at the IL level they seem to be implemented with AND and OR masks, so I think the hardware has no native byte addressing. The IL also shows a UAV, which is a DX concept that needs 32-bit aligned accesses. Even a DX byte address buffer UAV allows byte addressing only at 32-bit alignment. A general UAV is like a vector of ints, so a[1] corresponds to byte pointer a[4].
I have to see how AMD avoids race conditions, if this is not native, when multiple threads write different bytes in the same word: if it does a read-modify-write, it must use atomics? And that adds overhead.
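To make the concern concrete, here is a sketch in C of what a byte store emulated on 32-bit words looks like, and how a plain read-modify-write can lose an update. This is my guess at the pattern from the AND/OR masks in the IL, not AMD's actual code; the function names are made up and a little-endian byte order inside the word is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Return `word` with byte number `byte_index` (0..3, little-endian) replaced. */
uint32_t word_with_byte(uint32_t word, unsigned byte_index, uint8_t value)
{
    unsigned shift = (byte_index & 3u) * 8u;
    uint32_t mask = (uint32_t)0xFFu << shift;
    return (word & ~mask) | ((uint32_t)value << shift);
}

/* Emulated byte store: read the containing word, patch one byte, write back.
   Nothing here is atomic. */
void store_byte(uint32_t *words, size_t byte_addr, uint8_t value)
{
    size_t w = byte_addr >> 2;
    words[w] = word_with_byte(words[w], (unsigned)(byte_addr & 3u), value);
}

/* Simulates two threads that both read the word before either writes back:
   thread 0 patches byte 0, thread 1 patches byte 1, thread 1 writes last. */
uint32_t racy_two_byte_store(uint32_t word, uint8_t v0, uint8_t v1)
{
    uint32_t seen_by_t0 = word, seen_by_t1 = word;
    uint32_t t0 = word_with_byte(seen_by_t0, 0, v0);
    uint32_t t1 = word_with_byte(seen_by_t1, 1, v1);
    (void)t0;   /* thread 0's write is overwritten by thread 1's */
    return t1;
}
```

In the last function thread 0's byte is lost, which is why a non-native implementation would need atomics (or some other ordering guarantee) to be safe.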

What troubles me the most is that the Apple GL interop demos fail, but they combine GL interop with image support
and also some copyimagetobuffer or buffertoimage calls, so I have to see whether it is a GL interop problem, an image support problem or a copy problem. Then I will release it.

Regarding samples, it has a new boxgl sample that is not mentioned.

AMD is publishing a lot of extensions (some very simple):
*amd_printf: I have checked and it now works with Visual Studio (in 2.01 only with Linux?)
If you don't enable it explicitly the compiler fails; previously there was no way to disable it.
*amd_fp64: on GPU no changes since 2.01, so +-/* only; on CPU a wealth of features, but without conformance or strictness. So no dmad, I think, and how GPCbenchmark gets such high Gflops in doubles without mad I don't know. I also don't know whether mads are generated for integers now; it seems so, but the last time I checked (2.00, in January) they weren't. And what about mad24?
*amd_media_ops: could make the Pyrit CAL++ implementation unnecessary. In trunk/SVN it has a 2x-3x improvement over OpenCL due to its use of bitalign, which you can now use from OpenCL. It would be interesting to see whether cpyrit gets support, now that the trunk code also uses the native OpenCL rotate instruction for possibly better performance on ISAs that have it. The extension also has SAD support, which AMD announced at the 5xxx launch it would add to OpenCL.
In the binary there are hints of:

amd_vector3: I assume it defines float3, or not? I think Nvidia doesn't have it even unofficially, so it is good to have.
Also, some Apple demos #define float3, so it is good to be able to disable it, as with printf, since this code should then work on AMD without any modification.
amd_atomic_counters: similar to the unpublished GLSL atomic_counters?
Also, ext_device_fission currently lacks extension documentation and is CPU only, but it seems a way to expose concurrent kernels on Fermi GPUs, so I hope Nvidia supports it. Anyway, it's a shame that using two or more command queues doesn't extract that performance on Nvidia, as Nvidia supports it in CUDA via streams, which is a similar concept. I have to post the code I wrote to check it.

I have also found a trick to get the GLSL point sprites used in the Nvidia OCL samples and Particles demos fully working: simply replace gl_TexCoord[0] with gl_PointCoord in the fragment shaders
(thanks to pboudier on the AMD forums). Before that, you could only show the particles with point rendering, using the 'p' key or the menu option.

Regarding running the samples across vendors, many bugs are fixed and only the ones due to architectural differences remain:
mainly warp size assumptions, shared memory size, workgroup size and other out-of-resources limits (register stack?), etc.
The Particles and Sort examples need, on AMD, a fix I posted some time ago.

The biggest complaints/suggestions and bugs/limitations are:
*Is byte addressing native in HW? Are there race conditions when multiple threads write different bytes of the same word, or performance issues due to atomics usage?
*More image format support (at least R and RG with half float, float and int8/16)
*3D image writes
*D3D9 and D3D10 interop: disabled in 2.1 (worked in 2.01?), supposedly coming in the next version in Q3. Anyway, a new KHR D3D10 extension has been published by Khronos; it is similar to Nvidia's but differs in supporting a shared handle parameter and a device info flag saying whether interop gets improved performance with a shared handle.
KHR D3D9 and D3D11 extensions would be good too, with Nvidia and AMD supporting them.
For example, DXVA->OpenCL via these extensions should enable multi-IHV Badaboom-like encoders, at least in the decoding part and perhaps fully using MFT GPU encoders.
*Doubles are still lacking on GPU (+-/* only) and not conformant
*No device fission on GPU. AMD has said its stream processors support it in HW, at least at the granularity of 80-shader blocks, so 20 concurrent kernels on 5xxx are theoretically possible. But I think it is mostly a limitation of the CAL API or AMD IL, so it may take a while to fix?

So the biggest remaining Nvidia limitations now are:
*3D image writes
*Concurrent kernels on Fermi: no device fission on GPU and no concurrency from multiple command queues
Also, is dual DMA usable?


Tuesday, 6 April 2010

Mandelbrot using OGL 4.0 features (double precision and precise keyword)

Posted on 19:57 by Unknown

http://dl.dropbox.com/u/1416327/mandeldouble.rar
The above executable contains:
*a float-float implementation that uses GL_ARB_gpu_shader5 for the precise keyword, to counter the aggressive Nvidia compiler optimizations
*a version with doubles that uses GL_ARB_gpu_shader_fp64, and falls back to doublepAMD on Catalyst drivers without OGL 4.0
*a normal Mandelbrot implementation

On an AMD 5850 at 1920x1080 with the ATI GL 4.0 drivers
I obtain:
*15 fps using the float-float approach
*50 fps using doubles
*130 fps using single precision
Note that pre-GL 4.0 drivers using doublepAMD attain 36 fps in double precision; with the GL 4.0 drivers both doublepAMD and double attain 50 fps.
You can deduce the Gflop/s by looking at the GLSL code; it's very high.
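A rough way to turn those fps figures into Gflop/s, sketched in C. The iteration count per pixel and the flops per iteration are not given anywhere in this post: they are parameters you have to count from the GLSL source and the chosen zoom, and the function name is mine.

```c
/* Sustained GFLOP/s of a per-pixel iteration shader.
   avg_iters and flops_per_iter are NOT given in the post: count them from the
   GLSL source and the chosen zoom before trusting any number. */
double shader_gflops(double fps, int width, int height,
                     double avg_iters, double flops_per_iter)
{
    double pixels_per_second = fps * (double)width * (double)height;
    return pixels_per_second * avg_iters * flops_per_iter / 1e9;
}
```

For example, shader_gflops(50, 1920, 1080, 100, 10) plugs in the 50 fps double precision figure with a made-up 100 iterations per pixel and 10 flops per iteration; swap in the real counts to get a meaningful number.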

I use #if 1 instead of #ifdef GL_ARB_gpu_shader5 or shader_fp64, as that way the shaders work on Nvidia GL 3.3 drivers, although without doubles (in place of double precision) and without the precise keyword, so float-float is still bad!
I.e. I force the extension enable directive.

Sorry for the big exe: it is linked against Cg although it isn't really using it now. Cg was used to correctly disable optimizations on Nvidia, but that is not working now.
The program arguments are: first the horizontal pixel start offset (for multi-monitor setups), second fullscreen or not, then the fragment and vertex shaders, and then the zoom and the x and y offsets in the Mandelbrot set.
These are used to show a zoom deep enough to see the difference between single precision and higher precision (either double or float-float). The last argument selects the GLSL or Cg backend,
but as said, Cg is broken.

It seems AMD doesn't optimize as aggressively, since float-float works OK there without precise!

AMD 5850 with OGL 4.0 drivers on Windows 7 (with fps):
http://dl.dropbox.com/u/1416327/float-float.jpg
http://dl.dropbox.com/u/1416327/fp32.jpg
http://dl.dropbox.com/u/1416327/fp64.jpg
NVIDIA:
the bad float-float rendering is similar to the AMD fp32 picture.

Fix for float-float: use precise.
I hope this works well with the Fermi OGL 4.0 drivers, and that they also enable the precise keyword for GL 3.0 hardware.
Cg has a trick for disabling optimizations, so it's not needed there.
Search the blog for more info.

vec2 dblsgl_add (vec2 x, vec2 y)
{
precise vec2 z;
float t1, t2, e;

t1 = x.y + y.y;
e = t1 - x.y;
t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
z.y = e = t1 + t2;
z.x = t2 - (e - t1);
return z;
}

vec2 dblsgl_mul (vec2 x, vec2 y)
{
precise vec2 z;
float up, vp, u1, u2, v1, v2, mh, ml;

up = x.y * 4097.0;
u1 = (x.y - up) + up;
u2 = x.y - u1;
vp = y.y * 4097.0;
v1 = (y.y - vp) + vp;
v2 = y.y - v1;
//mh = __fmul_rn(x.y,y.y);
mh = x.y*y.y;
ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
//ml = (fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;

ml = (x.y*y.x + x.x*y.y) + ml;

mh=mh;
z.y = up = mh + ml;
z.x = (mh - up) + ml;
return z;
}
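For checking the arithmetic on the CPU, here is the addition above transcribed to C as a sketch. The ff_t struct and the function names are mine; x.y and x.x from the shader become hi and lo. The volatile qualifiers stand in for precise, on the assumption that the compiler evaluates float expressions in float precision (true for SSE builds, not necessarily for x87) and is not run with fast-math flags.

```c
/* A float-float number: value = hi + lo, with |lo| much smaller than |hi|. */
typedef struct { float hi; float lo; } ff_t;

/* Same operation order as dblsgl_add in the shader. */
ff_t ff_add(ff_t x, ff_t y)
{
    volatile float t1, e, t2, s;
    ff_t z;

    t1 = x.hi + y.hi;                 /* rounded sum of the high parts */
    e  = t1 - x.hi;                   /* part of y.hi that made it into t1 */
    t2 = ((y.hi - e) + (x.hi - (t1 - e))) + x.lo + y.lo; /* rounding error + low parts */
    s  = t1 + t2;
    z.hi = s;
    z.lo = t2 - (s - t1);             /* what did not fit into z.hi */
    return z;
}

/* Convenience: widen to double to inspect the combined value. */
double ff_to_double(ff_t x) { return (double)x.hi + (double)x.lo; }
```

Adding 1e-8 to 1.0 shows the point: in plain float the sum rounds back to 1.0, while ff_add keeps the small term in lo.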


Thursday, 1 April 2010

Some things I forgot..

Posted on 19:40 by Unknown

First, a DirectCompute blog,
http://www.yakiimo3d.com/
with Nebulabrot code.
Also, it seems rigid bodies on the GPU are starting
before PhysX SDK 3.0, and Batman will use them on a GTX 480. I found on the Nvidia FTP:
BatmanAA_GTX480and470_PhysX_Patch.zip
dated 30 March 2010.
It contains rrb.dll, which is
"gpu accelerated rigid body dynamics" v 1.0.0.1, dated 11 January 2010.
It depends on cudart 3.0 patch 9
and exposes:

AgPmDestroySourceConnection
AgPmEventEnabled
AgPmEventLoggingEnabled
AgPmSubmitEvent
PrbCreatePhysicsSDK
PrbFree
PrbGetPhysicsSDK
PrbMalloc
PrbMallocDEBUG
PrbReleasePhysicsSDK

similar to physxcore:

AgPmDestroySourceConnection
AgPmEventEnabled
AgPmEventLoggingEnabled
AgPmSubmitEvent
NgCreateCoreSDK
NpCreatePhysicsSDK
NpGetFoundationSDK
NpGetPhysicsSDK
NpGetPhysicsSDKAllocator
NpGetUtilLib
NpReleasePhysicsSDK
NxCreateCoreSDK
NxGetValue

Did the original Batman release have this?

Also note that cudart 3.0 patch 9 is found in the PhysX runtime of 22 Feb 2010.

Note the CUDA 3.0 runtime DLL patch levels seen so far:
8 beta
9 physx
11 optix2b3
14 final


Megapost!

Posted on 17:13 by Unknown

Today fools{

*GTX 485 is 512 cores 3gbytes gddr5 and 850/1750 shaders..

*ati 5990 has 4 gpus in board..

*bulldozer benchmarks

}end fools..

ATI has released:

*5870 2gb 6 outputs

*GL 3.3/4.0 drivers (linux &win)

*GPU perfstudio 2.2

*AMD ADL SDK 3.0 (aka eyefinity sdk)

two stream documents:

*OpenCL Programming Guide

*GPU Computing: Past, Present and Future with ATI Stream Technology michael chu

Lame to see the backup slide comparing CUDA vs OpenCL..

*VA-API with H.264 decode on Westmere CPUs is now in git

So we now have H.264 GPU decode on Linux via VA-API for Intel, NVIDIA and AMD cards..

Well, AMD 5xxx is not OK yet, and Intel G45 will have to wait until Q3 2010..

What about VC-1? ATI and NVIDIA support is there, even on the 8800 GT via the latest VDPAU..

Will Intel catch up?

And does dual HD stream decode work with every API/implementation on the latest GPUs? All Intel HD Graphics 2010 parts, AMD 5xxx, GT 240 and Fermi have hardware support for it..

What about H.264 MVC: does VA-API expose it, i.e. does the API allow it? And what about XvBA, DXVA and VDPAU..

Also, now that we have CUVID for Mac (even x64 seems possible), will CUVID allow MVC, or does it already?

Also, Gnash VA-API support is now integrated in trunk and seems compilable on Mac and Windows, so we could

port VA-API to Mac and Windows and even implement a CUVID VA-API wrapper?

This would allow MPlayer and Gnash to support GPU video decode on Mac for NVIDIA cards, for HD video and Flash video..

Nvidia has released:

*Nexus march beta (same as shown in GDC'10 so would allow d3d10 and d3d11 shader debug on Fermi..)

*Optix 2.0b3

*CUDA 3.0

*OGL 3.3 drivers

Still lacking

*Cg 3.0?

*OGL 4 drivers with ext_image_load_store and ext_image_atomic_counters support

*Linux Fermi drivers (win has 197.17)

*3d vision surround sdk

*3dtv hdmi 1.4 drivers

*256 drivers

*nv d3d11 sdk presumably has:

hair tess and water tess demos

*physx sdk 3.0 with rigid body on gpu and height field water as fermi launch demo?

*voltage tweaking software, the max overclock with it for the GTX 480 (900 MHz?) and the 470 (750/800 MHz possible), and benchmarks

*OptiX 2 and Nexus 1.0 final

*test voxilla demos and FP64 perf in CUDA and OpenCL with CUDA-Z

Is FP64 perf 1/4 of the Tesla products? Can it be hacked? See the PTX code and cubin code..

gpu computing:

*cudart x64 for mac

*cuda-gdb for mac

*cuda-gdb support for ocl binaries

*the promised official NVIDIA tools for disassembly and assembly of Fermi binaries (new cubins only, with decuda still needed for the old ones, or will they support sm_1x binaries too?), promised as coming soon in the SIGGRAPH Asia CUDA perf optimization course..

*mac cuda-opengl efficient interop?

official perf

*tessellation 6-8x

*raytracing 3.5x

*SLI near 2x on D3D11 games

*3D Vision near 2x (see the 3D Vision blog)

OK, but ROP and texture throughput is very low, and the texture units seem to be capped at half rate,

as the GF104 info that has surfaced also shows 64 texture units..

NVIDIA admits it has GDDR5 controller problems, so it does not run the 5000 MHz GDDR5 chips at their rated 1250 MHz..

The 470 seems to use 4000 MHz chips..

Review notes:

*noticias3d has the slides and perf vs the 5870 with the launch 8.66 drivers, which is a good baseline for measuring overall driver improvement, as this is the perf of six months ago.. Catalyst 10.2/10.3 bring a 10% perf improvement..

*ixbt uses the RightMark geometry shader perf tests..

*AnandTech has the Chen N-Queens OpenCL perf and the new Folding@Home client, but another site claims only a 50% perf gain vs the 2-4x improvement AnandTech reports

*reviews have a new D3D11 benchmark by a Swedish company

*Sandra 2010 GPGPU benchmarks, but double precision is bad..

*D3D11 games: Metro, Heaven 2.0, DX11 SDK tessellation demos, Just Cause 2 benches..

*LuxRays perf on the Beyond3D forums..

Apple has released 10.6.3 without the AMD CAL libs (see PGI 10.2, whose CAL info mentions aticalrt.dylib)

It also seems to have almost complete OGL 3.0 for AMD; NVIDIA has a few extensions fewer, and the CPU driver lacks 3/4 extensions..

I found Fermi in the OGL binary driver, but it is not really supported..

Phoronix found the OGL drivers have more than a 50% perf degradation on the 9400 (bad),

but they should allow Steam to run well on Mac..

Regarding OpenCL: still no new headers for the CUDA SDK 3.0 issues, and apparently no big improvements, as none are mentioned in the release

I have to test 10.6.3 with a CUVID x64 executable I have and the OptiX 2.0b3 SDK, run the FFT OpenCL and ocean Apple demos on both NVIDIA and ATI GPUs.. and run the NVIDIA OCL ft3d sample, which is said to have issues with Apple OpenCL, to see if they are fixed..

Also, are there OCL headers in the iPad 3.2 SDK golden master?..

It still seems there is no double support in OpenCL for NVIDIA and no image support for ATI GPUs on Apple..

Add that to the unfixed double precision in compute shaders..

Double in OGL 4.0 (ext_gpu_shader_fp64):

NVIDIA has released OGL 3.3 drivers, but will the 4.0 drivers support fp64 on the GTX 275?

Also, is double supported on 4850 cards with the ATI 4.0 drivers?

Also, will NVIDIA release the wgl_nvx_dx_interop spec and the ext_image_load_store extension with the GL 4.0 drivers?

Any more extensions?

With that, at least DirectCompute and OGL will allow 3D image writes.. OpenCL allows 2D image writes by default, and CUDA is the weakest here? Only through pitch-linear memory..

Still lacking are an OpenCL 3D image writes extension and the CUDA surface functions, which were in CUDA 3.0 beta but then removed.. I think they didn't work..

also a post is interesting in nvidia forums saying that now opencl using a writable texture seems to not

iZ3D 1.11 has been released with shutter support (I can't test it on the Samsung 120 Hz display because I have activation issues),

but I tried anaglyph mode, which shows the algorithm works well with D3D9, 10 and 11 in the DirectX SDK samples..

Lame: the ATI D3D11 Mecha and Ladybug demos don't work properly..

Mecha crashes, and in Ladybug the view is not affected..

The NVIDIA compute shader ocean demo doesn't look right, while 3D Vision with 197.13 does work with that demo!

Also, some tessellation samples don't work

In brief:

*32-bit examples are OK, 64-bit examples crash (is it my system's fault?)

*YouTube 1080p 3D HD works in Internet Explorer with Flash 10.0 (not 10.1) and with YouTube set to English!

*Windowed stereo mode works.

So NVIDIA has to add YouTube 3D and windowed stereo mode support (for non-Quadro cards) in the magical 256 drivers.. even better if they also add NVIDIA 3DTV and HDMI 1.4 output for OpenGL quad-buffered (QB) stereo on Quadros..

Also, the diagnostic utility reports that the ATI AQBS D3D surface format, which must be the AMD Catalyst 10.3 3D support, is not supported, although I am using Catalyst 10.3 WHQL. So it seems I must set the LCD to 120 Hz or find an HDMI projector? Anyway, I can't set the refresh rate in Catalyst Control Center right now..

I would love to have a CUDA hook that enables graphics interop through the host for the Tesla compute driver on Windows, and for running with only the kernel module on Linux, to run nbody for example..

It's a shame: in earlier versions OGL interop went through the host when not running on the same GPU, and now it returns an error..

The same goes for OpenCL, which reports OGL interop..

This applies to both D3D and OGL interop..

I would also add on-the-fly cubin to PTX conversion, for running the nufft or fastest matmul cubin codes on Fermi..

Also test enabling cu-force-ptx-jit

It would be good to test D3D-OCL interop with DXVA 2.0 (D3D9 texture?) interop, to build an open source Badaboom..

On an 8800 GT or GT200 with VP2 (VC-1 VLD not supported), I would love to see which gives the lowest CPU usage: CUVID, DXVA or VDPAU.. assuming all of these handle it..

The same for dual-stream HD, and for MVC when it comes out..

Currently I found lacking on AMD 5xxx:

*OCL image support

*OGL-OCL tex interop

*xvba 5xxx incorrect decoding

I would love to have a simple OGL QB driver with anaglyph output, for testing ports of Gnash, MPlayer etc. that add 3D stereo rendering and YouTube 3D on Mac and Linux..

Then port 3D Vision to these OSes..

Note: I learnt from Unigine Heaven 2.0 that iZ3D doesn't work when started from the launcher, but the demo ships .bat files for launching it, and with those iZ3D works in D3D9; in D3D10 it crashes as soon as stereo is activated, and in D3D11 it depends, but it doesn't look right..

Note: it seems the Windows demo, compiled on 7 March, has no support for the old AMD tessellator GL extension, even after editing haven.cfg, so it doesn't work; it doesn't work with the AMD OGL 4.0 drivers either..

On Linux you can use Heaven 2.0 with ATI tessellation, as the Linux build is later..

I would like an atioc utility on Linux that overclocks well beyond what is officially supported, as MSI Afterburner does..

I have to hook ATI ADL and see..

angle google code project is improving fast:

*now has ogl samples included with esut.h and support for loops in shaders etc..

*64-bit builds require this patch:

--- src/libEGL/Display.cpp (revision 49)
+++ src/libEGL/Display.cpp (working copy)
@@ -63,8 +63,8 @@
}
else
{
- EGLint minSwapInterval = 4;
- EGLint maxSwapInterval = 0;
+ int minSwapInterval = 4;
+ int maxSwapInterval = 0;
Index: src/libGLESv2/geometry/vertexconversion.h
===================================================================
--- src/libGLESv2/geometry/vertexconversion.h (revision 49)
+++ src/libGLESv2/geometry/vertexconversion.h (working copy)
@@ -122,7 +122,7 @@
static const std::size_t finalWidth = N+(N&1);
};
-template
+template
struct WidenToFour

The samples require more changes too..

I have been trying to port

Crazy drivers:
amd:

cat 10.2 B_95228 3/2
cat 10.3b B_95437 5/2
cat 10.3 B_96537 3/3
10.3a B_97263 14/3
10.3 ogl4 B_97624 24/3
10.3b B_97763 25/3
10.4 shipping for ubuntu 10.4

nvidia

196.75 required for Nexus support
197 or higher -> OCL-D3D interop
197.13 official CUDA 3.0 drivers, WHQL
197.15 OGL 3.3 driver
197.16 notebook Verde driver with 3D Vision external support
197.17 Fermi launch press drivers
197.25 StarCraft DX8 issues

GeForce 256 drivers in April with 3D Vision Surround

about ogl 3.3/4.0 drivers

ogl 3.3 samples released..

ogl 4.0

openglext and extensions viewer show ogl 3.3/4.0 extensions

google code gle,gloader load 4.0 extensions.. glew?

info released about http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_memory_info.txt

Fermi post launch analysis:

lacks
http://forum.beyond3d.com/showpost.php?p=1414824&postcount=283
latest gpgpu releases:

*thrust 1.2
*jacket 1.3
*Folding@Home fermi with openmm? gpu3 client
*cudpp 1.1.1?

released:
*nvidia Design Garage
*supersonic sled

demos not public:
*Raging Rapids tech demo
*hair demo
*water tesselation demo
*d3d11 demo by sweden company

Testing CUFFT, I found that since 2.3 it includes the nufft improvements, as cubin only (nufft paper, SC09).
nufft has test bench code for a 256^3 FFT transform.
CUFFT at SC09 showed perf over 160 GFLOPS for 256x144x192.
CUFFT 3.0 is only superfast if every dimension is a power of two, although the dimensions may differ..
If not, 20-30 GFLOPS.
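
A small sketch of how one could check whether a given FFT size is on the fast path described above, and turn a measured time into a GFLOPS figure. It assumes the usual benchmarking convention of 5 * N * log2(N) floating-point operations for a complex transform of N total points; the function names are made up for this illustration and nothing here calls CUFFT itself.

```python
import math

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (bit trick: a power of two has one bit set)."""
    return n > 0 and (n & (n - 1)) == 0

def all_power_of_two(dims):
    """True if every dimension of the transform is a power of two."""
    return all(is_power_of_two(d) for d in dims)

def fft_gflops(dims, seconds):
    """Nominal GFLOPS of a complex FFT, using the 5*N*log2(N) convention."""
    n = math.prod(dims)
    return 5.0 * n * math.log2(n) / seconds / 1e9

if __name__ == "__main__":
    for dims in [(256, 256, 256), (256, 144, 192)]:
        print(dims, "all powers of two:", all_power_of_two(dims))
```

So 256^3 qualifies for the fast path, while 256x144x192 does not, because 144 and 192 are not powers of two.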

I have to test the Microsoft DirectX compute shader FFT library..

The AMD 5850 in glext shows:

*doesn't have:

GL_EXT_stencil_two_side?

GL_ARB_compatibility (3.1)-> seems present, so is it only reported when a 3.1 context queries it?

GL_EXT_shader_image_load_store->present in dll!
accessorStore UAV_STORE
imageLoad imageStore

GL_ARB_shading_language_include->seems to have basic include support!

has:

GL_EXT_vertex_attrib_64bit (no published spec)
GL_ARB_texture_compression_bptc->has ext
GL_EXT_shader_atomic_counters (no published spec)
imageAtomicAdd imageAtomicSub imageAtomicMin imageAtomicMax
GL_ARB_texture_swizzle->has ext_texture_swizzle
GL_ARB_texture_buffer_object_rgb32->has ext

proprietary
add GL_AMDX_debug_output
amdx->GL_AMD_name_gen_delete
GL_AMD_conservative_depth

-------------------------------------------
Not implemented extensions in OpenGL 2.0:
GL_EXT_stencil_two_side

-------------------------------------------
Not implemented extensions in OpenGL 3.0:
GL_NV_depth_buffer_float->has arb_depth

-------------------------------------------
Not implemented extensions in OpenGL 3.1:
GL_ARB_compatibility

-------------------------------------------
Not implemented extensions in OpenGL 3.3:
GL_ARB_shading_language_include->no need
GL_ARB_texture_swizzle->has ext_texture_swizzle

-------------------------------------------
Not implemented extensions in OpenGL 4.0:
GL_ARB_texture_buffer_object_rgb32->has ext

-------------------------------------------

*I can't say much, but I have betas of both:

*OpenRL 1.0b2

Has Windows (x32, x64) libraries and Mac x32-only libraries

Still lacking Linux and Mac x64 binaries..

Remember OptiX also supports Mac, but only x32..

No OpenCL bits found anywhere; for now it is a CPU-only release,

but it uses all my 8 cores..

It would be nice to port the OptiX samples and tutorial to OpenRL and vice versa..

Or better, make an OpenRL wrapper over OptiX 2.0b3 with Fermi support..

*Intel Compiler 12 (Composer 2011)

Cilk, #pragma vector size(4,8) etc..

VS2010 support

AES-NI for crc32 and better AVX overall

and more..

IPP 7.0 beta

Intel Compiler 12 beta

TBB old version?

*I now have the code (libecuda, libptx) of the PTX emulator from a UPC final-year project (PFC)..

Trying to build it for Windows and update it to PTX ISA 2.0

*Still no gDEBugger CL

PTX ISA 2.0 released:

It also includes the PTX 1.5 info (NVIDIA's LLVM-based OpenCL compiler emits PTX 1.5 code)

-> it mainly adds separate texture and sampler setup, and the same __param handling as for function arguments

-> it also shows OpenCL kernels have no name mangling.. I am now testing whether PTX with addc can be inserted into OpenCL, i.e. a converter from CUDA 3.0 PTX kernels to OpenCL PTX kernels would be good..

Is cubin to PTX conversion also possible? It would let me run the fastest matmul to date on Fermi, as Fermi doesn't run the old cubins..

Are there limitations? Ask the Barra creator: he has a Tesla cubin simulator..

So I could theoretically go from cubin to OCL-compatible PTX code..

PTX 2.0 shows for Fermi

HAS (also implemented):

*D3D11 CS 5.0 integer instructions

*ldu

*unified address space ld loads

*surface functions (load and store)-> 3D image writes

Loads and stores exist with and without format; loads with format are not implemented, and stores with format are not implemented either, except for a b32 format

EXPOSES:

*recursion via..

*function calls with a stack (so recursion is possible) without defining an ABI

*calloc function

*variable arguments to functions

Note this is not implemented in 2.0

Still lacking are:

*jump to register/pointer or call to register/pointer (virtual functions?)

*host system calls: malloc, printf, etc..

Also the CUDA book shows:

*Fermi predication is based on

"A Comparison of Full and Partial Predicated Execution Support for ILP Processors"

*Fermi supports terminating kernels whenever you want (driver stability improvements?)

also useful for load balancing..

*CUDA Fermi implementation priorities.. virtual unified space can take years..

*a virtual address space goes well with the GMAC approach of a single unified address space for CPU and GPU memory; for now only the GPU address space is unified


Sunday, 21 March 2010

What's for CUDA 3.1 and OpenGL 3.3/4.1!

Posted on 12:37 by Unknown

Let's see CUDA 3.0 final vs beta:

*adds full BLAS support
*OpenCL local atomics
*OCL and CUDA D3D9-11 interop..
*updated guides since beta..
Still no PTX 1.5, 2.0 specs..
also nv-cl extensions published now: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/opencl_extensions/cl_nv_compiler_options.txt

Interesting notes.

*Float16 (half) textures are supported in the runtime

*CUBLAS is complete, and Fermi is IEEE 754 compliant

*SGEMM performance on Fermi-based GPU is 30% lower than expected.

It will be fixed in 3.1.

*The stability of the large-prime FFT transform (signals with a length that is prime and >64k samples) is extremely variable, giving single-precision accuracy in the range 0.005->0.025. In general, smaller signals experience greater accuracy.

*This package will work MAC OSX running 32/64-bit.

* CUDA applications built in 32/64-bit (CUDA Driver API) is supported.

* CUDA applications built as 32-bit (CUDA Runtime API) is supported.

(10.5.x Leopard and 10.6 SnowLeopard)

Note: x86_64 is not currently working for Leopard or SnowLeopard

*CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.

* CUDA applications using CUDA Runtime APIs can only be built on 32-bit applications.

SDK Release 3.0 Final:

* Replaced 3dfd sample with FDTD3d (Finite Difference sample has been updated)

* Added support for Fermi Architecture (Compute 2.0 profile) to the SDK samples

* Updated Graphics/CUDA samples to use the new unified graphics interop

* Several samples with Device Emulation have been removed. Device Emulation is deprecated for CUDA 3.0, and will be removed with CUDA 3.1.

* Added new samples:

concurrentKernels (Fermi Capability)

* Bug Fixes

They have also added simpleMPI..
I have to test it with Intel MPI 4.0

Mac notes:
cuda.dylib is 64-bit and has the 195 API; there are 195 and 185 dylibs, versioned as 195_96 or 185_55..
*has cuda-memcheck but no cuda-gdb
*the CUDA kext is a fat binary that includes 64 bits, and so is cuda.dylib, so CUDA driver API applications are compatible with 64 bits
and can be compiled for it..
Note you can also boot the 64-bit kernel thanks to the kext..
cudart is 32-bit only.
So in theory we can write a cudart wrapper over the CUDA driver API and compile it in 64 bits, even more so
now that cudart is stateless and interoperates with CUDA driver memory allocations..

All that is needed then is for CUBLAS and CUFFT to be compiled in 64 bits on top of that..

We have the code for cudpp, thrust and cusp, and in the meantime Volkov's matmul, FFT and LAPACK codes,

so all these could be compiled in 64 bits if we had a 64-bit cudart, and we could see what happens..
Well, I have compiled deviceQueryDrv and matrixMulDrv
(am I the first in the world to have 64-bit CUDA Apple binaries? Except at NVIDIA..?)
I got rid of cutil, though compiling it to 64 bits would be no problem. Some notes:
nvcc on Mac defaults to 32 bits, whereas gcc defaults to 64 bits on Snow Leopard..
So to build for 64 bits you must pass -m64 to nvcc..
But for CUDA driver API projects nvcc is not needed for the host code, since you can use g++ for the driver API code and compile the CUDA
files to PTX with nvcc -ptx

If you use nvcc with -m64 you get 64-bit CPU code, and with -ptx you also get PTX code

using 64-bit pointers, for Fermi?

Since you can use 32-bit pointers on Fermi, it is better to use 32-bit pointers..
So for matrixMulDrv, use nvcc -ptx for 32-bit pointers and g++ (-m64) for the host code, and you get a 64-bit executable,
but in cuModuleLoadDataEx I get the error
CUDA_ERROR_POINTER_IS_64BIT = 800, ///< Attempted to retrieve 64-bit pointer via 32-bit API function
when loading the PTX. I get this error whether I use nvcc -m64 or plain nvcc (both with -ptx)..
So PTX with 32-bit or 64-bit pointers doesn't change that..
I have to compare the files built with 32-bit and 64-bit pointers to see the differences, also with sm_20..
Also note that for nvcc -m64 to work, /usr/local/cuda/lib64 must exist, even when it is not actually needed..
So I copied lib to lib64; a symlink also works..
Now you can run it..
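
The lib64 workaround can be scripted. Below is a minimal sketch, assuming the toolkit lives under a root directory such as /usr/local/cuda with a lib subdirectory; the function name ensure_lib64 is made up for this illustration and is not part of any NVIDIA tool.

```python
import os

def ensure_lib64(cuda_root):
    """Create <cuda_root>/lib64 as a symlink to lib if it is missing.

    Returns True if the link was created, False if lib64 already existed.
    """
    lib = os.path.join(cuda_root, "lib")
    lib64 = os.path.join(cuda_root, "lib64")
    if not os.path.isdir(lib):
        raise FileNotFoundError(lib)
    if os.path.lexists(lib64):
        return False              # already present, leave it untouched
    os.symlink("lib", lib64)      # relative link, so the tree stays relocatable
    return True

if __name__ == "__main__":
    # Assumed install location; writing there usually needs root privileges.
    print(ensure_lib64("/usr/local/cuda"))
```

A symlink avoids keeping two copies of the libraries in sync, which the copy approach would require after every toolkit update.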
I have to write a tutorial on using CUDA and nvcc to build Mac OS fat binaries (i386 and x86_64)
*I see an nvcuvid library for Mac in the GPU Computing SDK.. only 32 bits..
/C/common/lib
and /C/common/inc/cuvid

Anyway, I have a 64-bit libcuvid (vs libnvcuvid) in /usr/local/cuda (where did I get it from?)

*Also there is a pref pane control panel with autoupdate that shows the GPU driver version and the CUDA driver version..

Note the OpenCL samples on Mac do not work until 10.6.3..

Good: NVIDIA documents the OpenCL behavior that the spec leaves undefined (implementation specific):

http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ImplementationNotes_3.0.txt

issues with mac..

opengl 4.1/3.3 perfect release:

*ext_direct_state_access

*ext_separate_shader_objects

*RW textures (3d also) ext_image_load_store

*binary shaders (gl es 2.0 api)

In theory you can use some IR from the 3Dlabs frontend compiler source..

Or translate to HLSL via some translator (AMD hlsl2glsl?) and then use a binary HLSL shader..

Also a good translator..

http://code.google.com/p/angleproject/

It has a flex/bison GLSL parser and also a GLSL to HLSL translator (ES 2.0)..

Going from binary to DX IL via:

fxc /dumpbin

But DX IL to binary? And how to go from DX IL to HLSL or GLSL directly..

I also have found wine handles/parses more or less dxbc files..

/dlls/d3d10/effect.c

static HRESULT parse_shade

NV OGL extensions:

*Fermi function pointers and recursion for GLSL?

They would be a good addition to the bindless extensions and shader buffer load..

CUDA 3.1:
*cuda-gdb OpenCL HW debugging support..
*pinned GPU mem interop with MPI Infiniband.. (spring10 in sc09)

*template for a DirectCompute project
Currently there is no template for a DirectCompute project, but NVIDIA will be providing one soon.
*Fix CUBLAS SGEMM perf on Fermi (30% lower than expected)
*Fix the CUFFT perf regression vs 3.0 beta: it went from 180-190 GFLOPS to 150 GFLOPS
*Provide an official cudaasm/decuda, or documentation about the cubin/ELF format for sm_20 devices? Also for sm_10?
*PTX 1.5, 2.0 docs?
*Updated OpenCL best practices for Fermi? The CUDA best practices guide is updated, but for Fermi?

*Surface functions: RW textures with x,w addressing etc.. also 3D image writes.. headers and exported functions were in the beta but removed in the final..

Also a CUDA to CPU compiler, or is gpuocelot mature enough, with Mac and Windows ports available..
A direct PTX to CPU code converter would be good, using the gpuocelot lib as cudart and CUDA API..

Mac

*add cuda-gdb (with ocl also) and OpenCL visual profiler

OpenCL on Mac: 2 samples don't work

CUDA-OpenGL interop is slow on Mac

ship

is going to work with fermi cuda.kext

*Related: the first WHQL driver (197, from the 195 series) for Quadros that enables OpenCL on these devices..

Adds support for CUDA 3.0 for improved performance in GPU Computing applications. See CUDA for more details.
This driver resolves fan speed issues reported with version 196.75 drivers.
Adds support for the Open Computing Language (OpenCL) 1.0 in Quadro FX Series x700 and newer as well as the FX4600 and FX5600.

*NVIDIA mentions the compute cluster driver, but it is 196.28, not updated since early Feb.. anyway the D3D interop
finally added is not needed there..
*According to Pierre Boudier, we can expect OGL 4.0 drivers soon, and also an image write and random access extension soon, a la D3D11 RWTexture..
Ubuntu 10.4 fglrx 8.72

fglrx-installer (2:8.721-0ubuntu1) lucid; urgency=low

* New upstream release:
- Restore compatibility with kernel 2.6.32 and xserver 1.7 (LP: #494699).
- Add Passive Stereo support on workstation (FireGL/FirePro) hardware.
- Add Eyefinity support (more than 2 monitors on Radeon HD 5xxx hardware).
Officially WS-only but should work on consumer boards as well.

GL_EXT_shader_subroutine GL_EXT_timer_query

Also what about 3D stereo on Linux:
*3D Vision for OpenGL quad-buffered (QB) stereo on Quadros with the stereo connector is already here..
*What about 3DTV for Linux, so OpenGL QB can be output over HDMI 1.4? This could also make it work on low-profile Quadros, as the stereo connector would not be needed (it is not really needed for 3D Vision either; it is NVIDIA's way of artificially limiting this to super high-end Quadros, well, except perhaps for better sync..)
Also, if they add H.264 MVC to VDPAU and you decrypt a Blu-ray 3D with AnyDVD HD, you will in theory be able to watch it on Linux with GPU-accelerated decoding, sending it to TVs via HDMI 1.4..
Let's also see how Windows handles it, as DXVA 2.0 does not support MVC? CUVID doesn't either, so let's see if they add it to CUVID too..
So it seems CyberLink and the rest will get some library from NVIDIA, or what?
*ATI has hooks for D3D9, 10 (D3D11?) in 10.3; fglrx 8.72 also adds passive stereo for OGL QB (active stereo is already there, right?.. but for 120 Hz LCDs too?)
Let's also see how ATI manages output to HDMI 1.4 TVs, via the iZ3D partnership or what? In fact I expect iZ3D only hooks D3D stereo and then AMD will add some HDMI 1.4 stereo output on top of these hooks, so an SDK or documentation of these hooks would be good..
NVIDIA would also do well to publish the stereo SDK (promised at GDC 2010), and I hope these hooks (D3D9-11) will also work with 3DTV and output to HDMI 1.4 TVs.. In fact yes, as Avatar and 3D Vision presumably use these hooks..
Mac is out of this picture..

Also, NVIDIA may be late with Fermi, but not with the software supporting it..
D3D11 with CS 5.0 is here now, and we also have D3D11 interop for CUDA in 3.0, D3D11 interop with an OpenCL extension, and OptiX D3D11 interop..
We have D3D11 interop with:
*CUDA 3.0
*OpenCL
*Optix
HW debugging:
Nsight.
All that remains to be released is Nsight, which will also bring D3D11 support (HW debug and profile). It will be good to HW debug CUDA, D3D11 CS and CUDA with D3D11 interop, and to trace OpenCL and OpenGL (will 4.0 be traced?)..

Will Cg 3.0 also support D3D11? And SM 5.0 with OpenGL 4.0? I.e. tessellation shaders with GLSL output?
Note cgc 3.0 is shipping in the Tegra SDK and also as part of the OpenGL compiler in the NVIDIA 195 drivers..
I have seen CgFX working with OptiX and CUDA in a blog, so I hope they ship an example soon..
http://lorachnroll.blogspot.com/2010/03/mixing-nvidia-technologies-thanks-to.html

GeForce GTX 480:
- GPU: GF100 @ 700MHz
- CUDA cores: 480 @ 1401MHz
- Memory: 1536MB GDDR5 @ 1848MHz 384-bit
- TDP: 250W
- Price: $499US
GeForce GTX 470:
- GPU: GF100 @ 607MHz
- CUDA cores: 448 @ 1215MHz
- Memory: 1280MB GDDR5 @ 1674MHz 320-bit
- TDP: 225W
- Price: $349US

- 3D APIs: OpenGL 4.0 and Direct3D 11
- GPU Computing: OpenCL, CUDA and DirectCompute
- 3-way SLI support

GeForce GTX 480 : 480 SP, 700/1401/1848MHz core/shader/mem, 384-bit, 1536MB, 250W TDP, US$499

GeForce GTX 470 : 448 SP, 607/1215/1674MHz core/shader/mem, 320-bit, 1280MB, 225W TDP, US$349
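
The listed clocks translate into theoretical peak numbers with simple arithmetic. The sketch below assumes one fused multiply-add (2 FLOPs) per CUDA core per shader clock, and that the listed GDDR5 clock moves data twice per cycle (so 1848 MHz means 3696 MT/s); the function names are made up for this illustration.

```python
def peak_sp_gflops(cuda_cores, shader_mhz):
    """Peak single-precision GFLOPS, assuming 1 FMA = 2 FLOPs per core per clock."""
    return cuda_cores * shader_mhz * 2 / 1000.0

def mem_bandwidth_gbs(bus_bits, mem_mhz):
    """Peak memory bandwidth in GB/s, assuming two transfers per listed clock."""
    return bus_bits / 8 * mem_mhz * 2 / 1000.0

if __name__ == "__main__":
    cards = {
        "GTX 480": (480, 1401, 384, 1848),
        "GTX 470": (448, 1215, 320, 1674),
    }
    for name, (cores, shader, bus, mem) in cards.items():
        print(name, peak_sp_gflops(cores, shader), "GFLOPS,",
              mem_bandwidth_gbs(bus, mem), "GB/s")
```

Under these assumptions the GTX 480 comes out at about 1345 GFLOPS and 177.4 GB/s, and the GTX 470 at about 1089 GFLOPS and 133.9 GB/s.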

Note we also have C++ libraries offering GLSL-like and OCL-like vec4 and similar types:
*GLM has strict GLSL compliance..
and with the experimental GTX extensions we even get SIMD implementations..
*The DX SDK Feb 2010 has the XNAMATH 2.02 SIMD math library
Also read:
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

HDR good maps:

http://www.hdrlabs.com/sibl/archive.html

Nvidia employees' blogs:

http://timothylottes.blogspot.com/
http://jamesdolan.blogspot.com/
http://industrialarithmetic.blogspot.com/
http://castano.ludicon.com/blog/

http://twitter.com/castano

http://twitter.com/tmurray_cmpxchg

showing max cuda mem:

http://forums.nvidia.com/index.php?showtopic=102682 cuda maxmem

caustics patents:

US patent applications: 20090096788, 20090096789, and especially 20090128562,

The LLVM 2.7 binaries are available for testing:

http://llvm.org/pre-releases/2.7/pre-release1/

http://amnoid.de/tmp/clangtut/tut.html

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2009-May/005167.html

http://synopsis.fresco.org/

Performance inconsistencies when testing various bit-counting methods

ubuntu cheat cube:119834-cheat-cube-ub

ie9 VML to SVG Migration Guide

windows phone 7:

*XNA CTP 4.0 is available; it works on PC, but only with the Reach profile, not HiDef..

*unlocked image with all apps: instructions on a blog..

*Petzold samples and book excerpt available..

*also a SQLite port ->csharp-sqlite.wp

Windows 7 XP Mode now supports CPUs without hardware virtualization (Intel VT-x / AMD-V)..

Windows 7 SP1 virtualization news:

"With Microsoft RemoteFX, users will be able to work remotely in a Windows Aero desktop environment, watch full-motion video, enjoy Silverlight animations, and run 3D applications," Microsoft's Max Herrmann writes. "All with the fidelity of a local-like performance when connecting over the LAN."

Will CUDA work with it? I.e. no need for the compute cluster driver, plus OGL, DX and interop support..

Q: Will RemoteFx support also OpenGL hardware acceleration which is the 3D high level API used by professional applications like CAD systems or medical applications ?

A: RemoteFX will support certain OpenGL applications. However, as the development of RemoteFX is still ongoing, it is too early to provide any specifics at this point.
Q: Do you plan to introduce RemoteFX also for Windows 7, because there are many scenarios where the remote system is not a server but a high-end workstation?
A: RemoteFX has been designed as a Windows Server capability to support the growing demand for multi-user, media-rich centralized desktop environments. Windows 7 will be supported as a virtual guest OS under Hyper-V.

Dynamic Memory is an improvement to Hyper-V which allows users to pool all available physical host memory together, and dynamically allocate it to virtual machines. In other words, if the workload changes, VMs can get access to extra memory without having to shut them down.

XNA forums:

Updated list of D3D12 suggestions

Unable to perform a recursive call with DirectCompute?

How to AttachBuffersAndPrecompute to ID3DX11FFT

RWStructuredBuffer counter

The IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).

Gamefest 2010 presentations?

D3D11 / D2D Interoperativity

329M pairs/sec radix sort performance, 408M keys/sec - crushes CUDPP numbers

AppendStructuredBuffer driver bug?

How to debug DirectX 11 Compute Shaders?

Creating a Shared Surface with DXGI

atomic
I have some questions about RWStructuredBuffer:
1. How to copy hidden counter to system memory? CopyStructureCount
2. How to reset the counter to zero? last argument of OMSetRenderTargetsAndUnorderedAccessViews
3. Why the performance of this counter is much more than the performance of InterlockedAdd at the element buffer? (HD 5670)
The IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).
How to AttachBuffersAndPrecompute to ID3DX11FFT?

http://gephi.org/

http://forums.xna.com/forums/t/49607.aspx
Thank you. I forgot about debug version of the D3DCSX. Debug message proved to be helpful. For the record: 1. The number of buffers attached must be exactly the same as in D3DX11_FFT_BUFFER_INFO. 2. The views MUST be created with the D3D11_BUFFER_UAV_FLAG_RAW flag (although it wasn't mentioned in documentation).

The Chrome dev channel release has support for an Open GL ES 2.0 interface for Native Client. This is something we said we would do sometime last year. When we consider it stable, documented etc. we will do more of an announcement.

Google are announcing that NaCl now also supports x86-64 and ARM.

http://www.osnews.com/story/23021/Native_Client_Portability_Almost_Native_Graphics_Layer_Engine

NaCl_SFI: Adapting Software Fault Isolation to Contemporary CPU Architectures

pnacl: Portable Native Client Executables

from GDC:

These are also graphics API translation layers:

Cider & Cedega: Direct3D on OpenGL

GameTree.tv: Direct3D on OpenGL ES

SwiftShader: DX Software Rendering (also WARP)

ANGLE Project: WebGL (OGL ES 2.0) on Direct3D

Now we need the same for GPGPU APIs, so:

cuda on opencl?

cuda on cal?

directcompute on opencl?

opencl on directcompute?

posted on opengl and cuda forums:

Questions to nvidia:

*Is Nvidia going to expose ext_gpu_shader_fp64 on GT2xx hardware with double precision, or is it for D3D11 hardware only?

For example the GTX 275

AMD seems to support double precision in GLSL via doublepAMD even on 4850 cards..

Also, is Nvidia, with the initial GL 4.0 drivers, going to finally publish documentation for wgl_nv_dx_interop and ship the texture writing and random access support shown at GTC?

Via ext_image_load_store?

Please post the PTX 1.5 and 2.0 documents..

Also, I'm listing here the things Nvidia has promised as coming soon, so let's see how long it takes before we get:

*cuda-gdb support for hardware debugging of OpenCL kernels

*cuda-gdb GPU debugger for Mac (with OpenCL support also)

Mac related:

Is mac 64 supported?

This package will work MAC OSX running 32/64-bit.

CUDA applications built in 32/64-bit (CUDA Driver API) is supported.

CUDA applications built as 32-bit (CUDA Runtime API) is supported.

(10.5.x Leopard and 10.6 SnowLeopard)

Note: x86_64 is not currently working for Leopard or SnowLeopard

CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.

CUDA applications using CUDA Runtime APIs can only be built on 32-bit applications.

My mac notes:

nvcc matrixMul_kernel.cu matrixMulDrv.cpp -I../../common/inc/ ../../lib/libcutil_i386.a matrixMul_gold.cpp -Xlinker /usr/local/cuda/lib/libcuda.dylib

nvcc matrixMul_kernel.cu -c -m64

g++ matrixMul_gold.cpp matrixMulDrv.cpp -I../../common/inc/ -I$CUDA_INC_PATH -L$CUDA_LIB_PATH /usr/local/cuda/lib/libcuda.dylib ../../lib/libcutil_i386.a

For nvcc -m64, create lib64 as a copy of lib

nvcc -m64 deviceQueryDrv.cpp -I../../common/inc/ -I../../../shared/inc -Xlinker /usr/local/cuda/lib/libcuda.dylib

Remove cutil

nvcc defaults to 32 bits

gcc defaults to 64 bits

g++

g++ deviceQueryDrv.cpp -I../../common/inc/ -I../../../shared/inc /usr/local/cuda/lib/libcuda.dylib -I$CUDA_INC_PATH

//#include

#define CU_SAFE_CALL_NO_SYNC(a) a

//CUT_EXIT(argc, argv);

export CUDA_BIN_PATH=/usr/local/cuda/bin

export CUDA_LIB_PATH=/usr/local/cuda/lib

export CUDA_INC_PATH=/usr/local/cuda/include

export PATH=$PATH:/usr/local/cuda/bin

