Monday, 30 November 2009

Two big games coming today: State of the art Direct3D 11 and 3D Vision support!

Posted on 16:15 by Unknown

Dirt 2: uses Direct3D 11 tessellation and compute shaders!
The demo is also coming today!

Avatar: according to NVIDIA:

Avatar is now being programmed directly to our 3D Vision driver, which gives the developer complete control over how to render their game in 3D. This results in numerous benefits:

1. Out of screen effects.
2. Water reflections rendered with depth.
3. Explosions, leaves falling, all rendered in 3D.
4. Crosshair rendered at object depth.
5. Full control over 3D convergence, based upon screen size and user distance.

The demo shipped two weeks ago. Use it with driver 195.62 or later.

Posted in | No comments

News from the web (IV) (big compilation)

Posted on 16:07 by Unknown

I'm finishing up posting some news from this past month:

News about GPU debuggers:
1. Allinea DDT GPU debugger beta
2. TotalView:
http://www.totalviewtech.com/company/pressrelease.html?id=315

3. What about Nexus? I'm signed up for the beta program and got a mail a month ago saying accounts would be added within two weeks. I've had no response since.

4. Also, what about the GPU debugger by Kun Zhao? I must check it.

More software
GPU-enabled Matlab beta upcoming?
Nvidia SceniX with OptiX support available (5.5)

PhysX SDK 2.8.3 available. I signed up for an account and now I'm able to get both this and the 30/09 runtime, all with x64 binaries.
See how!
GPU-enabled web browsers are coming: both IE9 and Firefox are getting GPU acceleration via Direct2D and DirectWrite. A Firefox build is available.

Both Unreal Engine 3 and CryEngine 3 (state-of-the-art engines) are now free for non-commercial use. CryEngine 3 is more restrictive and is made available only to university instructors.

Of course, Nvidia announced in press releases at SC09 both Fermi-based Tesla products for Q2 2010 and new features in the 195 OpenCL driver:

OpenCL 1.0 Extensions: NVIDIA is the only vendor supporting OpenCL features beyond the minimum conformance level. New extensions released by NVIDIA include support for double precision, OpenGL interoperability and the new OpenCL Installable Client Device (ICD). These new features supplement existing NVIDIA-only support for 2D image, 32-bit atomics and byte addressable stores.

and also the growing GPU computing software ecosystem.

These two papers also seem interesting:

CheCUDA: A Checkpoint/restart Tool for CUDA Applications
PyCUDA: GPU Run-Time Code Generation for High-Performance Computing
on gpgpu.org

and a new site that wants to replace my blog :-)
gpucomputing.net


Wishes for GPU drivers before Q2 2010!

Posted on 13:34 by Unknown

I'm putting these down as a checkpoint for next year, to see how things have evolved. Most of these have been more or less seen already or are expected to come someday:

D3D11
=====
Doubles working in DirectCompute 5.0!

Nvidia
======

WGL_DX_interop extension published and working in Vista and higher?
Fermi SM5.0 OGL extensions: SM5.0, tessellation, random-access pixel shader, pixel shader atomics, new texture formats, etc.
Test DirectX 11 perf on Fermi: Heaven bench and voxilla demos
Test DirectCompute 5.0 perf on Fermi: Sandra 2010 bench
CUDA 3.0 feature testing (whitepaper claims): dual DMA, concurrent kernels, 2x cache in spmv, local mem size 5x in quicksort, atomics 10x, switching between graphics and CUDA 10x, C++, new malloc in kernels, calling host functions in kernels, etc.
I expect all of this at least by the Nvidia 200.xx drivers and, say, CUDA 3.0 beta 2.

3D Vision
Linux: USB for Quadro without mini DIN (entry cards), as supported in Windows (XP and higher)
Linux: USB hotplugging support
Windows: windowed support and web browser support
Open APIs for using 3D Vision built in, as Avatar does

ATI
===

OpenCL: image support, 5xxx extensions, etc.
OpenCL shipping in drivers with the ICD model
OpenGL 3.2 plus 10.1 ARB extensions
OpenGL 5xxx extensions: tessellator, random-access pixel shader, etc.

I'd expect these at least before or in time for Catalyst 10.04 (for Ubuntu 10.04).

Mac
===
10.6.3 fix all OpenCL issues (see Apple FFT opencl lib issues)
some extensions towards OpenGL 3.0,3.1,3.2 ..
ati r5000 support?
fermi support?


Saturday, 28 November 2009

CUDA Atomics perf!

Posted on 05:40 by Unknown

This post shares a study of CUDA atomics:
http://strobe.cc/articles/cuda_atomics/
Also see posts of Farrar:
http://farrarfocus.blogspot.com/2009/05/cuda-compute-11-global-atomics-profiled.html
http://farrarfocus.blogspot.com/2009/05/cuda-compute-13-global-atomics-profiled.html
http://farrarfocus.blogspot.com/2009/05/dx11-generation-atomic-operations.html


GPU Compute benchmark results!

Posted on 03:27 by Unknown

Includes DirectCompute and OpenCL benchmarks..

Tested with:

nvidia 195.62
ati 9.11 whql
core i7 920
win7 rtm

directcompute 0.35 benchmark
============================

nvidia
======
dx cs_4_0
12371

opencl
59515

ati 5850 oc (950mhz)
=========
dx cs 5.0
144815
cs 4.1
139568
cs 4.0
139458

opencl
519486

cpu
===

2900

Sisoft sandra 2010 lite
=======================

All results are in Mpixels: float perf first, then double perf. All doubles are emulated (why on Nvidia OpenCL and Nvidia CUDA too?)

nvidia gtx 275
==============

opencl

445
30

cuda
441
51

dx cs
523
26.1

ati 5850
========

dx
2000
100

stream
745
370

ati 5850 oc (950mhz mem 1100mhz)
======

dx
2510
133.6

stream
976
484

opencl cpu (core i7)
===
59
33


Interesting AMD Stream forums posts! (old posts)

Posted on 02:55 by Unknown

Over time I have compiled links to some interesting posts in the AMD Stream forums: some explain things that were not documented at the time, others contain interesting code or projects built with it, etc.

Here it goes..

Released GPUwareC with GWSDK 0.5.1 test

GPUware C 0.5 test release

The GPUware C compiler allows one to code in a C-like language to construct AMD GPU kernel calls. It allows one to program with AMD's CAL, to produce high performance software, and still code kernels in a high level language. The AMD IL produced by the compiler is very readable, and can be easily modified by hand.

Can LDS reads be "broadcast" within a wavefront?

LDS : More info requested

Features of Stream SDK 1.2?

Double memory copy in CAL ? What about calCtxResCreate ?

GPU memory architecture?

Measuring HD 4850 performance 1tflop shader

You can have max 128 registers per thread. Number of wavefronts that can be executed on a single SIMD is decided by register usage in your shader (Total registers per SIMD are 64*256).

bursting global reads and global memory bandwidth?

global GPR vs. global data store (another)

As for PV/PS, you cannot turn them off and you really would not want to turn them off as they provide a performance bonus over normal register usage.
Maximum 2D stream dimensions supported are 8192x8192 and the maximum 1D dimension supported is 2^26.

Either you can rearrange data to match these dimensions or you can try changing the algorithm to process the data tile-by-tile on the GPU (take a look at the out-of-core MMM in samples/CPP/apps). The 4870 has the same limitation.
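The tile-by-tile idea from that reply can be sketched in a few lines. This is only an illustration of how a large 2D domain could be cut into pieces that respect the 8192x8192 limit quoted above; `tile_domain` is a hypothetical helper of mine, not something from the Stream SDK samples.

```python
MAX_2D_DIM = 8192  # per-side 2D stream limit quoted in the forum reply


def tile_domain(width, height, max_dim=MAX_2D_DIM):
    """Split a width x height domain into (x, y, w, h) tiles, each side <= max_dim."""
    tiles = []
    for y in range(0, height, max_dim):
        for x in range(0, width, max_dim):
            tiles.append((x, y, min(max_dim, width - x), min(max_dim, height - y)))
    return tiles


if __name__ == "__main__":
    # Hypothetical 20000 x 10000 domain: each tile would be one kernel launch.
    for tile in tile_domain(20000, 10000):
        print(tile)
```

Each tile would then be uploaded, processed and read back on its own, which is roughly what an out-of-core approach does.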

About r7xx arch
Ok,
So there are 163840 registers on the RV770. There are 10 SIMDs, so that gives us 16384 registers per SIMD, or 16K x 128bit as specified in the Registers per SIMD Core row.
Now, the article states right above the table that there are 64 threads per wavefront. So, 16384 / 64 gives you 256 registers per thread.
If you run a problem domain of 1026 * 1026, assuming 1 thread per location, that gives you 1,052,676 threads that need to be executed.
Dividing that by the wavefront size gives you 16449 (must round up) wavefronts that will be spawned by the GPU for this domain.
Now, let's assume that you have 5 registers per thread (which can be determined from the KSA disassembly); this lets you run a MAX of (256/5) = 51 wavefronts in parallel per SIMD, or 510 at a time on the GPU.
So this means that you have enough wavefronts to fill up the GPU at least 32 times.
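The arithmetic in this forum answer is easy to replay. Here is a small sketch that only reproduces the numbers given in the post (163840 registers, 10 SIMDs, 64 threads per wavefront, 5 registers per thread, a 1026 x 1026 domain). The function name and the dictionary keys are mine, and it ignores any other hardware limit on wavefronts per SIMD.

```python
def occupancy(total_registers, num_simds, wavefront_size, regs_per_thread, domain_threads):
    """Replay the register-based occupancy estimate from the forum post."""
    regs_per_simd = total_registers // num_simds            # 163840 / 10
    max_regs_per_thread = regs_per_simd // wavefront_size   # 16384 / 64
    # Round up: a partially filled wavefront still has to be launched.
    wavefronts_needed = -(-domain_threads // wavefront_size)
    wavefronts_per_simd = max_regs_per_thread // regs_per_thread
    wavefronts_in_flight = wavefronts_per_simd * num_simds
    return {
        "regs_per_simd": regs_per_simd,
        "max_regs_per_thread": max_regs_per_thread,
        "wavefronts_needed": wavefronts_needed,
        "wavefronts_per_simd": wavefronts_per_simd,
        "wavefronts_in_flight": wavefronts_in_flight,
        "gpu_fills": wavefronts_needed / wavefronts_in_flight,
    }


if __name__ == "__main__":
    for key, value in occupancy(163840, 10, 64, 5, 1026 * 1026).items():
        print(key, value)
```

Changing `regs_per_thread` shows how quickly heavier kernels reduce the number of wavefronts that can be resident at once.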

So, assuming that your application gets all of the resources on the chip, this is what you should expect. However, because of other constraints this is the best case scenario and not the average case. So this should give you some idea about what you can do.

Hope this helps. That review article is fairly well done and if you analyze it with a compute mindset you can figure out a lot of things that our docs don't currently specify.
http://www.anandtech.com/printarticle.aspx?i=3341

Using 4870x2
============

Many wasted hours later, I think I've found my problem. I needed to call calCtxIsEventDone() after calling the kernel for each GPU to allow the concurrency to occur. Seems like a messy trick - perhaps the multiGPU paragraph in the user guide could be expanded to mention this?

Take a look at section 2.16.3 of the stream computing user guide to see how to use multiple GPUs in a single thread. I would suggest creating separate threads for multiple GPUs, as leveraging asynchronous kernel calls requires lots of tuning and the call might not be asynchronous in some cases. Take a look at the Brook+ sample MonteCarlo_MultiGPU and the tutorial MultiGPU.

Measuring CAL time
==================

The correct way to time a CAL kernel is to follow this pattern:

flush
start timer
execute kernel
wait on event
stop timer
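As a sketch of that pattern, here is a generic timing helper. The `flush`, `launch` and `wait` arguments are hypothetical callables standing in for whatever the runtime provides; they are not CAL function names.

```python
import time


def time_kernel(flush, launch, wait):
    """Time one kernel: flush pending work, start timer, launch, wait on event, stop timer."""
    flush()                          # make sure earlier queued work is not counted
    start = time.perf_counter()      # start timer
    event = launch()                 # execute kernel, get an event/handle back
    wait(event)                      # block until the kernel has really finished
    return time.perf_counter() - start  # stop timer


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without a GPU.
    calls = []
    elapsed = time_kernel(
        flush=lambda: calls.append("flush"),
        launch=lambda: calls.append("launch") or "event-0",
        wait=lambda event: calls.append("wait:" + event),
    )
    print(calls, elapsed)
```

The important part is the order: without the initial flush the timer may include older work, and without the wait it only measures the time to enqueue the kernel.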



Testing my apps with 8600GTS and WinXP!

Posted on 02:43 by Unknown

Testing my apps and other GPU computing benchmarks with a desktop with 8600GTS with 256MBytes and Windows XP!
It's a normal user desktop so no Visual Studio installed, etc..

I have learned:

+ Some of my apps (OpenCL-z and matmul) need the Visual C++ 2008 runtimes (either the plain or the SP1 ones). Download links are in the comments of the respective app posts.

+ My matmul demo needs the ATI CAL libraries to be present on the system. It's as easy as downloading an ATI driver, searching for the files
in Packages\Drivers\Display\{XP,W7,XP6A,W76A}_INF\B_xxxx\
and copying aticaldd.dll, aticalcl.dll and aticalrt.dll.
Most likely these files are compressed and named .dl_, so use expand name.dl_ name.dll.

+ My Apple demos run fine except Nbody, because of memory issues. I had coded command-line parsing to set the number of particles, but I hadn't tested it and it isn't working. It's fixed in the upcoming source code, so then set -n 1024, for example.
Also, Procedural_Grass_and_Terrain has some visible errors. I thought it was because of memory issues, so I forced AA off in the Nvidia Control Panel (it was set fairly high, 8x) to reduce memory pressure, but the errors are still present.

+ In Windows XP the OpenCL Volume3D demo runs nearly as fast as the CUDA one. In fact I get a sustained 60 fps in CUDA vs 40-60 fps in OpenCL. Note that while rotating the volume with the mouse I also get a sustained 58 fps in OpenCL, so getting 14 fps in OpenCL on a high-end desktop with a GTX 275 in Win7 seems mediocre.


A lot of Catalyst AMD drivers!

Posted on 02:28 by Unknown

In two months' time a lot of Catalyst drivers have been released, and the situation is now somewhat crazy.
I'm only talking about the 9.10, 9.11 and 9.12 based ones.
Here they are:
Note: I assign the month based on 8.66=9.10, 8.67=9.11, 8.68=9.12.
9.10 beta AMD 58xx launch day (8.660) (search MSI)
9.10 beta AMD 57xx launch day (8.660) new beta (search hotfix AMD 57xx in AMD KB)
9.10 WHQL (Windows 7 launch day) (8.661?)
9.10 NFS Shift and Hemlock driver (8.663) (search in web for it)
9.11 OpenCL beta (8.670)
9.11 WHQL (8.671)
9.11 Dirt2 CrossfireX hotfix (8.673)
9.12 beta (8.68) beta for WinXP leaked
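The month mapping used in the note above (8.66=9.10, 8.67=9.11, 8.68=9.12) can be written down as a tiny helper. This is my own sketch, and it assumes the branch number keeps increasing by one per monthly release, which is only an extrapolation from the few builds listed here.

```python
def catalyst_release(build):
    """Map an internal build like '8.671' to a Catalyst label like '9.11'.

    Assumption: branch 8.66 is October 2009 and each later branch is one month later.
    """
    major, minor = build.split(".")
    if major != "8":
        raise ValueError("only the 8.xx build series is covered by this sketch")
    branch = int(minor[:2])                  # '671' -> 67, '66' -> 66
    month_index = 9 + (branch - 66)          # October 2009 is month index 9 (0-based)
    year = 9 + month_index // 12
    month = month_index % 12 + 1
    return f"{year}.{month:02d}"


if __name__ == "__main__":
    for build in ("8.660", "8.663", "8.671", "8.673", "8.68", "8.69"):
        print(build, "->", catalyst_release(build))
```

The third digit (8.660 vs 8.663) is treated as a revision inside the same monthly branch, which matches how the list groups the hotfix drivers.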

My advice: normal users should stay with 9.11 WHQL; testers, or those who want to play Dirt 2, should use the 9.11 Dirt2 hotfix; and testers on Windows XP can try 9.12.
All of that except for NFS and Hemlock users, who should stay with 8.663?

9.12 seems to include a lot of OpenGL 3.2 extensions and 10.1 extensions.
It also has CAL 1.4.492, newer than the 1.4.467 required for OpenCL.

Also note, for Linux, that the guy behind VAAPI XvBA mentions testing fglrx 8.69 (so 10.01).
Also note that the Unigine Heaven benchmark guys mention that the Linux demo using tessellation is hoped for a January release, pending AMD drivers that support it (I still don't know whether they are waiting for full OpenGL 3.2 support or for a DX11 tessellation extension, as tessellation for 4xxx cards is presently supported through a proprietary extension).
So I hope 10.01 will be a good driver for OpenGL users (OpenGL 3.2 and 5xxx extensions), as good as 9.1 was (it supported 3.0).


Friday, 27 November 2009

News from the web III

Posted on 08:28 by Unknown

1. Nvidia NPP 1.0 released: an image processing library for GPUs (CUDA), similar to Intel IPP

With that, Nvidia has performance libraries similar to Intel's for CPUs:
MKL(dense BLAS,fft,sparse matrix ops,lapack)<->CUBLAS,CUFFT,CUSP,MAGMA and CULAtools
NPP<->IPP
http://forums.nvidia.com/index.php?showtopic=151671

Nvidia also has CUDPP and Thrust for parallel primitives (sort, reduction, etc.)

2. D3Dgear 3.55 fixes all known issues for D3D11 apps (see previous post)
With Fraps 3.03 I'm still not getting the FPS GUI in Win7.

3.Mandelbulbs, Mandelbrot sets transformed into 3D
Includes Optix for raytracing (uses CUDA) in GPU
Also a GigaVoxel port in progress..


Thursday, 26 November 2009

News from the web II (big compilation)

Posted on 02:25 by Unknown

*SC09 news:
*Larrabee perf exposed on matmul and sparse matvec mul..

On the SGEMM single precision, dense matrix multiply test, Rattner showed Larrabee running at a peak of 417 gigaflops with half of its cores activated (presumably the 80-core processor the company was showing off last year); and with all of the cores turned on, it was able to hit 805 gigaflops. As the keynote was winding down, Rattner told the techies to overclock it, and was able to push a single Larrabee chip up to just over 1 teraflops, which is the design goal for the initial Larrabee co-processors.
Here's the next problem. Sparse matrix math is what is commonly needed in simulations involving cloth and water. And on that test, a Larrabee chip that was not overclocked was able to do between 7.9 and 8.1 gigaflops, depending on the test and the size of the matrices.

But what he did say is that the Ct dialect of C++ that Intel has created will be going into beta soon to help with the parallelization of C++ code to run on multicore and multithreaded processors, and more importantly, to spread code across CPUs and GPU-based co-processors in workstations and services to maximize performance as transparently as possible. Ct will work in conjunction with the CUDA environment from Nvidia for its GPUs and for the OpenCL environment being pushed by Advanced Micro Devices and others.

Intel is also cracking the issue of sharing data between Core and Xeon CPUs and Larrabee GPU co-processors. Future Core and Xeon chips will be able to create a virtual shared memory pool that both the CPU and GPU can access so datasets are not crunched down, serialized, and moved over the PCI-Express bus from the CPU to the GPU and then back again after calculations are done. The shared virtual memory allows the CPU and GPU to work off the same data in sequence without any movement, which should radically improve performance and smooth out simulations.

*Improved Nvidia GPU and Infiniband interop: it allows the same pinned memory to be used by both GPU and Infiniband devices (Mellanox drivers and CUDA release around Q2 2010). This avoids a copy in host memory, or having to avoid pinned memory. A general way of using GPU DMA to send to other DMA devices is still lacking.
*CUDA 3.0beta and drivers public..
*OpenMP work towards 3.1-4.0

*Magma 0.2 released without source (expect in december) still no OpenCL support..

* LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double);
* LQ and QL factorizations in real arithmetic (single and double);
* Linear solvers based on LU, QR, and Cholesky in real arithmetic (single and double);
* Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in real arithmetic;
* Reduction to upper Hessenberg form in real arithmetic (single and double);
* MAGMA BLAS in real arithmetic (single and double), including gemm, gemv, symv, and trsm.

See:
http://icl.cs.utk.edu/projectsfiles/magma/pubs/MAGMA-BLAS-SC09.pdf
http://icl.cs.utk.edu/projectsfiles/magma/docs/magma_roadmap.pdf
*Cula 1.1:

Here is a subset of the improvements that have made it into this release:

* Exciting new functions including general Eigensolver (Premium Feature)
* Bridge interface for migrating currently existing LAPACK/MKL code
* Better documentation including a full API reference
* New examples constructed from user feedback
* More performance!
* Mac OS X support (Preview)

eigensolvers in pro version
now supports Mac though only Leopard and single precision: what about Snow Leopard and double precision?
*OpenMM 1.0beta released: OpenCL preliminary support.. still no binaries with it!

This release adds support for Particle Mesh Ewald, arbitrary forms for non-bonded interactions, and preliminary support for OpenCL.

*Apple OpenCL FFT lib: seems very high perf. only Mac..perf issues until 10.6.3?

Currently supports 1D, 2D, 3D batched complex-to-complex transforms (inverse and forward) both in-place and out-of-place transforms.

Using plannar and interleaved data format but current only supports transform on GPU device. Accelerate framework can be used on CPU.

Current version supports sizes that fits in device global memory although "Twist Kernel" is included in fft plan if user wants to virtualize (implement sizes larger than what can fit in GPU globalmemory).

*gpu-z 0.37: shows DirectCompute (including the supported version) and OpenCL check boxes. ATI OpenCL is not detected.
*gbench 1.0 released: based on the Matlab Jacket product, similar to Matlab's built-in bench function, and it also works without Matlab.
It checks FFT, dense BLAS, etc.

Benchmarks include six different tasks, common to the technical computing community:

1. LU: LU decomposition of 1024 x 1024 matrix
2. FFT: Fast Fourier Transform of a 2^20 x 1 vector
3. BLAS: Matrix multiplication of two 1024x1024 matrices
4. 3D Conv: Convolution of 64x64x64 array with 3x3x3 kernel
5. FOR/GFOR: Matrix-vector multiplication of 1024x1024x32 array
6. Equations: Solution of a system of 1024 equations

*3D Vision news:
->The Avatar demo with 3D Vision built in is impressive, though it goes from 60 to 20 fps.
You have to use the D3D10 path, though; D3D9 seems fixed in 195.62.
->3D Vision on Linux is supported for Quadro cards in 195.22 (Quadro only; it requires the mini DIN connector, connected before X starts, with no hotplug)
->3D Vision 195.55 and higher ship with browser plugins (IE, Firefox) for 3D photos, with windowed support also upcoming. See TweakTown.
*Nvidia released 195.62 WHQL candidate and 195.22 for Linux public..
*AMD released 9.11 WHQL CAL supports OpenCL
*Direct3D 11 benchmark for Stalker
*PGI 2010: CUDA fortran and accelerator model for Windows and MAC and stable for Linux
*Khronos OpenCL BOF presentations posted: especially interesting are LANL pdf showing perf of molecular code of VMD (electrostatic potential) on both Intel SSE multicore,OpenCL (CPU,AMD,NVIDIA and also Cell)..
What you learn:
It shows perf issues on Cell caused by the lack of __constant and how to overcome them.
It shows perf tables for all these architectures.
It points out the key issues in OpenCL right now.
Fermi as a GPU:
http://techreport.com/articles.x/17815

Posters about GPU computing
of GTC
of SC09

Porting an efficient bit library to CUDA (with preliminary perf)
http://bmagic.sourceforge.net/bmcudasse2.html

Implementing integer multiprecision in OpenCL
on cuda: "Implementation of Multiple-precision Modular Multiplication on GPU" Kaiyong Zhao
see poster:
http://www.nvidia.com/content/GTC/posters/87__Kaiyong_Implementation_of_Multiple-precision.png

There is also work by Daniel Bernstein on elliptic curves, and also on RSA, both at the Eurocrypt 2009 conference.
MPIR on GPU is also of interest.

Source code of DCGN – Message Passing on GPUs released (old news):
http://jeff.bleugris.com/journal/2009/06/02/looking-for-dcgn/
http://jeff.bleugris.com/journal/projects/
Note that I don't know whether the code has been updated. If not, that's somewhat bad, since CUDA 2.2 introduced pinned host memory that the GPU can access, which avoids having the CPU poll with cudaMemcpy GPU->CPU to check whether the GPU has new work to do. Now the polling is done on CPU memory and the GPU writes to CPU memory.

fem codes on CUDA:
http://sites.google.com/site/monkology/gpuprogramming-project3-final

papers/posters:
fluid on GPU by Michael griebel as poster on GTC09
indexing the internet with gpu (cuda zone)

Posted on Apple OpenGL forums:
Here is a simple example that uses GLUT, it reads a png image (arg1) creates a source and dest texture/image, then uses a kernel to clip out the red.
example of simple cg-gl interop

This is interesting: a year ago I was looking into building WRF on Windows. There had been some efforts some years ago, but overall it was a hacky port, and based on old code too.
This was for testing the WRF perf of the CUDA ports of the WSM5 physics microkernels.
There is a web page:
Now PGI has made my dreams come true and provides a very clean patch file for the latest WRF (3.1.1) for compiling with the latest PGI compilers. I think 9.0-4 or higher is needed, but 10.0 should now also support it. Anyway, it's good news for Windows users, and I want to obtain a VS2008 port from this. It may need some work, but a lot less than ever. See the PGI October newsletter:
"Porting the Weather Research and Forecasting Application to Microsoft Windows Using PGI Workstation"
It is also good to know that this has been done for the same purpose I had in mind: to test WRF working on the GPU, now with the Accelerator model.
There is another article in the same newsletter.

ATI 9.12beta (8.68) only XP
includes ATI CAL 1.4.492 vs. OpenCL beta4 CAL (1.4.467)

Windows guest drivers for KVM
http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

Virtual texturing demos
http://linedef.com/personal/demos/?p=virtual-texturing

Hierarchical voxel rendering demo
http://linedef.com/personal/demos/?p=hierarchical-voxel-rendering


News from OpenCL forums!

Posted on 02:21 by Unknown

1. It seems that the AMD implementation avoids running CPU kernels and GPU kernels simultaneously, even when they are run asynchronously on different queues. It seems they are serialized, which, if true, can be considered a perf issue.
Search the forums for:
Possible to run OpenCL code on GPU and CPU concurrently?
It seems not.
Further information: I checked the

CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END

for each kernel, and the second kernel (the CPU kernel) is indeed waiting until the first kernel (the GPU kernel) finishes before it gets submitted. Both end up in the queue immediately (and there are two command queues), but the second doesn't get submitted until the first finishes.
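The check described in that post can be sketched with plain numbers. The four timestamps per kernel correspond to the four CL_PROFILING_COMMAND_* values (in nanoseconds); the dictionaries and the sample values below are invented for illustration and are not measurements.

```python
def kernels_overlap(first, second):
    """True if the execution intervals [start, end) of the two kernels intersect."""
    return first["start"] < second["end"] and second["start"] < first["end"]


def submitted_after_finish(first, second):
    """True if the second kernel was only submitted once the first had ended."""
    return second["submit"] >= first["end"]


if __name__ == "__main__":
    # Made-up timestamps (ns) showing the serialized behaviour described in the post:
    # both kernels are queued right away, but the CPU kernel is submitted late.
    gpu_kernel = {"queued": 0, "submit": 10, "start": 20, "end": 1000}
    cpu_kernel = {"queued": 5, "submit": 1005, "start": 1010, "end": 1800}
    print("overlap:", kernels_overlap(gpu_kernel, cpu_kernel))
    print("serialized:", submitted_after_finish(gpu_kernel, cpu_kernel))
```

With real data the values would come from clGetEventProfilingInfo on the event of each kernel, with profiling enabled on both command queues.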

2. AMD samples are noted to be unoptimized, but you can get more performance with minor tweaks:
I thought I'd post a little update on this. Once I delved into the code a bit more, I found that the default block size was 8. Once I changed this (and once I modified the code so it didn't give me an error that it was set too high), many of the examples run much faster on the gpu than before.

In another thread it was suggested that the group size should be equal to the wavefront size, which is 64 for 48xx and 58xx.
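If you follow that suggestion, the global work size has to be a multiple of the group size, since OpenCL 1.0 rejects a global size that the local size does not divide evenly. A minimal sketch of the usual rounding (the helper name is mine):

```python
WAVEFRONT_SIZE = 64  # wavefront size mentioned in the thread for 48xx and 58xx


def round_up_global_size(work_items, group_size=WAVEFRONT_SIZE):
    """Smallest multiple of group_size that covers work_items."""
    return ((work_items + group_size - 1) // group_size) * group_size


if __name__ == "__main__":
    for n in (64, 65, 1000, 1026 * 1026):
        print(n, "->", round_up_global_size(n))
```

The kernel then needs a bounds check so the padding work-items do not touch memory past the real problem size.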

3. There seems to be a transfer perf issue, although with the OpenCL bandwidth test from the Nvidia SDK you get high bandwidth (I use mapped mode; it may be so good because I own a Nehalem, and bandwidth is also that good on Nvidia without using pinned memory). That holds only for d2h or h2d in this sample; d2d seems not to be good.

I use clEnqueueRead/WriteBuffer with blocking mode on Radeon HD 5750.
But write throughput is lower than the result of PCIeSpeedTest (ATI Stream Power Toys).
And read throughput is much lower than write throughput. Why?

Test pseudocode:
size = 1024*1024*64;
NUM_TIMING_LOOPS = 100;
buf = clCreateBuffer(context,CL_MEM_READ_WRITE,size,NULL,&errcode);
stopwatch.start (); // use PerformanceCounter
for (int i = 0; i < NUM_TIMING_LOOPS; i ++)
    clEnqueueWriteBuffer(queue,buf,CL_TRUE,0,size,ptr,0,NULL,NULL);
stopwatch.stop ();
printf (...);

Result:
write: 2.575GB/s
read: 1.197GB/s

PCIeSpeedTestResult (v0.2):
[ 67108864 bytes] CPU->GPU= 4.851 GB/sec, GPU->CPU= 861.791 MB/sec

Confirmation of OpenCL perf issue:
This is because of the difference in implementation of PCIeSpeedTest and OpenCL. The PCIe Speedtest goes directly to pinned memory while the OpenCL version copies to PCIe and then to the user memory. We are working on a more optimized path that can avoid this copy under certain conditions in a future release.
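For reference, this is how a throughput figure falls out of a timing loop like the one in the forum pseudocode (64 MiB buffer, 100 blocking transfers). The helper is my own sketch; whether the forum numbers use 10^9 or 2^30 bytes per GB is not stated, so the divisor is a parameter.

```python
def throughput_gb_per_s(bytes_per_transfer, num_transfers, elapsed_seconds, gb=1e9):
    """Average transfer rate over a timing loop, in GB/s (gb = bytes per GB)."""
    return bytes_per_transfer * num_transfers / elapsed_seconds / gb


if __name__ == "__main__":
    size = 1024 * 1024 * 64      # buffer size from the pseudocode
    loops = 100                  # NUM_TIMING_LOOPS from the pseudocode
    elapsed = 2.5                # hypothetical stopwatch reading in seconds
    print(throughput_gb_per_s(size, loops, elapsed), "GB/s (10^9 bytes)")
    print(throughput_gb_per_s(size, loops, elapsed, gb=2**30), "GiB/s")
```

Because the calls are blocking (CL_TRUE), the stopwatch covers the whole transfer, including the extra staging copy the AMD reply describes.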

4. Nvidia provides an OpenCL visual profiler and AMD is working on similar tools:
We'll be providing an MSVS-integrated profiler that will be capable of reporting the profiling counters in the next release. In the next few months, we'll also provide a Stream Kernel Analyzer that will accept OpenCL C for static analysis of your kernels.
Meanwhile use the solution in my first post on the blog to get kernels in AMD IL code..

5. printf works in CPU kernels in the Linux OpenCL backend (on Apple there is a similar debug extension).
DUMP
Yes printf currently is only supported on the CPU device as there is no standard library in OpenCL that contains the printf function in GPU , so it is not valid on every device. This is stated in 6.8.f of the OpenCL 1.0 spec. Apple does support printf in the kernel as a standard debug strategy when using the CPU device.
Here is what I did to get printf to work within my kernels (I am using OpenSUSE though). I just put the stdio.h file in my working directory
Code:
const char *header = "-I stdio.h\0";
err = clBuildProgram(program, 1, devices, header, NULL, NULL);

on GPU? because i have 9.9 and on CPU work too.
of course gpu ... 9.9 didnt support opencl seems like 9.11 does.

6.AMD OpenCL doesn't work with MingW.

7. AMD reports R6xx cards as supported OpenCL devices, although they then fail:

Profiling : Yes
Platform ID: 00000000
Name: ATI RV610
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.467
Profile: FULL_PROFILE
Version: OpenCL 1.0 ATI-Stream-v2.0-beta4
Extensions:
Thanks for reporting this. The 6XX series of cards do not have the required hardware to execute OpenCL kernels, so this should not have been displayed as available for execution.

8. Example of a reduction in 3 passes using shared registers. He had problems getting it to work; the key issue seems to be:
Shared register not updated as it ought to be.
Answer:
Just went through our documentation. One very important piece of information is left out that will fix your problems. Access to shared registers is only atomic if done in a single instruction.

i.e.
iadd sr0, sr0, sr1 is correct
but
mov r0, sr0
mov r1, sr1
iadd r2, r0, r1
mov sr0, r2 is incorrect because of the even/odd wavefront issue.


Bugs in OpenGL AMD drivers: Geometry shader and texelFetch2D

Posted on 01:02 by Unknown

This prevents testing Cyril Crassin's marching cubes demo,
which uses geometry shaders and texelFetch2D.
I need to test again when geometry shaders are exposed, I hope in the upcoming 10.01.
Geometry shader and texelFetch2D
A geometry shader (perhaps a vertex or fragment shader also) containing this code:
uniform isampler2D triTableTex;

int triTableValue(int i, int j){
return texelFetch2D(triTableTex, ivec2(j, i), 0).a;
}

using texelFetch2D from an integer texture fails with:
Compilation error: Geometry shader failed to compile with the following errors:
ERROR: 0:60: 'texelFetch2D' : no matching overloaded function found

if texture is float there are no errors.

Please see detailed post on AMD forums:
http://forums.amd.com/forum/messageview.cfm?catid=347&threadid=117742&highlight_key=y


Testing LDS perf in OpenCL!

Posted on 00:47 by Unknown

Found on AMD forums:

I ran this kernel :

Code:
__kernel void ldsReadBandwidth(__global float4 *output)
{
__local float4 lds[128];
float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
val = val + lds[0];
val = val + lds[1];
val = val + lds[2];
val = val + lds[3];
//......
val = val + lds[127];
output[get_global_id(0)] = val;
}
and subtracted following kernel's execution time from the above one to eliminate adds and writes to global memory :

Code:
__kernel void ldsReadBandwidth(__global float4 *output)
{
float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
val = val + val;
val = val + val;
val = val + val;
val = val + val;
//......
val = val + val;
output[get_global_id(0)] = val;
}
Result : I am getting 540 GB/s on Radeon 5770.

That's a bit disappointing, as it's only 1 float per Vec5 unit per clock. In ATI's counter-Fermi presentation they stated the 5870 could access LDS at 960 floats per clock, i.e. 3 per Vec5 per clk.
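The "1 float per Vec5 unit per clock" reading can be checked with a quick calculation. The clock and unit counts below are not in the forum post; I'm assuming the commonly published specs (850 MHz core clock for both cards, 800 stream processors = 160 Vec5 units on the 5770, 1600 = 320 on the 5870).

```python
def floats_per_unit_per_clock(bandwidth_bytes_per_s, clock_hz, num_units, bytes_per_float=4):
    """Convert a measured LDS bandwidth into floats read per Vec5 unit per clock."""
    floats_per_second = bandwidth_bytes_per_s / bytes_per_float
    return floats_per_second / clock_hz / num_units


if __name__ == "__main__":
    # Measured figure from the post: 540 GB/s on a Radeon 5770 (specs assumed, see above).
    print("5770 measured:", floats_per_unit_per_clock(540e9, 850e6, 160))
    # Claimed figure from the post: 960 floats per clock on a 5870 (320 Vec5 units assumed).
    print("5870 claimed:", 960 / 320)
```

Under those assumptions the measured number is very close to 1 float per unit per clock, against the 3 per unit per clock implied by the 960 floats/clock claim.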

A question: does the performance stay the same if you add writes to the LDS?


OpenCL bugs!

Posted on 00:45 by Unknown

Hi, although this may be a crazy idea, I'm posting all the bugs and issues present in the current AMD, NVIDIA and Apple implementations, in the hope of checking the correctness of upcoming implementations.

Right now the implementation with the fewest bugs is Nvidia's in the 195 drivers. So let's start:

Nvidia 195:
*Kernels with no parameters don't compile
*ATI AES sample: a&0x80->if(a>127) a=128
*Nvidia lists perf issues in Vista and Win7 with multiple GPUs
*I observe very slow speed in the Vol3D demo in Win7 vs Linux
*(Not in the OpenCL standard, but it works in other implementations) implicit conversion of uint4 to float4; also some math functions, such as hypot, are not found

ATI
*Not currently using the Windows OpenCL ICD model
*Currently no OpenGL interop, image support, doubles, or retrieval of built binaries for 4xxx and 5xxx
*No extensions for 5xxx (global and local atomics, byte-addressable stores, 3D image writes)
*CPU and GPU do not work simultaneously
*Volatile qualifiers
*Memory leaks
*uint4 and bitwise operators bug:

/tmp/OCLPevl34.cl(373): error: bad argument type to opencl convert_* function:
expected src and dst have the same number of elements
float4 f = convert_float4(((int4)1) && ((int4)0));
^
Apple
Perf issues in runtime and code generation in 10.6.2 (?) in Apple OpenCL FFT.


Benchmarking OpenCL and DirectCompute!

Posted on 00:29 by Unknown

*SiSoftware Sandra 2010 (OpenCL, CUDA, CAL) Still not released
*NLM demos in Beyond3D (OpenCL and DirectCompute)
*DirectCompute Benchmark 0.35 (OpenCL and DirectCompute)
*gbench (CUDA)


Wednesday, 25 November 2009

Benchmarking scientific kernels on OpenCL!

Posted on 11:10 by Unknown

I hope you're interested in measuring OpenCL perf!
I expect to have some updates on my blog that may interest you!
Here are the things I plan to do in the coming weeks.
In case you're interested in measuring OpenCL perf on key scientific kernels, I will try to test:

*Peak flops: what perf I'm able to get using a kernel of mads (multiply and add) in integers, floats, doubles and integers with 24 bits of precision.
*Dense linear algebra:BLAS code (matmul for example): linear algebra perf... Linpack uses BLAS calls..
*FFT code (fast fourier transforms): Useful in image processing/ compression and scientific simulations
*Sparse linear algebra: sparse matrix vector product
*Multiprecision int: cryptography
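For the peak flops item, the usual convention is to count each mad as two operations (one multiply, one add). A small sketch of how the figure would be derived from a timed run; the function and its parameters are hypothetical, not taken from my benchmark code:

```python
def mad_gflops(threads, mads_per_thread, elapsed_seconds, lanes=1):
    """Giga-operations per second for a kernel made only of multiply-adds.

    lanes is the vector width of each mad (for example 4 for float4).
    """
    operations = 2 * threads * mads_per_thread * lanes  # 1 mad = 1 mul + 1 add
    return operations / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: one million threads, 1000 float4 mads each, 20 ms kernel time.
    print(mad_gflops(1_000_000, 1000, 0.020, lanes=4), "Gflops")
```

The same formula applies to the integer and double variants; only the kernel and the measured time change.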

Soon (this week or next, I hope) I'll post benchmarking code for FFT (basically a Windows port of the Apple FFT library: http://developer.apple.com/mac/library/samplecode/OpenCL_FFT/index.html) and for exploitable peak gflops (in single, integer, integer24 and double operations).
Also, perhaps this month, some integer multiprecision perf and sparse matrix perf (conjugate gradients).
It may also interest you that I have already posted an old (but still interesting) benchmark based on very efficient code for testing matmul perf (in CAL, CUDA and multicore SSE):
http://oscarbg.blogspot.com/2009/11/matmul-bench-for-cuda-cal-and-multicore.html
This tool is of much interest now: Larrabee's matmul performance was unveiled at SC09 by Rattner a week ago to be near 800 Gflops, and using my tool on a 5850 I get nearly 750 Gflops.
An Nvidia GTX 275 gets only 400-450 Gflops (Fermi will double that, I hope), so it seems all GPUs are currently similar in matmul performance.
Currently matmul performance on OpenCL seems to be low (at least 2x slower), but that may be because the OpenCL code currently available in the SDKs is not optimized for either Nvidia or AMD.
I will try to get an efficient port, at least for Nvidia, before the end of the year. Hopefully also efficient for CPUs and AMD GPUs.
Larrabee's sparse matrix multiply performance was also unveiled: 8 Gflops.
Note, though, that the specific sparse matrix format was not revealed (that can have an impact on performance). See the current most efficient implementation on Nvidia GPUs (CUDA based): http://www.nvidia.com/object/nvidia_research_pub_013.html
The code is here:
http://cusp-library.googlecode.com/files/sc2009_spmv.zip
I will try to get binaries for Windows testing..
Currently you could test an efficient CUDA code for this kernel using "Concurrent Number Cruncher" from Inria. Well, it won't work as shipped because it was compiled with CUDA 1.0 (although CUDA is meant to be binary compatible across releases, it seems binaries compiled for CUDA 1.0 aren't), so you have to recompile with the new CUDA libs. I have these new binaries and can supply them to you.
I'm getting between 3 and 5 Gflops, but note these kernels are bandwidth bound, not computation bound, so it really depends on memory bandwidth and perhaps on memory performance in general.
So Fermi, with cached global mem, will possibly get more than a 2x increase: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
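To make "bandwidth bound" concrete, here is a back-of-envelope sketch, not a measurement. It assumes a single precision CSR-like format where every nonzero costs 2 flops and moves 12 bytes (4-byte value, 4-byte column index, 4-byte gather from the input vector with no cache reuse). Those byte counts and the function names are my assumptions, not taken from any of the codes mentioned above.

```python
def spmv_gflops_bound(bandwidth_gb_s: float,
                      bytes_per_nonzero: float = 12.0,
                      flops_per_nonzero: float = 2.0) -> float:
    """Upper bound on SpMV Gflops when memory traffic is the only limit."""
    return bandwidth_gb_s * flops_per_nonzero / bytes_per_nonzero


def implied_bandwidth_gb_s(gflops: float,
                           bytes_per_nonzero: float = 12.0,
                           flops_per_nonzero: float = 2.0) -> float:
    """Memory traffic that a measured SpMV rate corresponds to."""
    return gflops * bytes_per_nonzero / flops_per_nonzero


# Hypothetical card with 120 GB/s of memory bandwidth.
print(spmv_gflops_bound(120.0))
# Traffic implied by a 4 Gflops run, the middle of the 3-5 Gflops range above.
print(implied_bandwidth_gb_s(4.0))
```

Under this model any reuse of the input vector from a cache lowers the bytes per nonzero, which is why cached global memory could help this kernel.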
A limited OpenCL BLAS and conjugate gradient seem to be available from:
http://sourceforge.net/projects/openclblas/
I have to test it, though I remember seeing in a blog that it is Nvidia optimized..
Regarding OpenCL integer multiprecision, there is an excellent tutorial with code snippets here:
http://www.bealto.com/mp-gpu.html
Note he also benchmarks arithmetic kernels and memory performance with the latest 195 drivers (Nvidia only).

Later I will try to compile OpenMM 1.0 beta with OpenCL support and benchmark it (the molecular code kernel of Folding@Home).
Note the binaries provided are CUDA capable only: https://simtk.org/home/openmm

Note I'm providing the latest things I'm aware of..


News from the web!

Posted on 10:45 by Unknown

Before I forget:
* Cell development stopped.. No further Cell platforms
* Imagination working on OpenCL compilers..
* Magma 0.2 released without source code, not as originally intended, but:
Sources will be released soon. Hopefully in December.
* Nvidia 3D Vision supported on Linux! Quadro only (195.22)


OpenCL learning and tutorials!

Posted on 10:30 by Unknown

Well, there is a good compilation of existing documents/resources for learning OpenCL: see the multicoreinfo.com compilation.
Now there are new ones since it was released:
See
http://www.amd.com/us/products/technologies/stream-technology/Pages/training-resources.aspx

OpenCL™ and the ATI Stream SDK v2.0

OpenCL™ Tutorial: N-Body Simulation

PPAM 2009 GPU and OpenCL Tutorial
See OpenCL Architecture and Optimization on AMD GPUs (Behr)
for AMD-specific things..
Remember that Nvidia provides a better optimization guide..


Porting CUDA to OpenCL!

Posted on 10:22 by Unknown

So you want to port CUDA code to OpenCL: either you are in AMD's GPU competition for porting CUDA codes to OpenCL (see previous post), or you want to port some of your existing CUDA codes to OpenCL so they also work on AMD.
Here are some pointers to start:

1. Nvidia's jumpstart guide contains an example of porting a CUDA (driver API) vector add example to OpenCL. It examines in detail every API difference: memory setup, kernel launch, parameter setup, etc. It also shows some limitations of OpenCL vs CUDA that you might not have in mind from reading the OpenCL spec (notably the lack of pointers).
2. AMD provides, in its good article
"OpenCL™ and the ATI Stream SDK v2.0",
a section on porting CUDA to OpenCL:
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx#four
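As a small illustration of the kernel-side part of such a port, the sketch below rewrites the most common CUDA kernel keywords into their OpenCL C equivalents. The mapping table uses the standard names on both sides, but the helper itself (translate_kernel_source) is a hypothetical toy of mine, not something from either guide. It only covers the x dimension, and it does not add the __global qualifier that OpenCL requires on buffer pointer arguments or touch any host API code.

```python
import re

# CUDA kernel-language terms and their OpenCL C counterparts (x dimension only).
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
}

# Longest keys first so that no key is shadowed by a shorter alternative.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_OPENCL, key=len, reverse=True))
)


def translate_kernel_source(cuda_source: str) -> str:
    """Textually replace CUDA kernel keywords with OpenCL C equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_OPENCL[m.group(0)], cuda_source)


cuda_line = "int i = blockIdx.x * blockDim.x + threadIdx.x;"
print(translate_kernel_source(cuda_line))
```

In real OpenCL code the whole index expression is usually written as get_global_id(0); the literal translation is shown only to make the correspondence visible.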


GPU computing programming contests..

Posted on 10:05 by Unknown

Well, with all the well-known programming contests (Google Code Jam, ACM competitions, International Olympiad in Informatics, TopCoder, University of Valladolid ..)
now you want to measure your proficiency not only in solving complex problems but also in mapping them to extract all the performance of GPUs.. Well, there is one going on, and maybe some others coming.

Let's see:

1.
The first I'm aware of is a course's own competition between peers:
ECE 498 AL : Programming Massively Parallel Processors
http://courses.ece.illinois.edu/ece498/al/
In that course a sort competition was held, with the winner (kaatz) having a sort kernel faster than CUDPP's. I don't know how well it compares to the current fastest implementations in CUDPP 1.1 or the CUDA radixsort sample, but I believe the new ones are now better.
http://courses.ece.illinois.edu/ece498/al/HallOfFame.html

Some time ago there was:
2.
1st Annual UMD GPGPU Programming Contest
supposedly coming every year and awarding Nvidia GPUs as prizes..
The first one was good, aiming at fast sparse matrix-vector and matrix-matrix multiply routines. The winner was among the fastest implementations known at the time.
Now there is CUSP..
Note it has not been held this year..

3. Nvidia and TopCoder

Nvidia and TopCoder started a competition in September right before GTC and announced the winners at GTC. Right now it is the most impressive competition, since it is open to everyone and has good prizes ($10,000), good learning pointers (CUDA webinars), good hardware (entries get tested on a GT200 Tesla with 4GB of memory) and good problems.
The first was a component labelling problem..
I'm hosting the PDF files for future reference, and some example files (but not the 400 MB ones).
Now contest 2 is running from 23 November to 7 December, on a graph problem.
The only problem? No OpenCL right now, but CUDA is more mature..

4. AMD
Also note that at the 58xx launch around 10 September AMD made public its intention of starting GPU programming competitions. Two kinds: one like Nvidia's, solving challenging problems, and a second, also interesting, porting existing CUDA code to OpenCL so that it runs efficiently on their hardware, I suppose.
See the slide in the noticias3d.com review..

So to end: my next post talks about porting CUDA to OpenCL..


AMD 5xxx series overclocking..

Posted on 09:45 by Unknown

I'm lucky enough to have a 5850 and I want to get near or surpass 5870 performance with it, so say hello to overclocking.
Some weeks ago I overclocked using the latest AMD GPU Clocktool to push the clocks past what's allowed in Catalyst Control Center (up to 775 MHz core clock), and MSI Afterburner to raise the GPU voltage.
Basically I overclock from 725 to 950 MHz with overvolting to 1.25 V.
Now some issues to mention:
-MSI Afterburner would not overclock if more than one GPU is enabled (I have both Nvidia and AMD cards in Windows 7), at least not for me, so disable the Nvidia GPU or other GPUs in Device Manager.
You then overvolt and can re-enable them afterwards if you want.
I think you can overclock the clocks in AMD GPU Clocktool with more than one GPU enabled.
Now, with MSI Afterburner 1.40, I have the option of doing all the overclocking in it:
find MSIAfterburner.cfg and add:
[ATIADLHAL]
EnableUnofficialOverclocking = 1
This goes past the CCC (ADL limitation) clock limit of 775 MHz.

Now what about a permanent overclock, or one that works in Linux and, in the future, on Mac?
Use a patched BIOS.
There are two options: make the overclocked clocks the default, or raise the maximum overclock limits in CCC.
The safe way is the second.
CCC clocks are limited by the ADL library, which is in fact supported on Linux, so this second way is the best.
There is a modified 5850 BIOS on the Internet that allows overclocking up to 1500 core clock and 2250/2500 memory.
This BIOS doesn't change the default clocks or power management (well, of the latter I'm not sure).
Search for a 5850 overclocking article on geeks3d and from there follow the link to the BIOS.
It also covers the needed tools.
That would allow overclocking in Linux to the same clocks, using the Linux CCC.
But how to overvolt? Well, there is a tool using the ADL library for Linux:
ATI Overclocking Utility allows overvolting (theoretically) and also changing the fan speed if needed.


OpenCL on Apple: update!

Posted on 08:16 by Unknown

OpenCL on Apple, despite being the first implementation to ship to regular users,
has its set of peculiarities and limitations.
As said, although the Apple OpenCL implementation supports CPUs and GPUs, the backends are done by Apple and so differ from the implementations shipped by AMD and Nvidia.
The first limitation of the Apple implementation concerns hardware graphics drivers.
As you may know, high-end graphics cards normally ship there with a delay of more than 6 months, so for example there is no AMD 5xxx series support and it's not expected until at least mid 2010.

AMD OpenCL

As you may know, the AMD 5xxx series is the first GPU from AMD designed with GPU computing standards in mind: it has byte addressable stores, atomic instructions, general access to local mem, etc. All of this is lacking in the 4xxx series, which are currently the only supported AMD OpenCL cards.
These features have been standard in CUDA since G80, except atomics, and all of them are exposed as OpenCL extensions, so 4xxx OpenCL support is barely minimal (think OpenCL without the extensions that bring it up to CUDA functionality).
Regarding OpenCL updates: 10.6.1 shipped with no OpenCL updates, so there has been only one update, in 10.6.2.
This fixed several AMD bugs in Apple's own OpenCL demos, so 10.6.2 is the first release that at least has no issues with Apple's OpenCL sample code.
Regarding the AMD OpenCL implementation I have some questions:
In AMD's implementation images are not currently supported although the hardware supports them; I don't know if Apple's implementation has them enabled. Also, the 48xx series supports doubles in hardware and AMD's implementation currently lacks them too (expected during 2010), so I don't know if, and if so when, Apple's implementation is going to support them.
Also, what about getting binaries in Apple's implementation (AMD's currently can't)?
All of this will be answered in an upcoming post, once I have installed Snow Leopard 10.6.2 on a 48xx Mac.
So, summing up, I have these questions about Apple OpenCL on AMD:
-Doubles?
-Image support?
-Getting device kernel binaries?
-Do all Apple OpenCL demos run?
-Is local mem present, and if so, does it always go through global mem or does it use the limited hardware local mem when the kernel is simple enough? (not fixed in AMD's implementation)
For all of these the question is whether they are going to be supported, and when. I expect them to be supported over time by AMD's implementation on Windows and Linux.
Also don't expect all the other OpenCL extensions to be supported, as the hardware limits what can be supported.

Nvidia OpenCL

Nvidia support in Apple's OpenCL drivers was as good as Nvidia's own implementation in the early days (since May or June) until the beginning of this month.
In those days it supported OpenGL interop. Now Nvidia's implementation supports OpenGL interop and also doubles.
So what's going on with doubles on Mac?
I think they will be supported in upcoming releases.
I also think that in pre-195 drivers there were issues with local mem atomics (GT2xx stuff) not working. I would guess they now work, so is this issue present on Apple?
Also, the Nvidia OpenCL SDK has said since pre-10.6.2 times that some samples do not work. Unfortunately 10.6.2 was released and the OpenCL SDK in CUDA 3.0 beta still lists this as a limitation. Is it really fixed?

So Nvidia questions:
-Doubles?
-Local mem atomics work?
-Nvidia OpenCL samples ocl3dftd work?..

Apple backend

The release of the Apple OpenCL FFT library has shown (see readme.txt) that there are bugs in both Apple's runtime and its code generation for GPUs (all? which?) that prevent getting full performance from this kind of high performance code.
Luckily Apple is aware of the issues and has said that it has fixed them internally and recovered a 2x boost, nearly up to theoretical performance it seems.
Judging from the time of the release, these fixes will surely come after 10.6.2, so perhaps in 10.6.3.
When such codes run at full speed I would consider OpenCL finally mature on Mac.


Tuesday, 24 November 2009

State of the blog..

Posted on 02:55 by Unknown

Sorry, guys, for the delay in posting the code of the Apple OpenCL demos port.
The blog has gone without updates for more than 2 weeks in this fast-moving field, so expect a lot of news this week (lots from SC09).
News:
Source code coming this week with a Linux port!
Also trying to get a Windows port of the Apple OpenCL FFT code.
It seems that with Catalyst 9.11 final
you get Nbody working with AMD cards (previously rendering had to use Nvidia);
see readme.txt:
2. Nbody the opencl simulation works in GPU but not renders currently with ATI OpenGL driver
(works if you use OpenGL Nvidia renderer in Windows 7)
Also in my blog you can find a comment on how to fix the AES sample on Nvidia cards (it seems to be an Nvidia driver bug regarding the integer AND operation): replace
unsigned char hiBitSet = (a & 0x80);
with
unsigned char hiBitSet = ((a>127)?128:0);
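A quick way to convince yourself that the workaround is equivalent to the original line is to check every byte value on the CPU. The sketch below does that in Python. The xtime helper is my assumption about where the high bit is used (the usual AES multiply-by-x step in GF(2^8)); I have not copied it from the sample.

```python
def hi_bit_and(a: int) -> int:
    """Original form: mask out the top bit of a byte."""
    return a & 0x80


def hi_bit_compare(a: int) -> int:
    """Workaround form: same result via a comparison, no AND."""
    return 128 if a > 127 else 0


def xtime(a: int, hi_bit=hi_bit_compare) -> int:
    """Multiply a byte by x in GF(2^8), reducing by 0x1B when the top bit was set."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if hi_bit(a) else shifted


# Both forms must agree for all 256 possible unsigned char values.
mismatches = [a for a in range(256) if hi_bit_and(a) != hi_bit_compare(a)]
print(mismatches)
```

Since the two forms agree for every byte, any difference seen on the GPU has to come from the compiler or driver, not from the source change.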


Sunday, 8 November 2009

Places where OpenCL shines!

Posted on 09:38 by Unknown

If you read that post, a lot of my criticism relates to graphics things and also
to the current state of things, i.e. implementations without extensions (AMD?) are less capable than the DX Compute standard, and are also in a more alpha state.

But it's not all that bad for OpenCL..

Where OpenCL is better:

*Runs on CPUs!
i.e. more devices support it (PS3 Cell, CPUs)
*Mac OS and Linux support
*Doubles currently work on Nvidia GPUs, and on CPUs on Mac OS X. Doubles in compute shaders
don't currently work (see Voxilla's Mandelbrot thread). I expect this to be fixed in the December SDK.

*Local workgroups can be sized sensibly by the runtime
*Task support: although at present it's only useful for CPUs (IBM has an options sample using tasks and in-order queues (see events)).
Only for CPUs? Who knows: from current knowledge, current GPUs seem to be bad at it, but Larrabee would allow using it efficiently together with SIMD mode. Relatedly, a paper by Aila on raytracing
suggests that a queue model where tasks are taken from a queue at warp granularity (i.e. batches of 32 elements) can be good.
*Manage multiple devices at the same time, each with its own GL interop. I think a DX compute shader is more tied to one device, and in any case multi-GPU is more awkward to use in DX.
*Multiple command queues (does that concept exist in DX Compute?) to use async kernels and memory copies, with support for simultaneous kernels (hello Fermi) and dual memory transfers (hello Fermi again).
Queues can also be used out of order, meaning work can be taken in any order. OpenCL to DX Compute: I can, can you?
*Sync between the execution of multiple queues (and devices) via events.
In theory good for pipeline models and also for dependency graphs (see GRAMPS and Intel's work).
*Store kernel binaries in the manufacturer's format and ship them without recompilation. (I think you can store binaries in DX too, but the bad thing, if things haven't changed, is that they are stored in Microsoft's instruction format. It seems that
is also bad for manufacturers, since they decompile it and recompile with optimizations for their architectures.
As PTX and AMD IL are also supposed to work across hardware generations, they seem better suited to the drivers, requiring less work and possibly giving more performance.)
*DX and OGL interop
*Possibly better with time; at least it should improve in a more timely way than DX Compute, as OpenGL does with vendor extensions (Nvidia Fermi features?) (ATI 5xxx SAD, GDS, wave sync)

Anyway, know that OpenCL and DX Compute provide equal capability, above that of shaders (well, pre-DX11 shaders), and are also GPU vendor independent, so GPGPU programming with GLSL or Cg seems a waste of time, at least if it is not image related.
All have CUDA 1.0 functionality, i.e. scatter to mem (minus byte addressing in OpenCL) and workgroups with shared mem and sync across workgroups (well, at least in concept).


Running Optix with Geforce in Linux

Posted on 08:24 by Unknown

Running Optix with a Geforce in Linux is easy:
use a hexadecimal editor;
changing all "Quadro" to "Geforc" should do.
If not, search for something like Quadro.FX.CX and change it to GeforceGTX.
It worked!


New exciting soft and info coming this year!

Posted on 07:39 by Unknown

*PGI is to ship both Fortran for CUDA and the Accelerator model for Windows and Mac in PGI 9.1!
*CausticRT emulation SDK? I will have to compare it vs the
Optix SDK released some days ago,
from an API point of view..
Possibly make a wrapper..

*Intel Ct beta is coming this year, by SC09?

*New Intel Larrabee things will be known at this SC09
Rattner interview


Matmul bench for CUDA, CAL, and MultiCore CPUs!

Posted on 07:23 by Unknown

Test Nvidia, AMD GPUS and CPUs (CUDA, CAL, CPUs)

Key points of benchmark:

*Uses very high performance codes:
*Win x32,x64! (big win in x64 in multicore CPUs)
*Single precision and double precision!
*Test any size!
Download!

It also works on Wine! (at least it worked with the 32-bit binaries, with the Wine wrappers around CUDA and CAL I wrote a year ago, which I think I have sadly lost! Anyway, they were based on ideas from the CUDA Wine wrapper for Folding@home)
There were issues with the CPU test using only one thread, although I tried to force multithreading via the OpenMP and Intel MKL environment variables..

I also tested x64 with a CAL Wine64 wrapper, but it crashed..

Uses very high performance codes:

*For Nvidia GPUs it uses the fast CUBLAS matmul (Volkov's code)
*CAL matmul uses the AMD IL code from the AMD SDK
(a better one is known anyway: see the Beyond3D 1 Tflop matmul)
*Uses Intel MKL libs (hopefully, using new DLLs, you also get AVX acceleration (at least I hope so..))..

Also, assuming Larrabee executes normal executables and has a binary-compatible Intel MKL library, then
it would cover that too..

I wrote it last year,
for testing an 8800GT and a 4850!
Now with Windows 7 you can use it to test performance simultaneously on ATI and Nvidia GPUs and CPUs!

It still works; you only need to use HxD to change the DLL references from amdcal to atical.
That way I can remove the bundled DLLs, since they are part of the AMD driver, and I also get new GPU support automatically..
Tested on an ATI 5850..

Source code: sorry, guys, but I would have to search very hard to find it! It's old code..
Results:
Core i7 920
ATI 5850
GTX 275

Build info: X86 Release
Build date: Jul 31 2008 20:45:28
Machine arch: little endian.
Number of CPU cores: 8
Timer used: MULTIMEDIA TIMER
Freq. of TIMER: : 2.63554 Mhz.
CPU: 2716.93 Mhz.

Size: 4096

CAL tests:
=========

Float
Tam: 4096 Time running: 0.369012 s. Gflops: 744.902950 Gflops.
Testing CAL matmul double
Double
Tam: 2992 Time running: 0.483039 s. Gflops: 221.800549 Gflops.

CUDA tests:
==========

There is 1 device supporting CUDA
GPU 1: "GeForce GTX 275" SM:1.3
MP: 30 Cores: 240 Freq:1.40 GHz
Mem: 896Mbytes BW (Pinned) H2D: 2.65GB/s D2H:2.81GB/s D2D:55.3

CUDA floating code
Time running: 0.432000 s. Gflops: 318.106889 Gflops.
CUDA no IO floating code
Time running: 0.339000 s. Gflops: 405.375151 Gflops.
CUDA double code
Time running: 1.844000 s. Gflops: 74.523957 Gflops.
CUDA no IO floating code
Time running: 1.682000 s. Gflops: 81.701651 Gflops.

CPU tests:
==========

Intel MKL code
Time running: 2.573000 s. Gflops: 53.409318 Gflops.
Double
Time running: 5.825000 s. Gflops: 23.591790 Gflops.
OK.
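For reading these numbers: a dense N x N matmul is conventionally counted as 2*N^3 flops. The sketch below recomputes the rate from the printed times, assuming size 4096 for the CUDA and CPU runs (the log only prints "Size: 4096" once). The CUDA and CPU lines agree with 2*N^3/t up to timer rounding, while the two CAL lines come out at about twice that value. The log does not say why, so treat that as an open question rather than an error. The helper name is mine.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Conventional flop count for an n x n matmul (2*n^3) divided by run time."""
    return 2.0 * n**3 / seconds / 1e9


# (label, size, printed time in s, printed Gflops), copied from the log above.
reported = [
    ("CAL float", 4096, 0.369012, 744.902950),
    ("CAL double", 2992, 0.483039, 221.800549),
    ("CUDA float", 4096, 0.432000, 318.106889),
    ("CUDA double", 4096, 1.844000, 74.523957),
    ("MKL float", 4096, 2.573000, 53.409318),
    ("MKL double", 4096, 5.825000, 23.591790),
]

for label, n, seconds, printed in reported:
    recomputed = matmul_gflops(n, seconds)
    print(f"{label:12s} recomputed {recomputed:8.2f}  printed/recomputed {printed / recomputed:.3f}")
```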


More than 10 places where DX Compute 5.0 is better than OpenCL!

Posted on 05:30 by Unknown

Sorry, xbitlabs, but I don't think so..
Please understand that most of my thoughts are based on using one API vs the other today or in the near future, and
on some common sense about the companies involved..

The whole point is that, comparing bare OpenCL 1.0 and DX Compute, DX Compute wins in functionality and richness of features. Also, because of the potentially broad market and its use in next-gen blockbuster games, vendors will spend more time optimizing DX Compute drivers, at least in the early days, which is today. This is already true: ATI ships phenomenal DX Compute drivers for the 5xxx series.
Nvidia has made a great effort on CS 4.1, which in any case is still richer than OpenCL 1.0.
Also, as with OpenGL vs DX development today, there is the mad OpenGL situation of needing multiple rendering paths depending on extensions if you want to use the latest features, while DX has almost no cap bits.
In DX Compute not only can you use all that is available (which is more than OpenCL offers by default), it is also expected to be fast (see point 5).

Let's see why..

OpenCL, being targeted at HPC and scientific applications and being brought forward by a lot of companies, does not enable by default a lot of functionality that ships in old CUDA hardware (atomics, for example).
This is so that a lot of vendors can claim compliance: CPUs, GPUs (also S3), PS3 Cell (embedded profile), PowerVR SGX chips (embedded profile)..
Also, as there are non-GPU implementations too, a lot of graphics stuff is optional.

Anyway, with current extensions OpenCL regains a lot of the functionality not only of DX Compute but also of CUDA.
As always (I mean as with its OpenGL drivers) Nvidia has done good work, and today the 195 drivers for both Linux and Windows provide OpenCL with good performance and also atomics, byte addressable mem, graphics interop and double support.

Anyway, don't get pessimistic: the situation is similar today with OpenGL. There are major vendors shipping mediocre drivers (hi Intel), but if you stay with Nvidia (OK, AMD does quite well today too) you get OpenGL 3.2/3.1 drivers and a lot of other extensions, which today lets Nvidia ship functionality almost equivalent to DirectX 10.1. What about DirectX 11? Well, Nvidia has some functionality cooking in the 195 drivers, and I don't doubt that the same day Fermi is released they will have a lot of extensions bringing OpenGL to parity with Direct3D 11. That will also reach Linux users. In fact Nvidia has always done this, to the point that most of the Geforce 8800 launch demos were OpenGL ones using their proprietary DX10-class OpenGL extensions (which sooner or later, with some makeup, get into ARB or the GL core functionality).

Details next:

1. Graphics interop
=============
In DX Compute it is built in, with no extension needed (unlike cl_khr_gl_sharing), and both vendors ship it today.
In OpenCL only Nvidia has it; ATI is working on it.
Also, while it is supposed to work without copies in both APIs, I suspect that may not hold on some early OpenCL drivers. DX Compute, I feel, is well implemented by both vendors today.
The bad thing for DX Compute is that it only has DX interop, while OpenCL gets OpenGL interop and will get DX interop (at least supposedly on ATI).

2. Image support builtin
================

Using textures from CS is not an extension and both vendors ship it today.
In OpenCL only Nvidia has it today.

ATI is working on it..

3. CS (5?) can write to backbuffer directly
=========================
See Voxilla's Mandelbrot demo for example.
The best OpenCL can do is render to a renderbuffer or texture attached as a color buffer of an FBO, using OGL interop, which is an extension.
It then still needs a copy (framebuffer blit) to the backbuffer.

4. DX Compute has local and global atomics and byte addressable mem (RWByteAddressBuffer) by default (CS 5.0)
================================================================
Both vendors ship these today (well, Nvidia without atomics because they aren't in CS 4.1).
In OpenCL only Nvidia has them today.
ATI is working on it..

5. Local mem is always fast hardware local mem (shared mem on Nvidia, LDS on ATI)
=================================
CS 5.0 exposes general read/write access to local mem, as DX11 hardware has it (Nvidia has had it since G80, but ATI does not in 4xxx).
CS 4.1 exposes limited write ability matching the 4xxx, so the 4xxx can be used very fast in these specific cases.
In OpenCL on 4xxx, local mem is emulated using global mem, which can make programs, say, 10-20 times slower than on 5xxx, and can also fool the programmer into thinking that using local mem will make the code fast when it actually makes it slow (because of the double memory copies when buffering into local mem).

6. 3D writable textures (CS 5.0)
======================
Although 2D writable textures without copies were not in the CUDA world until May 2009 (well, in non-beta form), DX Compute comes with 3D writable textures, named RWTexture3D.
A guy named Voxilla has released a 3D wave equation solver using this functionality.
It reaches up to 250 fps on a 400^3 grid of floats, impressive stuff: 16 Gpoints/s, which with an FDTD stencil of 10 or 11 flops per point equals about 180 Gflops and, better, about 500 GB/s of bandwidth.
The bandwidth is so high because we are writing to memory that is cacheable and, better yet, tuned for spatial locality, at least 2D (not linear memory locality); global mem is not cached, at least in pre-Fermi or 5xxx days.
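The arithmetic behind those figures can be checked in a few lines. Only the 400^3 grid, the 250 fps, the 10 or 11 flops per point and the roughly 500 GB/s come from the text above; the function names are mine, and the bytes-per-point value is simply what the 500 GB/s figure implies, not something stated by the demo's author.

```python
def points_per_second(grid_side: int, fps: float) -> float:
    """Grid points updated per second for a cubic grid stepped once per frame."""
    return grid_side**3 * fps


def gflops(points_s: float, flops_per_point: float) -> float:
    return points_s * flops_per_point / 1e9


def implied_bytes_per_point(bandwidth_gb_s: float, points_s: float) -> float:
    """Memory traffic per grid point that a quoted bandwidth figure corresponds to."""
    return bandwidth_gb_s * 1e9 / points_s


rate = points_per_second(400, 250)            # 16e9 points/s
print(rate / 1e9, "Gpoints/s")
print(gflops(rate, 10), "to", gflops(rate, 11), "Gflops")
print(implied_bytes_per_point(500, rate), "bytes per point")
```

So 10 or 11 flops per point gives 160 to 176 Gflops, in line with the "about 180" above, and 500 GB/s works out to roughly eight 4-byte accesses per point.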
There is a trick by a user (which I will explain one day) that allows writing to 3D textures in CUDA, but it's not very efficient and possibly not future proof (the user reverse engineered how 3D textures are stored in memory and wrote code to read and write a specific position).
ATI is shipping it today..
OpenCL has this functionality as an optional extension, and I expect it to require DX11 hardware on GPUs. There is no support today (ATI is a bit late with OpenCL extensions and Nvidia has no DX11 hardware).
So we hope for it in Fermi time.. ATI will not get there earlier.

7. Built-in support for append/consume buffers
=================================

There is no extension for this in OpenCL yet.. AMD supposedly working on it?

8. Interop with shaders which can do scatter to textures
=====================================
This is an SM 5.0 killer feature: fragment shaders can write with random access to memory, which can be good, for example, for a performant and memory-efficient order-independent transparency.
In OpenCL you need OpenGL with some DX11-equivalent extensions (both AMD and Nvidia are working on them, and AMD ships them in current drivers but without documentation).

9. ATI has DX11 optimized vs OpenCL
==========================
Believe it or not, ATI has very good CS drivers, since games are coming, but its OpenCL drivers are rough..

10. Autovectorizing & MultiGPU
=====================

Shader compilers have autovectorization for pre-G80 cards and also for the 5-way SIMD in ATI cards. Right now this is not supported in AMD's OpenCL, where it would be needed for code to run at full speed.
It is also not there for AMD CPUs to use SSE, though.
Anyway, if I make the effort to write vector code (as in the old Cg days) I want to write it only once, so I have to test whether, say, float8 or float16, which would suit AVX and Larrabee, are also efficient on CPUs with SSE on the AMD platform and on AMD GPUs, i.e. whether throughput using float8 is half that of float4 and float16 a quarter. It also seems there is a compiler hint (I don't know whether the compiler reports the optimal vector size or the user directs the compiler), so I have to see how to write efficient, variable-width floatn code using info from/to the compiler.
IBM says the Cell SDK autovectorizes code within and across the work-items of a workgroup, which seems very good.
But note that by default Cell compiles to 1 item per workgroup unless you request otherwise in code; perhaps in this case all is lost.
Intel's upcoming OpenCL also seems to be able to autovectorize, as shader compilers do,
as well as using floatn.
ATI has expressed interest in getting there, but it seems at least 6-12 months away, if not 1-2 years,
judging by their OpenGL driver.
Nvidia is the luckiest in this respect, as OpenCL maps well to a scalar architecture in concept (learn about warps).
All this is said because I think DX Compute on AMD is using autovectorization, since I think it's done by the same DirectX driver team that is good at shader compilers. I have to benchmark the Nvidia DX Ocean demo on ATI, and test scalar and vector throughput kernels.
The last thing to note is multi-GPU: Anandtech showed the Nvidia Ocean demo using multi-GPU. I have to see whether it's using
2 devices for compute shaders explicitly or whether the drivers support it as they do for shaders.
Obviously it can't be general, at least not if more than one kernel is launched and memory changes, or if atomics (?) are used.

11. Guaranteed minimum values for shared mem size, workgroup size (?) and 3D grids
===========
I will have to check this, but I think DX Compute requires support for some minimum group size and also a minimum fast local mem size.
OpenCL requires no local mem (?) and workgroups can be of size 1 (at least on CPU, as Apple uses 1).
DX Compute requires 32 KB and some minimum size for groups and 3D grids.
This allows a single implementation that works on every DX11 device, while OpenCL programs need fallbacks if they are to run on CPU, for example.
Also, assuming for example that S3 uses 1 item per workgroup and has no local mem, what does it offer over pixel shaders? (hmm.. scatter to mem)

12. Simultaneous read and write of images?
===========================================
In OpenCL there are __read_only and __write_only. In DX? RWTexture?
At least SM 5.0 allows fragment shaders to do it, but similar to NV_texture_barrier,
plus mem scatter and global atomics, though there you lose local groups, shared mem and __syncthreads.


CUDA 3.0 has CUBLAS functions for MAGMA with complex types (SP DP)

Posted on 04:35 by Unknown

Posted in Magma forums:

Perhaps you don't know, but it seems Nvidia shipped a better CUBLAS to CULAtools, as they have had nearly all these extensions since 30 September!
You can grab the basic version and get cublas.dll / cublas.so with all the functions, but without a .lib for linking to it in Windows.
Anyway, with 3.0 you have these
functions that seem so useful:
* cublasCtrsm()
* cublasCtrmm()
* cublasCsyrk()
* cublasZtrsm()
* cublasZtrmm()
* cublasZsyrk()

It would be good if you could use them to release (at least in a build later than 14) with complex data types and single and double precision.
- Added the BLAS1 functions:
* cublasZaxpy()
* cublasZcopy()
* cublasZswap()
- Added the BLAS2 functions:
* cublasDtrmv()
* cublasCtrmv()
* cublasCgemv()
* cublasCgeru()
* cublasCgerc()
* cublasZtrmv()
* cublasZgemv()
* cublasZgeru()
* cublasZgerc()
- Added the BLAS3 functions:
* cublasCtrsm()
* cublasCtrmm()
* cublasCsyrk()
* cublasCsymm()
* cublasCherk()
* cublasZtrsm()
* cublasZtrmm()
* cublasZsyrk()
* cublasZsymm()
* cublasZherk()


Saturday, 7 November 2009

About IBM OpenCL

Posted on 09:12 by Unknown

IBM SDK

======

For Ps3 and Cell

*OpenCL 1.0 without extensions (32bit only).
*Includes OpenCL guide.
*Includes samples (julia,
*Includes a program to build kernel binaries.

Devices:
=====

1. Cell PPU (CL_DEVICE_TYPE_CPU): full profile (Power/VMX CPU)

CPU device global and local memory both map to system memory
FP denorms and fmad ieee2008 support

2. Cell SPU (CL_DEVICE_TYPE_ACCELERATOR): embedded profile

# of compute units on a SPU accelerator device is <=16.

Local memory size <= 256KB.

FP denorms,FP rtz,64-bit long/ulong,and fmad ieee2008 support
It is better for each work-item to access float4 data, for efficient DMA, __local use and SIMD instructions (multiples such as float16 also work).

Note: OpenCL utilizes all available SPUs to compose the SPU accelerator device.
Any external usage of SPUs will result in a CL_DEVICE_NOT_AVAILABLE failure at context creation.

Both devices support:

Device and native kernel execution
Compilers are available
Out-of-order command execution
Using built binaries without compiling

What about profiling?

Native kernel execution takes place on the CPU device for both supported OpenCL device types
i.e. in PPU for Cell.

Unsupported:
*OpenCL image objects
*OpenCL sampler objects
*Halfs and doubles (no extensions)
*No atomics
*No byte addressable stores
*No OpenGL interop of course

Workgroups:
*3D
*Overall 256 work-items per workgroup

The global memory size and maximum memory allocation size of both device types is dependent on the overall size and availability of system memory.

In addition to any vector code within an application's kernels, the OpenCL compiler may auto-vectorize code within a work-item, and across work-items within a work-group.

OpenCL on Cell
==========

Workloads that have simple control logic or high bandwidth regular access patterns map well to SPUs. These workloads map well to NDRanges because their access patterns are well defined. In addition to NDRanges, the OpenCL application programmer can create an OpenCL task that implements a double-buffering scheme by managing local store directly with async_work_group_copy() and __local variables. The SPU's ability to efficiently execute the OpenCL task and data parallel programming models allows it to run a wide range of workloads.
On CBEA systems, OpenCL applications should maximize the amount of work done on the accelerator device. Each SPU is a compute unit in the OpenCL accelerator device. The OpenCL runtime will schedule work-groups across and execute work items on the SPUs. The section Command Queues describes how best to structure an application for maximum concurrency.

SPUs
====

Each SPU has 256K of local storage that will be divided among the OpenCL kernel runtime, OpenCL program's text, __local variables and __private variables. An OpenCL program may contain one or more OpenCL kernels that share local storage. Kernels that require large amounts of local storage for __local and __private variables may have to reduce their work-group size because of the lack of local storage. These kernels should, instead, be separated into their own program so other kernels' resources do not limit their work-group size. However, kernels that do not require large amounts of local storage should be grouped together into the same program. OpenCL applications that group kernels together into a program may avoid unnecessary context-switching because all kernels in a program are loaded together.
Proper management of data flow into and out of the SPU is crucial to maximize performance. This includes managing local storage effectively by staging data whenever possible. The OpenCL runtime utilizes a software data cache that caches accesses to __global memory in local storage. When possible, it is preferable for a kernel to aggregate all loads for a work-group into a single async_work_group_copy() to a __local variable. This will improve performance by grouping all of the work-items' load latencies into one common load or store. The load will also be larger, making more efficient use of the DMA engine. If all accesses to __global memory are issued with async_work_group_copy() instead of direct access through the __global pointer, the software data cache will not be included, saving ~80KB of local storage.
An OpenCL task may also implement a double-buffering scheme. Two or more __local variables can be used as buffers to stage data. The kernel can then initiate an async_work_group_copy() into one buffer, then compute the results on the second buffer. async_work_group_copy() will use the SPU's DMA engine to copy data while the SPU's vector engine is free to operate on the second buffer. Double-buffering maximizes performance by keeping the compute engine busy by eliminating the need to wait on data transfers.
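
The double-buffering scheme described above can be sketched on the host in plain C. This is only an emulation: memcpy stands in for async_work_group_copy(), so nothing actually overlaps here, and the names CHUNK, sum_double_buffered and local_buf are hypothetical. What it shows is the buffer rotation: start fetching chunk i+1 into one buffer, then compute on chunk i in the other.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 64  /* hypothetical staging size, in floats */

/* Sums n floats from `global` by staging them through two local buffers. */
float sum_double_buffered(const float *global, size_t n)
{
    float local_buf[2][CHUNK];          /* stands in for two __local buffers */
    size_t chunks = (n + CHUNK - 1) / CHUNK;
    float total = 0.0f;

    if (chunks == 0)
        return 0.0f;

    /* Prime the pipeline: fetch chunk 0 into buffer 0. */
    size_t first = n < CHUNK ? n : CHUNK;
    memcpy(local_buf[0], global, first * sizeof(float));

    for (size_t i = 0; i < chunks; ++i) {
        size_t cur = i & 1;             /* buffer being computed on */
        size_t nxt = cur ^ 1;           /* buffer being filled */
        size_t cur_len = (i + 1 < chunks) ? CHUNK : n - i * CHUNK;

        /* Start the "transfer" of the next chunk before computing.
           On the SPU this would be the asynchronous DMA copy. */
        if (i + 1 < chunks) {
            size_t off = (i + 1) * CHUNK;
            size_t len = (n - off < CHUNK) ? n - off : CHUNK;
            memcpy(local_buf[nxt], global + off, len * sizeof(float));
        }

        /* Compute on the current buffer while the other one is "in flight". */
        for (size_t j = 0; j < cur_len; ++j)
            total += local_buf[cur][j];

        /* A real kernel would wait on the copy event here (wait_group_events). */
    }
    return total;
}
```

In an actual OpenCL task the two memcpy calls would become async_work_group_copy() calls into __local arrays, and the wait would sit just before the buffers swap roles.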
The OpenCL application should use __global memory buffers whose type's size is a multiple of a quad-word (16 bytes). For example, a kernel that operates sequentially on an array of floats should instead aggregate four floats together and operate on a float4 vector. This will allow the OpenCL compiler to map vector operations to the SPU's native vector types, and optimize its use of the SPU's DMA engine by eliminating alignment checks. Code that uses large vectors (that are a multiple of a quad-word, e.g. float16) is easier to read than hand unrolling loops. The large vectors will be automatically unrolled by the compiler to operate efficiently on the SPU's vector engines.
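
To make the float4 advice concrete, here is a rough plain-C analogy. The float4 struct below is a hypothetical host-side stand-in for the OpenCL built-in vector type (16 bytes on common platforms), and the function names are illustrative; in a real kernel the compiler would map the four-lane operation onto the SPU's vector unit.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's float4: four packed floats. */
typedef struct { float x, y, z, w; } float4;

/* Scalar version: one float per step. */
void scale_scalar(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        data[i] *= k;
}

/* Aggregated version: one quad-word (four floats) per step.
   n4 is the number of float4 elements, i.e. the float count divided by 4. */
void scale_float4(float4 *data, size_t n4, float k)
{
    for (size_t i = 0; i < n4; ++i) {
        data[i].x *= k;
        data[i].y *= k;
        data[i].z *= k;
        data[i].w *= k;
    }
}
```

Both functions compute the same result; the second one simply expresses the work in quad-word units, which is the shape the text recommends for __global buffers.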
If an application does not require strict IEEE mathematical compliance, the OpenCL kernels can be built with the -cl-fast-relaxed-math compile option. This will allow the compiler to include performance-optimizing code transformations like:
• Floating-point conditionals may be transformed such that strict compare ordering in the presence of NaNs may not be preserved.
• Floating-point divides may be transformed into a reciprocal-multiply.
• Software support of infinities and NaNs is omitted for half_divide and half_recip built-ins.
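
The reciprocal-multiply transformation mentioned in the list above is easy to picture in plain C. The two helper names are hypothetical; the point is only that the results can differ in the last bits, which is why the option is reserved for code that does not need strict IEEE behaviour.

```c
/* Exact form: a single, correctly rounded IEEE division. */
float div_exact(float a, float b)
{
    return a / b;
}

/* Relaxed form: roughly what a divide may become under -cl-fast-relaxed-math.
   Two roundings (reciprocal, then multiply) instead of one. */
float div_recip(float a, float b)
{
    float r = 1.0f / b;
    return a * r;
}
```

For divisors that are powers of two the two forms agree exactly; for something like 7.0f / 3.0f they may differ by a unit in the last place.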

CPU
----
On Power processors, the same hardware cache is used for __private, __local, __constant and __global variables. It is counterproductive to create __local variables to stage __global memory. Instead, the application should take special care to lay out __global memory so that the data for work-items in a work-group is cache friendly.

On the CPU device, async_work_group_copy() amounts to a host copy and is best avoided; prefetch() is not implemented.

CURRENT RESTRICTIONS
==================
The -cl-single-precision-constant and -cl-denorms-are-zero compilation options are ignored.

To use work-group sizes greater than 1, the reqd_work_group_size kernel attribute qualifier must be specified. Increasing the work-group size generally increases performance. Combining a work-group size greater than 1 with -cl-opt-disable results in an error.

No half datatype

OpenCL Build Program utility
===================
This utility will build a kernel binary from a user specified source, using the
       OpenCL APIs. A binary version of this utility is shipped in the OpenCL runtime
       rpm, and installed as /usr/bin/opencl_build_program.

       The utility will read in the source file (either from a filename or from
       stdin), get the list of Device IDs that match the request (default is
       CL_DEVICE_TYPE_DEFAULT), create an OpenCL Context for that device, create an
       OpenCL Program from the source specified, call to build the program for the
       devices, get the built binary data and save it to a file.

Usage: ./opencl_build_program [DEVICE] [OPTIONS...] [FILE]

       Build OpenCL kernel binary for specified device type from the specified FILE.
       With no FILE, or when FILE is -, source will be read from standard input.

        Device Types: (only specify one)

         -a, --accel              build for CL_DEVICE_TYPE_ACCELERATOR
         -c, --cpu                build for CL_DEVICE_TYPE_CPU
         -A, --all                build for CL_DEVICE_TYPE_ALL
         -d, --default            build for CL_DEVICE_TYPE_DEFAULT (default)

        Options:

         -f, --flags "options"    OpenCL build option flags (default: none)
         -o, --output             output binary filebase (default output _.ocl)
         -q, --quiet              no output (default: not quiet)
         -h, --help               display usage information and exit
$ cat kernel.cl
       __kernel void function(__global int *a)
       {
         int gid = get_global_id(0);
         a[gid] = 1;
       }
       $ ./opencl_build_program --cpu kernel.cl
       Binary built from source file "kernel.cl" for device "CPU Cell Broadband Engine, altivec supported" saved as file "kernel_CPU_Cell_Broadband_Engine,_altivec_supported.ocl"
ACCELERATOR CellBE processor
1
spu-xlcl kernel.c -o kernel.ocl -qcpluscmt -ma -qea32 -qnocrt -Wl,-entry=0 -Wl,-N -u _plugin_init
#

ACCELERATOR PowerXCell8i processor
1
spu-xlcl kernel.c -o kernel.ocl -qcpluscmt -ma -qea32 -qnocrt -Wl,-entry=0 -Wl,-N -u _plugin_init -qarch=edp -qtune=edp
#
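
/* Needs the OpenCL header (<OpenCL/opencl.h> on Mac OS X). */
#include <CL/cl.h>

/* Map an OpenCL error code to the name of its constant, for log messages. */
static inline const char *GetErrorString(int errcode) {
  switch (errcode) {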
    case CL_SUCCESS:
      return "CL_SUCCESS";
    case CL_DEVICE_NOT_FOUND:
      return "CL_DEVICE_NOT_FOUND";
    case CL_DEVICE_NOT_AVAILABLE:
      return "CL_DEVICE_NOT_AVAILABLE";
    case CL_COMPILER_NOT_AVAILABLE:
      return "CL_COMPILER_NOT_AVAILABLE";
    case CL_MEM_OBJECT_ALLOCATION_FAILURE:
      return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case CL_OUT_OF_RESOURCES:
      return "CL_OUT_OF_RESOURCES";
    case CL_OUT_OF_HOST_MEMORY:
      return "CL_OUT_OF_HOST_MEMORY";
    case CL_PROFILING_INFO_NOT_AVAILABLE:
      return "CL_PROFILING_INFO_NOT_AVAILABLE";
    case CL_MEM_COPY_OVERLAP:
      return "CL_MEM_COPY_OVERLAP";
    case CL_IMAGE_FORMAT_MISMATCH:
      return "CL_IMAGE_FORMAT_MISMATCH";
    case CL_IMAGE_FORMAT_NOT_SUPPORTED:
      return "CL_IMAGE_FORMAT_NOT_SUPPORTED";
    case CL_BUILD_PROGRAM_FAILURE:
      return "CL_BUILD_PROGRAM_FAILURE";
    case CL_INVALID_VALUE:
      return "CL_INVALID_VALUE";
    case CL_INVALID_DEVICE_TYPE:
      return "CL_INVALID_DEVICE_TYPE";
    case CL_INVALID_PLATFORM:
      return "CL_INVALID_PLATFORM";
    case CL_INVALID_DEVICE:
      return "CL_INVALID_DEVICE";
    case CL_INVALID_CONTEXT:
      return "CL_INVALID_CONTEXT";
    case CL_INVALID_QUEUE_PROPERTIES:
      return "CL_INVALID_QUEUE_PROPERTIES";
    case CL_INVALID_COMMAND_QUEUE:
      return "CL_INVALID_COMMAND_QUEUE";
    case CL_INVALID_HOST_PTR:
      return "CL_INVALID_HOST_PTR";
    case CL_INVALID_MEM_OBJECT:
      return "CL_INVALID_MEM_OBJECT";
    case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:
      return "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR";
    case CL_INVALID_IMAGE_SIZE:
      return "CL_INVALID_IMAGE_SIZE";
    case CL_INVALID_SAMPLER:
      return "CL_INVALID_SAMPLER";
    case CL_INVALID_BINARY:
      return "CL_INVALID_BINARY";
    case CL_INVALID_BUILD_OPTIONS:
      return "CL_INVALID_BUILD_OPTIONS";
    case CL_INVALID_PROGRAM:
      return "CL_INVALID_PROGRAM";
    case CL_INVALID_PROGRAM_EXECUTABLE:
      return "CL_INVALID_PROGRAM_EXECUTABLE";
    case CL_INVALID_KERNEL_NAME:
      return "CL_INVALID_KERNEL_NAME";
    case CL_INVALID_KERNEL_DEFINITION:
      return "CL_INVALID_KERNEL_DEFINITION";
    case CL_INVALID_KERNEL:
      return "CL_INVALID_KERNEL";
    case CL_INVALID_ARG_INDEX:
      return "CL_INVALID_ARG_INDEX";
    case CL_INVALID_ARG_VALUE:
      return "CL_INVALID_ARG_VALUE";
    case CL_INVALID_ARG_SIZE:
      return "CL_INVALID_ARG_SIZE";
    case CL_INVALID_KERNEL_ARGS:
      return "CL_INVALID_KERNEL_ARGS";
    case CL_INVALID_WORK_DIMENSION:
      return "CL_INVALID_WORK_DIMENSION";
    case CL_INVALID_WORK_GROUP_SIZE:
      return "CL_INVALID_WORK_GROUP_SIZE";
    case CL_INVALID_WORK_ITEM_SIZE:
      return "CL_INVALID_WORK_ITEM_SIZE";
    case CL_INVALID_GLOBAL_OFFSET:
      return "CL_INVALID_GLOBAL_OFFSET";
    case CL_INVALID_EVENT_WAIT_LIST:
      return "CL_INVALID_EVENT_WAIT_LIST";
    case CL_INVALID_EVENT:
      return "CL_INVALID_EVENT";
    case CL_INVALID_OPERATION:
      return "CL_INVALID_OPERATION";
    case CL_INVALID_GL_OBJECT:
      return "CL_INVALID_GL_OBJECT";
    case CL_INVALID_BUFFER_SIZE:
      return "CL_INVALID_BUFFER_SIZE";
    case CL_INVALID_MIP_LEVEL:
      return "CL_INVALID_MIP_LEVEL";
    case CL_INVALID_GLOBAL_WORK_SIZE:
      return "CL_INVALID_GLOBAL_WORK_SIZE";
    default:
      return "Unknown";
  }
}


OpenGL interop perf in CUDA and OCL in Linux

Posted on 08:26 by Unknown

Setup: driver 195.17, GTX 275.

I attach the code of the CUDA volumeRender sample using the new CUDA 3.0 OpenGL interop at 1024x1024:
http://dl.dropbox.com/u/1416327/volumeRender.cpp
See newapi in the source for the changes.

OCL
===
1680x1050, fullscreen (see the post on enabling interop on Linux):
oclSimpleGL (interop enabled): 480 fps
oclSimpleGL (interop disabled): 280 fps

CUDA vs OCL
=========
CUDA volumeRender: 108 fps
CUDA volumeRender, 1024x1024: 40 fps
CUDA volumeRender, 1024x1024, new GL interop: 40 fps
OCL oclVolume: 180 fps
OCL at 1024x1024: does not work

As a bonus, nbody performance (fullscreen):
OCL: 23.0
GL: 25.3


Fraps-like tools for Linux and for Windows DX11!

Posted on 05:06 by Unknown

http://the-hydra.blogspot.com/2007/12/finding-frame-rate-tester-for-linux.html

graps

http://blog.smr.co.in/linux/graps-capture-opengl-frames/

graps is a tool for grabbing frames from an OpenGL application without editing the application. graps can be used for making video from an OpenGL application. graps can also be useful for monitoring the fps of an OpenGL application.
graps writes the glReadPixels data to a file on each glXSwapBuffers call. graps uses the power of LD_PRELOAD to run without modifying the application.
graps is a dirty Fraps alternative for screen and realtime video capturing and fps monitoring (hence benchmarking).
graps can be downloaded from here. See readme.txt for help.
Update1 [08.09.2007] :
- recording to video [ using mencoder ]. Thanks to Mineral for hints (see comments).
- grapsctl, separated frame capturing and writing to image/video.
- bug fixes : ppm header (thanks matt) and width*width (thanks Mineral).

- graps v0.2 release.

Opinion
I was not able to use the run.sh file as shipped, so I modified it: I deleted the -a check for the .so and exported GRAPSPATH pointing to my directory, so it does not need to be defined anywhere else.
Basically it is:
export LD_PRELOAD=$HOME/graps/libgraps.so
Rename the script to rungraps.sh and copy it to /usr/bin.
Also copy grapsrc to $HOME/.grapsrc
Good:
configurable
does not alter GL execution; everything is in text mode
Bad:
needs .grapsrc
needs the sh file changed and copied to bin
also, the fps is shown in text mode only

Anandtech Frame-Getter

"The program computes frames per second for an application that uses OpenGL or SDL. It also takes screenshots periodically, and creates an overlay to display the current FPS/time.

"This is accomplished by defining a custom SwapBuffers function. For executables that are linked to GL at compile time, the LD_PRELOAD environment variable is used to invoke the custom SwapBuffers function. For executables that use run-time linking - which seems to be the case for most games - a copy of the binary is made, and all references to libGL and the original glXSwapBuffers function are replaced by references to our library and the custom SwapBuffers function. A similar procedure is done for SDL. We can then do all calculations on the frame buffer or simply dump the frame at will."

http://www.anandtech.com/linux/showdoc.aspx?i=2229&p=2

http://www.anandtech.com/linux/showdoc.aspx?i=2218

http://freshmeat.net/projects/libglfps/
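
The frame-counting half of such a SwapBuffers wrapper is small enough to sketch. The code below is not taken from Frame-Getter or graps; fps_counter, fps_init and fps_tick are hypothetical names, and the caller is assumed to pass a monotonic timestamp in seconds each time the wrapped swap function runs.

```c
#include <assert.h>

/* Hypothetical state kept by a swap-buffers wrapper. */
typedef struct {
    double window_start;   /* timestamp at which the current window began */
    unsigned frames;       /* frames seen in the current window */
    double last_fps;       /* most recent completed measurement */
} fps_counter;

void fps_init(fps_counter *c, double now)
{
    c->window_start = now;
    c->frames = 0;
    c->last_fps = 0.0;
}

/* Call once per swap. Returns 1 when a new fps value was produced
   (at least one second elapsed since the window began), 0 otherwise. */
int fps_tick(fps_counter *c, double now)
{
    assert(now >= c->window_start);    /* timestamps must not go backwards */
    c->frames++;
    double elapsed = now - c->window_start;
    if (elapsed < 1.0)
        return 0;
    c->last_fps = c->frames / elapsed;
    c->window_start = now;             /* start the next measurement window */
    c->frames = 0;
    return 1;
}
```

A preloaded library would call fps_tick() from its replacement glXSwapBuffers and then forward to the real function; that hooking part is left out here.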

Anandtech Frame-Getter notes:
It captures a screenshot every two seconds, and the corner where the fps is shown is not selectable.
I modified the source to display the fps in the lower-left corner:
//glTranslatef(4, g_frameData.viewPort[3] - GLUT_STROKE_HEIGHT * GLUT_STROKE_SCALE - 4, 0);
glTranslatef(4, GLUT_STROKE_HEIGHT * GLUT_STROKE_SCALE - 4, 0);
and commented out:
//WriteFps();
//TakeScreenShot(useSDL);
All of these changes are in framedata.cpp.
Good:
the fps is drawn in GL

Using them:

export LD_PRELOAD=$HOME/anandtech-framegetter-0.1.0-src/bin/libFG.so
./oclNbody
export LD_PRELOAD=$HOME/graps/libgraps.so
./oclNbody

glc

Highlights

The complete source code is available and licenced under zlib-style licence.
Thread-based architecture takes full advantage of multicore-CPUs.
Support for multiple simultaneous audio and video streams.
Reads frames asynchronously from GPU using GL_ARB_pixel_buffer_object extension.
Does enforce fps cap only in the captured stream.
If the application can play audio using ALSA, glc can record it regardless of sound card's capabilities.
Support for recording voice to a separate audio stream.
Stores color correction information and applies it at playback.
Minimal application overhead (eg. slow HDD does not slow program down).
Fast arbitrary ratio video scaling with bilinear filtering.
Does colorspace conversion to Y'CbCr 420jpeg which cuts stream size in half.
Compresses stream with lightweight LZO or QuickLZ compression which saves additional 40%-60%.

http://kronixcb.blogspot.com/2008/02/glc-fraps-for-linux-how-to-install-glc.html

For Windows
Fraps 3.0
D3DGear


OpenCL/OpenGL Linux interop! Seen in the OpenCL CUDA 3.0 SDK samples

Posted on 03:51 by Unknown

Following up on my OpenCL/OpenGL interop work on Windows: interop has now come to Linux for Nvidia GPU computing registered developers via the 195.17 driver!
The key to enabling it can be found in the simpleGL sample:
cl_context_properties props[] = {
CL_GL_CONTEXT_KHR,
(cl_context_properties)glXGetCurrentContext(),
CL_GLX_DISPLAY_KHR,
(cl_context_properties)glXGetCurrentDisplay(),
CL_CONTEXT_PLATFORM,
(cl_context_properties)cpPlatform, 0
};
Remember that on Windows it is:
cl_context_properties akProperties[] = {
CL_GL_CONTEXT_KHR,
(cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR,
(cl_context_properties)wglGetCurrentDC(), 0
};


AMD OpenCL forums (I)

Posted on 02:36 by Unknown

* For me both "-cl-opt-disable" and "-op-disable" are accepted, but neither option actually disables kernel optimizations.

* Accesses to memory through a volatile-qualified pointer are optimized out by the compiler (only the first access is generated in the IL/ISA; subsequent reads from the same address are removed).
Sample code:
global volatile float4* v;
a1 = v[0]; // this read is generated
a2 = v[0]; // optimized out
a3 = v[0]; // optimized out

* Threads in OpenCL are not tiled but blocked linear.
Linear is sequential along a single line; blocked linear is sequential within a block, and the blocks themselves follow one another in sequence. Tiled mode usually follows a hierarchical Z pattern.
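
The three layouts can be compared with a little index arithmetic. This is a sketch under assumptions: the forum post gives no block dimensions, so BLOCK_W and BLOCK_H are hypothetical, the width is assumed to be a multiple of BLOCK_W, and the Z pattern is modelled as plain Morton order, which may not match what the hardware actually does.

```c
#include <assert.h>

#define BLOCK_W 4   /* hypothetical block width  */
#define BLOCK_H 4   /* hypothetical block height */

/* Linear: sequential along each row. */
unsigned linear_index(unsigned x, unsigned y, unsigned width)
{
    return y * width + x;
}

/* Blocked linear: blocks are laid out in sequence, row-major within each block. */
unsigned blocked_linear_index(unsigned x, unsigned y, unsigned width)
{
    assert(width % BLOCK_W == 0);
    unsigned blocks_per_row = width / BLOCK_W;
    unsigned block = (y / BLOCK_H) * blocks_per_row + (x / BLOCK_W);
    unsigned within = (y % BLOCK_H) * BLOCK_W + (x % BLOCK_W);
    return block * (BLOCK_W * BLOCK_H) + within;
}

/* Tiled (Z pattern): interleave the bits of x and y (Morton order).
   Handles coordinates that fit in 16 bits. */
unsigned z_order_index(unsigned x, unsigned y)
{
    unsigned idx = 0;
    for (unsigned bit = 0; bit < 16; ++bit) {
        idx |= ((x >> bit) & 1u) << (2 * bit);
        idx |= ((y >> bit) & 1u) << (2 * bit + 1);
    }
    return idx;
}
```

With a width of 8, the element at (5, 1) lands at offset 13 in the linear layout and at offset 21 in the blocked linear one, since it sits in the second 4x4 block.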


About CUDA 3.0 (II)

Posted on 02:36 by Unknown

* Includes OpenCL profiler 1.1

New Fermi features:

surface functions (read/writable textures):
__device__ __surf{1D,2D}{read,write}{s,u,c}{1,2,4}
__device__ __surf{1D,2D}{read,write}l{1,2}

c=char
u=uint
s=ushort

Where are 3D surfaces, i.e. 3D writable textures?

device functions:

extern __device__ void                   __threadfence_system(void);
extern __device__ double                __ddiv_rn(double, double);
extern __device__ double                __ddiv_rz(double, double);
extern __device__ double                __ddiv_ru(double, double);
extern __device__ double                __ddiv_rd(double, double);
extern __device__ double                __drcp_rn(double);
extern __device__ double                __drcp_rz(double);
extern __device__ double                __drcp_ru(double);
extern __device__ double                __drcp_rd(double);
extern __device__ double                __dsqrt_rn(double);
extern __device__ double                __dsqrt_rz(double);
extern __device__ double                __dsqrt_ru(double);
extern __device__ double                __dsqrt_rd(double);

extern __device__ unsigned int          __ballot(int);
extern __device__ int                   __syncthreads_count(int);
extern __device__ int                   __syncthreads_and(int);
extern __device__ int                   __syncthreads_or(int);

extern __device__ long long int         clock64(void);

extern __device__ float                 __fmaf_ieee_rn(float, float, float);
extern __device__ float                 __fmaf_ieee_rz(float, float, float);
extern __device__ float                 __fmaf_ieee_ru(float, float, float);
extern __device__ float                 __fmaf_ieee_rd(float, float, float);
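
For readers without Fermi hardware, the semantics of two of the new intrinsics can be mimicked on the host. This is plain C, not CUDA: a warp is modelled as an array of 32 predicate values, a block as an array of arbitrary length, and the emulate_ names are hypothetical.

```c
#include <stddef.h>

#define WARP_SIZE 32

/* Host-side model of __ballot(): bit N of the result is set when
   lane N of the warp passed a non-zero predicate. */
unsigned emulate_ballot(const int predicate[WARP_SIZE])
{
    unsigned mask = 0;
    for (unsigned lane = 0; lane < WARP_SIZE; ++lane)
        if (predicate[lane] != 0)
            mask |= 1u << lane;
    return mask;
}

/* Host-side model of __syncthreads_count(): the number of threads in the
   block whose predicate is non-zero (the barrier itself is not modelled). */
int emulate_syncthreads_count(const int *predicate, size_t threads)
{
    int count = 0;
    for (size_t t = 0; t < threads; ++t)
        if (predicate[t] != 0)
            ++count;
    return count;
}
```

On the device, each thread would pass its own predicate and receive the combined result; the arrays here only stand in for the per-thread values.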

Key changes in version cudaprof v3.0 beta with respect to v2.3:
1) New counters "NOP Triggers" are added in "Session Settings" Dialog on
   the "Profiler counters" tab

2) New memory copy option "host mem transfer type" is added in "Session Settings"
   dialog on "Other Options" tab. This specifies whether a memory transfer uses
   "Pageable" or "Page-locked" memory.

3) Device level summary plot :
   One bar for each method is there. Bars are sorted in decreasing gpu time. Bar length
   is proportional to cumulative gputime for a method across all contexts for a device.

4) Session level summary plot :
   One bar for each device is there. Bar length is proportional to Gpu Utilization.
   Gpu Utilization is the proportion of time when gpu was actually executing some method
   to total time interval from gpu start to end. The values are presented in percentage.

5) User interface changes:
   "Session Settings" Dialog :
   a) Added a new device selection option on "Session" tab.
      Based on this option the available counters can be selected on "Profiler Counter" tab.
      In case of "multi-device" only counters supported by all devices can be selected.
   b) All the counters on "Profiler Counter" tab and options on "Other Options" tab are shown
      in tree view under different groups.
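
The GPU utilization figure from item 4 is a simple ratio, sketched below in plain C. The gpu_interval struct and the function name are hypothetical, the intervals are assumed not to overlap, and the real profiler may compute the value differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one method execution on the GPU, in microseconds. */
typedef struct { double start_us, end_us; } gpu_interval;

/* Utilization = busy time / (last end - first start), as a percentage.
   Assumes at least one interval and that intervals do not overlap. */
double gpu_utilization_percent(const gpu_interval *iv, size_t n)
{
    assert(n > 0);
    double busy = 0.0;
    double first = iv[0].start_us, last = iv[0].end_us;
    for (size_t i = 0; i < n; ++i) {
        busy += iv[i].end_us - iv[i].start_us;
        if (iv[i].start_us < first) first = iv[i].start_us;
        if (iv[i].end_us > last)    last = iv[i].end_us;
    }
    double span = last - first;
    return span > 0.0 ? 100.0 * busy / span : 0.0;
}
```

Two methods running from 0 to 25 and from 50 to 100 microseconds keep the GPU busy for 75 of 100 microseconds, so the function reports 75 percent.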
