GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Thursday, 26 November 2009

News from OpenCL forums!

Posted on 02:21 by Unknown
1. AMD's implementation seems to avoid running CPU kernels and GPU kernels simultaneously, even when they are enqueued asynchronously on different queues. Execution appears to be serialized, which, if true, can be considered a performance issue.
From the forum thread "Possible to run OpenCL code on GPU and CPU concurrently?" the answer seems to be no.
Further information: I checked CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END for each kernel, and the second kernel (the CPU kernel) is indeed waiting until the first kernel (the GPU kernel) finishes before it gets submitted. Both end up in the queue immediately (and there are two command queues), but the second doesn't get submitted until the first finishes.
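
The check described in that post can be sketched in plain C. This is only an illustration: the kernel_times struct and the helper names are hypothetical, the four fields are assumed to hold the nanosecond values returned by clGetEventProfilingInfo for the four CL_PROFILING_COMMAND_* queries, and it assumes the timestamps of both devices can be compared on one time base, which may not hold on every implementation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical container for the four profiling timestamps (nanoseconds)
   of one kernel event: QUEUED, SUBMIT, START and END. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} kernel_times;

/* True if the [start, end) execution intervals of the two kernels intersect. */
bool executions_overlap(kernel_times a, kernel_times b)
{
    return a.start < b.end && b.start < a.end;
}

/* True if `second` was only submitted once `first` had finished,
   which is the serialization pattern reported in the forum post. */
bool submitted_after_finish(kernel_times first, kernel_times second)
{
    return second.submit >= first.end;
}

/* Nanoseconds a command spent between being queued and being submitted. */
uint64_t queue_wait_ns(kernel_times t)
{
    return t.submit - t.queued;
}
```

If submitted_after_finish() holds for the GPU/CPU pair while both queued values are close together, the measurements match the behaviour described above: both commands are queued at once, but the second waits for the first.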

2. AMD's samples are noted as not optimized, but minor tweaks can give more performance:
I thought I'd post a little update on this. Once I delved into the code a bit more, I found that the default block size was 8. Once I changed this (and once I modified the code so it didn't give me an error that it was set too high), many of the examples run much faster on the gpu than before.

In another thread it was suggested that the group size should be equal to the wavefront size, which is 64 for the 48xx and 58xx.
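
A minimal sketch of how that suggestion might be applied when choosing work sizes. The helper names are hypothetical, and the value 64 is taken from the thread quoted above rather than queried from the device. Since the global work size has to be divisible by the local work size, the global size is padded up and the kernel is assumed to skip work-items whose id is past the real element count.

```c
#include <stddef.h>

/* Wavefront width assumed from the forum thread (HD 48xx / 58xx). */
#define WAVEFRONT_SIZE 64

/* Round n up to the next multiple of m (m must be non-zero). */
size_t round_up_to_multiple(size_t n, size_t m)
{
    return ((n + m - 1) / m) * m;
}

/* Global work size to enqueue for `elements` items when the local
   work size is one wavefront. */
size_t padded_global_size(size_t elements)
{
    return round_up_to_multiple(elements, WAVEFRONT_SIZE);
}
```

For example, 1000 elements would be enqueued with a global size of 1024 and a local size of 64.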

3. Transfer bandwidth looks like a performance issue on AMD. By contrast, the OpenCL bandwidth test from the Nvidia SDK reports high bandwidth for me (I use mapped mode; it may be so good because I own a Nehalem, and bandwidth is also good on Nvidia without using pinned memory). That holds only for d2h and h2d in this sample; d2d seems not so good.

I use clEnqueueRead/WriteBuffer in blocking mode on a Radeon HD 5750.
But write throughput is lower than the result of PCIeSpeedTest (ATI Stream Power Toys).
And read throughput is much lower than write throughput. Why?

Test pseudocode:
size = 1024*1024*64;
NUM_TIMING_LOOPS = 100;
buf = clCreateBuffer(context,CL_MEM_READ_WRITE,size,NULL,&errcode);
stopwatch.start (); // use PerformanceCounter
for (int i = 0; i < NUM_TIMING_LOOPS; i ++)
    clEnqueueWriteBuffer(queue,buf,CL_TRUE,0,size,ptr,0,NULL,NULL);
stopwatch.stop ();
printf (...);

Result:
write: 2.575GB/s
read: 1.197GB/s

PCIeSpeedTestResult (v0.2):
[ 67108864 bytes] CPU->GPU= 4.851 GB/sec, GPU->CPU= 861.791 MB/sec
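
The printf (...) step is elided in the post, so the following is only a guess at the arithmetic behind the reported figures. The function name is hypothetical, and it assumes 1 GB = 10^9 bytes; the original test may have used 2^30 instead.

```c
#include <stdint.h>

/* Throughput in GB/s for `transfers` blocking transfers of
   `bytes_per_transfer` bytes that took `elapsed_seconds` in total.
   Assumption: 1 GB = 1e9 bytes. */
double throughput_gb_per_s(uint64_t bytes_per_transfer,
                           unsigned transfers,
                           double elapsed_seconds)
{
    double total_bytes = (double)bytes_per_transfer * (double)transfers;
    return total_bytes / elapsed_seconds / 1e9;
}
```

With size = 1024*1024*64 = 67108864 bytes and NUM_TIMING_LOOPS = 100, the loop moves about 6.7 GB in total, so the elapsed time of the whole loop is what determines the result.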

Confirmation of OpenCL perf issue:
This is because of the difference in implementation of PCIeSpeedTest and OpenCL. The PCIe Speedtest goes directly to pinned memory while the OpenCL version copies to PCIe and then to the user memory. We are working on a more optimized path that can avoid this copy under certain conditions in a future release.

4. Nvidia provides an OpenCL visual profiler, and AMD is working on similar tools:
We'll be providing an MSVS-integrated profiler that will be capable of reporting the profiling counters in the next release. In the next few months, we'll also provide a Stream Kernel Analyzer that will accept OpenCL C for static analysis of your kernels.
Meanwhile, use the solution from my first post on this blog to get kernels as AMD IL code.

5. printf works in CPU kernels with the Linux OpenCL backend (Apple has a similar debug extension).
DUMP
Yes, printf is currently supported only on the CPU device, as there is no standard library in OpenCL that contains a printf function for the GPU, so it is not valid on every device. This is stated in 6.8.f of the OpenCL 1.0 spec. Apple does support printf in the kernel as a standard debug strategy when using the CPU device.
Here is what I did to get printf to work within my kernels (I am using OpenSUSE though). I just put the stdio.h file in my working directory
Code:
const char *header = "-I stdio.h\0";
err = clBuildProgram(program, 1, devices, header, NULL, NULL);

On the GPU? I ask because I have 9.9 and it works on the CPU too.
On the GPU, of course... 9.9 didn't support OpenCL; it seems 9.11 does.

6. AMD OpenCL doesn't work with MinGW.

7. AMD reports R6xx cards as supported OpenCL devices, although execution on them then fails:

Profiling : Yes
Platform ID: 00000000
Name: ATI RV610
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.467
Profile: FULL_PROFILE
Version: OpenCL 1.0 ATI-Stream-v2.0-beta4
Extensions:
Thanks for reporting this. The 6XX series of cards do not have the required hardware to execute OpenCL kernels, so this should not have been displayed as available for execution.

8. An example of a three-pass reduction using shared registers. The poster had problems getting it to work; the key issue seems to be:
Shared register not updated as it ought to be.
Answer:
Just went through our documentation. One very important piece of information is left out that will fix your problems. Access to shared registers is only atomic if done in a single instruction.

i.e.
iadd sr0, sr0, sr1
is correct, but
mov r0, sr0
mov r1, sr1
iadd r2, r0, r1
mov sr0, r2
is incorrect because of the even/odd wavefront issue.
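
Why the multi-instruction version can lose updates can be illustrated with a single-threaded model in plain C. This is an analogy only, not AMD IL semantics: the function names are hypothetical, and the interleaving order of the two wavefronts is an assumption chosen to show the problem, since the forum answer does not describe the actual even/odd scheduling.

```c
#include <stdint.h>

/* One step: the shared value is read, modified and written back together,
   in the spirit of `iadd sr0, sr0, sr1`. */
void add_in_one_step(int32_t *shared, int32_t contribution)
{
    *shared += contribution;
}

/* Two wavefronts each apply the one-step add; no update is lost. */
int32_t run_single_instruction(int32_t initial, int32_t a, int32_t b)
{
    int32_t shared = initial;
    add_in_one_step(&shared, a);   /* wavefront A */
    add_in_one_step(&shared, b);   /* wavefront B */
    return shared;
}

/* Two wavefronts interleave a copy/add/write-back sequence. Both copy the
   shared value before either writes back, so A's update is overwritten. */
int32_t run_interleaved_copy_add_store(int32_t initial, int32_t a, int32_t b)
{
    int32_t shared = initial;
    int32_t copy_a = shared;       /* wavefront A: mov r0, sr0 */
    int32_t copy_b = shared;       /* wavefront B: mov r0, sr0 */
    copy_a += a;                   /* wavefront A: iadd */
    copy_b += b;                   /* wavefront B: iadd */
    shared = copy_a;               /* wavefront A: mov sr0, r2 */
    shared = copy_b;               /* wavefront B: mov sr0, r2, A's add is lost */
    return shared;
}
```

Starting from 0 with contributions 3 and 4, the one-step version ends at 7, while the interleaved version ends at 4 because the first write-back is overwritten.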