1. The AMD implementation seems to avoid running CPU kernels and GPU kernels simultaneously, even when they are enqueued asynchronously on different queues. Execution appears to be serialized, which, if true, can be considered a performance issue.
Searching the forums turns up this thread:
Possible to run OpenCL code on GPU and CPU concurrently?
It seems not.
Further information: I checked the
CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END
for each kernel, and the second kernel (the CPU kernel) is indeed waiting until the first kernel (the GPU kernel) finishes before it gets submitted. Both end up in the queue immediately (and there are two command queues), but the second doesn't get submitted until the first finishes.
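The check described in the quote can be sketched in plain C. This is a minimal sketch, assuming the four timestamps of each kernel have already been read with clGetEventProfilingInfo; the struct and function names (cmd_times, ran_concurrently, submit_waited_for, queue_delay_ns) are hypothetical and not part of OpenCL.

```c
#include <assert.h>
#include <stdint.h>

/* Profiling timestamps for one command, in nanoseconds.
   Assumed to have been filled from clGetEventProfilingInfo using the
   four CL_PROFILING_COMMAND_* queries. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} cmd_times;

/* Returns 1 if the [start, end) execution intervals of a and b overlap. */
int ran_concurrently(const cmd_times *a, const cmd_times *b)
{
    return a->start < b->end && b->start < a->end;
}

/* Returns 1 if b was not submitted until a had finished, which is the
   pattern reported in the forum post. */
int submit_waited_for(const cmd_times *a, const cmd_times *b)
{
    return b->submit >= a->end;
}

/* Time a command spent between being queued and being submitted. */
uint64_t queue_delay_ns(const cmd_times *t)
{
    assert(t->submit >= t->queued);
    return t->submit - t->queued;
}
```

If submit_waited_for reports 1 for the GPU and CPU kernels while both were queued almost at the same time, the two queues were effectively serialized.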
2. The AMD samples are noted not to be optimized, but minor tweaks can yield more performance:
I thought I'd post a little update on this. Once I delved into the code a bit more, I found that the default block size was 8. Once I changed this (and once I modified the code so it didn't give me an error that it was set too high), many of the examples run much faster on the gpu than before.
In another thread it was suggested that the work-group size should be equal to the wavefront size, which is 64 for the 48xx and 58xx.
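When the group size is raised to 64, the global work size has to be a multiple of it, because OpenCL 1.0 rejects a global size that the local size does not divide evenly. A minimal sketch of the rounding, with hypothetical helper names (round_up_global_size, num_work_groups) and the wavefront size taken from the note above:

```c
#include <assert.h>
#include <stddef.h>

/* Wavefront size quoted above for the 48xx and 58xx. */
#define WAVEFRONT_SIZE 64

/* Round n up to the next multiple of group_size. */
size_t round_up_global_size(size_t n, size_t group_size)
{
    assert(group_size > 0);
    return ((n + group_size - 1) / group_size) * group_size;
}

/* Number of work-groups launched for n items after rounding up. */
size_t num_work_groups(size_t n, size_t group_size)
{
    return round_up_global_size(n, group_size) / group_size;
}
```

The rounded value would be passed as global_work_size and WAVEFRONT_SIZE as local_work_size to clEnqueueNDRangeKernel; the kernel then needs a guard so that work-items with an id beyond the real element count do nothing.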
3. There seems to be a transfer bandwidth performance issue, although the OpenCL bandwidth test from the Nvidia SDK reports high bandwidth (I use mapped mode; the result may be this good because I own a Nehalem, and bandwidth is also this good on Nvidia without using pinned memory). This holds only for d2h and h2d; in this sample d2d does not seem good.
I use clEnqueueRead/WriteBuffer with blocking mode on Radeon HD 5750.
But write throughput is lower than the result of PCIeSpeedTest (ATI Stream Power Toys).
And read throughput is much lower than write throughput. Why?
Test pseudocode:
size = 1024*1024*64;
NUM_TIMING_LOOPS = 100;
buf = clCreateBuffer(context,CL_MEM_READ_WRITE,size,NULL,&errcode);
stopwatch.start (); // use PerformanceCounter
for (int i = 0; i < NUM_TIMING_LOOPS; i ++)
    clEnqueueWriteBuffer(queue,buf,CL_TRUE,0,size,ptr,0,NULL,NULL);
stopwatch.stop ();
printf (...);
Result:
write: 2.575GB/s
read: 1.197GB/s
PCIeSpeedTestResult (v0.2):
[ 67108864 bytes] CPU->GPU= 4.851 GB/sec, GPU->CPU= 861.791 MB/sec
Confirmation of OpenCL perf issue:
This is because of the difference in implementation of PCIeSpeedTest and OpenCL. The PCIe Speedtest goes directly to pinned memory while the OpenCL version copies to PCIe and then to the user memory. We are working on a more optimized path that can avoid this copy under certain conditions in a future release.
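For reference, the figure printed by a timing loop like the one in the test pseudocode can be computed as follows. This is only a sketch: the function name is hypothetical, and it assumes 1 GB = 1024^3 bytes, which the forum post does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput in GB/s for `loops` blocking transfers of
   `bytes_per_transfer` bytes each that took `seconds` in total.
   Assumption: 1 GB = 1024^3 bytes. */
double throughput_gb_per_s(size_t bytes_per_transfer, int loops, double seconds)
{
    assert(seconds > 0.0);
    double total_bytes = (double)bytes_per_transfer * (double)loops;
    return total_bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

With the test's 64 MB buffer and 100 loops, the elapsed time from the stopwatch would be passed as seconds. Because the transfers are blocking (CL_TRUE), the measured time includes the extra copy described in the reply above.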
4. Nvidia provides an OpenCL visual profiler, and AMD is working on similar tools:
We'll be providing an MSVS-integrated profiler that will be capable of reporting the profiling counters in the next release. In the next few months, we'll also provide a Stream Kernel Analyzer that will accept OpenCL C for static analysis of your kernels.
Meanwhile, use the solution in my first post on the blog to get kernels as AMD IL code.
5. printf works in CPU kernels with the Linux OpenCL backend (Apple has a similar debug extension).
DUMP
Yes, printf is currently only supported on the CPU device, as there is no standard library in OpenCL that contains the printf function on the GPU, so it is not valid on every device. This is stated in 6.8.f of the OpenCL 1.0 spec. Apple does support printf in the kernel as a standard debug strategy when using the CPU device.
Here is what I did to get printf to work within my kernels (I am using OpenSUSE though). I just put the stdio.h file in my working directory
Code:
const char *header = "-I stdio.h\0";
err = clBuildProgram(program, 1, devices, header, NULL, NULL);
On GPU? Because I have 9.9 and it works on the CPU too.
Of course GPU... 9.9 didn't support OpenCL; it seems 9.11 does.
6. AMD OpenCL doesn't work with MinGW.
7. AMD reports R6xx cards as supported OpenCL devices, although execution then fails:
Profiling : Yes
Platform ID: 00000000
Name: ATI RV610
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.467
Profile: FULL_PROFILE
Version: OpenCL 1.0 ATI-Stream-v2.0-beta4
Extensions:
Thanks for reporting this. The 6XX series of cards do not have the required hardware to execute OpenCL kernels, so this should not have been displayed as available for execution.
8. An example of a three-pass reduction using shared registers. The poster had problems getting it to work; the key issue seems to be:
Shared register not updated as it ought to be.
Answer:
Just went through our documentation. One very important piece of information is left out that will fix your problems. Access to shared registers is only atomic if done in a single instruction.
i.e.
iadd sr0, sr0, sr1 is correct
but
mov r0, sr0
mov r1, sr1
iadd r2, r0, r1
mov sr0, r2 is incorrect because of the even/odd wavefront issue.
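Why the multi-instruction form loses updates can be illustrated with a small plain-C model. This is not IL and not AMD's implementation: it is a sketch that assumes two wavefronts share one register and that their instructions interleave, with each wavefront adding its own value. All function names are hypothetical.

```c
#include <assert.h>

/* Single-instruction form (iadd sr0, sr0, srN): each add completes
   before the other wavefront's add begins. */
int single_instruction_adds(int sr0, int add_a, int add_b)
{
    sr0 = sr0 + add_a;   /* wavefront A: iadd */
    sr0 = sr0 + add_b;   /* wavefront B: iadd */
    return sr0;
}

/* Multi-instruction form with the two wavefronts interleaved:
   both read sr0 before either one writes it back. */
int interleaved_multi_instruction_adds(int sr0, int add_a, int add_b)
{
    int a_r0 = sr0;            /* A: mov r0, sr0 */
    int b_r0 = sr0;            /* B: mov r0, sr0 (stale once A writes) */
    int a_r2 = a_r0 + add_a;   /* A: iadd r2, r0, r1 */
    int b_r2 = b_r0 + add_b;   /* B: iadd r2, r0, r1 */
    sr0 = a_r2;                /* A: mov sr0, r2 */
    sr0 = b_r2;                /* B: mov sr0, r2, overwrites A's update */
    return sr0;
}

/* Usage: start from 10 and add 1 and 2. */
void check_shared_register_model(void)
{
    assert(single_instruction_adds(10, 1, 2) == 13);
    assert(interleaved_multi_instruction_adds(10, 1, 2) == 12); /* A's +1 is lost */
}
```

In the model, the interleaved version ends one short because wavefront B's write-back is based on the value it read before wavefront A stored its result.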
Thursday, 26 November 2009