News from the web! ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Some things learned in the AMD forums:

1. Why the HD 3xxx has no OpenCL:

Compute shader mode is a hardware feature that did not exist in the HD38XX line of cards. It was introduced with the HD4870. There are also quite a few differences between the SIMD units, one being data sharing between threads, which was added in the HD48XX and greatly improved in the HD5XXX line. Currently our OpenCL implementation does not use hardware local memory on the supported HD4XXX cards because it follows an owner-writes model rather than the fully general model of the HD5XXX.

2. About reqd_work_group_size:
reqd_work_group_size lets the compiler make assumptions and optimizations about the program that cannot be made when using a dynamic local work size as specified in clEnqueueNDRangeKernel.

3. CAL peak FLOPS demo:
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=119351&STARTPAGE=2&FTVAR_FORUMVIEWTMP=Linear

I will add this to my OpenCL and CUDA peak FLOPS code.
For R8xx:
Change #define NR_GROUPS 10 to #define NR_GROUPS 20 (as the 5870 has 20 SIMD cores).
Better still, read the SIMD count from the CAL device info and set it dynamically.

For HD 3xxx hardware:

In order to run that code on the HD3850, you need to change a few things:
il_cs <-- this must be il_ps
; must be equal to THREADS_PER_GRP
dcl_num_thread_per_group 512 <-- this must be removed
vaTid.x <-- this must be vObjIndex.x

Also, you need to declare your vObjIndex in the kernel, as some of the CAL samples do.
You also need to change from calCtxRunProgramGrid to calCtxRunProgram.
This won't work unmodified on that card because it does not have a hardware compute shader.

About the timing:
Thanks! I got 2660 GFLOPS, but the elapsed time remained the same (about 6.5 secs). I tried a number of threads per SIMD in the range 64-256 (wavefront size = 64) and got a much better time with fewer threads, but fewer GFLOPS as well. The best combination was 256 threads, where the time got close to half (3.4 secs) and the GFLOPS were a bit lower than the maximum (about 2600 GFLOPS). So, is it all about overhead, or is something wrong with the app / CAL initializations?
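
To put the 2660 GFLOPS figure in context, here is a minimal sketch of the theoretical single-precision peak for an R8xx-style GPU. Only the 20 SIMDs come from the thread above; the rest are assumptions taken from the published HD 5870 figures (16 thread processors per SIMD, 5 ALUs per thread processor, one multiply-add counted as 2 FLOPs per ALU per clock, 850 MHz engine clock). The function names are hypothetical.

```c
/* Theoretical peak in GFLOPS for an R8xx-style GPU.
 * Assumed, not taken from the forum thread: 16 thread processors per SIMD,
 * 5 ALUs per thread processor, 2 FLOPs (one multiply-add) per ALU per clock. */
double peak_gflops(int simd_count, int engine_mhz)
{
    const int alus_per_simd = 16 * 5;
    const int flops_per_alu_per_clock = 2;
    return (double)simd_count * alus_per_simd * flops_per_alu_per_clock
           * engine_mhz / 1000.0;
}

/* Fraction of the theoretical peak that a measured result reaches. */
double efficiency(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

Under these assumptions peak_gflops(20, 850) gives 2720 GFLOPS, so the measured 2660 GFLOPS would be roughly 98% of peak.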

4. What about warps on AMD?
Wavefront size:
Actually, the wavefront size is 64 only on the high-end cards (48XX, 58XX, 57XX); it is 32 on the mid-range cards and 16 on the low-end cards. You can query CAL for the value on your specific card.
OpenCL itself offers no way to query this, although NVIDIA's OpenCL extensions do expose the warp size.

The best you can do is query CL_KERNEL_WORK_GROUP_SIZE with the clGetKernelWorkGroupInfo API call. That tells you the largest work-group size you can execute, and that value will most likely be an integer multiple of the wavefront/warp size. In some cases the value will be equal to the wavefront/warp size, but that is usually because of resource constraints.
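
As a rough illustration of how that value might be used on the host, the sketch below picks a local size that is a whole number of wavefronts and pads the global size to match. The helper names are hypothetical, and the wavefront size has to be supplied by the caller (for example 64 on the high-end cards), since it cannot be queried through core OpenCL.

```c
#include <stddef.h>

/* Largest multiple of `wavefront` that does not exceed `max_wg`, where
 * `max_wg` is the value reported for CL_KERNEL_WORK_GROUP_SIZE.
 * Falls back to `max_wg` when it is smaller than one wavefront. */
size_t pick_local_size(size_t max_wg, size_t wavefront)
{
    if (wavefront == 0 || max_wg < wavefront)
        return max_wg;
    return (max_wg / wavefront) * wavefront;
}

/* Round `n` up to a whole number of work-groups of size `local`,
 * because the global size must be divisible by the local size. */
size_t round_up_global(size_t n, size_t local)
{
    if (local == 0)
        return n;
    return ((n + local - 1) / local) * local;
}
```

When the global size is padded this way, the kernel needs a bounds check so the extra work-items do nothing.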

5. Number of CPU cores:
Try setting the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use.

6. In OpenCL kernels:
error: invalid type conversion

uint* s = (uint*)(structData + ndx * structSz);

The problem is that you are converting between address spaces, which is illegal in OpenCL: structData points into __global memory, while an unqualified uint* points into private memory. The correct way to do this is as follows:
__kernel void test(__global char *structData, uint structSz)
{
    uint ndx = get_local_id(0);
    global uint* s = (global uint*)(structData + ndx * structSz);
    *s = ndx;
}
Thanks, that fixes it. Interestingly, the NVIDIA compiler does not insist on the address-space qualifier, but fails later.
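
For checking the kernel's output on the host, a plain C reference of the same byte-offset addressing can help. This is only a sketch: the function names are hypothetical, it covers a single work-group (the kernel indexes with get_local_id), and it uses memcpy instead of a pointer cast because host C has alignment rules that the kernel snippet does not show.

```c
#include <stdint.h>
#include <string.h>

/* Host-side reference: for each index ndx in [0, count), store ndx in the
 * first 4 bytes of record ndx. `buf` must hold count * struct_sz bytes and
 * struct_sz must be at least 4. */
void reference_fill(unsigned char *buf, uint32_t struct_sz, uint32_t count)
{
    for (uint32_t ndx = 0; ndx < count; ++ndx)
        memcpy(buf + (size_t)ndx * struct_sz, &ndx, sizeof ndx);
}

/* Read back the 4-byte value stored at the start of record ndx. */
uint32_t read_record_header(const unsigned char *buf, uint32_t struct_sz,
                            uint32_t ndx)
{
    uint32_t value;
    memcpy(&value, buf + (size_t)ndx * struct_sz, sizeof value);
    return value;
}
```

Comparing each record header read back from the device buffer against this reference would confirm that the address-space fix writes where it is expected to.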

7. No program binaries yet:
Presently, only clCreateProgramWithSource is supported, so you cannot yet do what you are expecting from clCreateProgramWithBinary. This will be available in upcoming releases.

8. Porting the NVIDIA radix sort demo:

They hit the same error as I did:
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=120879&highlight_key=y

Some things learned in the Beyond3D forums:

1. Unlike CS, OpenCL kernels do not need to specify a local work-group size:
This is something I think CS should support. In OpenCL, work group size can be decided by the implementation. For example, if you want to perform computation on one million numbers, you can just tell it to create one million work items, and the implementation should automatically decide how many work items a work group should have. In a sense this is very similar to a pixel shader.

2. Compute shaders can write to the backbuffer directly.

RWTexture3D issues:
I have another version running twice as fast by storing the volumes as 4-channel floats instead of single-channel.
It gets a little messy to update the sources then, as you cannot write to single channels of an unordered access texture resource, sigh ...
Also, the rendering then cannot make use of the hardware texture filtering.
There seems to be no way to cast a 4-channel view to a 1-channel view.
Reading from a RWTexture3D is also impossible if it is RGBA; it can only be single-channel.

Thursday, 29 October 2009