GPU computing
Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information


Thursday, 29 October 2009

News from the web!

Posted on 19:04 by Unknown
Some things learned in AMD forums:

1. Why the HD3xxx cards have no OpenCL:

Compute shader mode is a hardware feature that did not exist in the HD38XX line of cards. It was introduced with the HD4870. There are also quite a few differences between the SIMD units, one being data sharing between threads, which was added in the HD48XX and greatly improved in the HD5XXX line. Currently our OpenCL implementation does not use hardware local memory on the supported HD4XXX cards because that memory follows an owner-writes model rather than the fully general model of the HD5XXX.

2. About reqd_work_group_size:
reqd_work_group_size lets the compiler make assumptions and optimizations about the program that cannot be made when using a dynamic local work size as specified in clEnqueueNDRangeKernel.

3. CAL peak FLOPS demo:
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=119351&STARTPAGE=2&FTVAR_FORUMVIEWTMP=Linear

I will add this to my OpenCL and CUDA peak FLOPS code.
For r8xx:
Change #define NR_GROUPS 10 to #define NR_GROUPS 20 (as the 5870 has 20 SIMD cores).
Better still, use the CAL device info to set it dynamically.

For HD3xxx hardware:

In order to run that code on the HD3850, you need to change a few things:
il_cs <-- this must be il_ps
; must be equal to THREADS_PER_GRP
dcl_num_thread_per_group 512 <-- this must be removed
vaTid.x <-- this must be vObjIndex.x

You also need to declare vObjIndex in the kernel, as some of the CAL samples do, and change from calCtxRunProgramGrid to calCtxRunProgram.
The code won't work unmodified on that card because it does not have hardware compute shader support.
About the timing (a reply in the same thread):
Thanks! I got a value of 2660 GFLOPS, but the elapsed time remained the same (about 6.5 s). I tried numbers of threads per SIMD in the range 64-256 (wavefront size = 64) and got much better times with fewer threads, but fewer GFLOPS as well. The best combination was 256 threads, where the time came close to half (3.4 s) and the GFLOPS were a bit lower than the maximum (about 2600 GFLOPS). So is it all about overhead, or is something wrong with the app / CAL initializations?
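
To put the 2660 GFLOPS figure in context, here is a small sketch in plain C that computes a theoretical peak and the fraction of it that a measurement reaches. Only the 20 SIMD cores come from the thread; the other inputs used in the note below (16 lanes per SIMD, 5 ALUs per lane, one multiply-add = 2 flops per ALU per clock, 850 MHz) are my assumptions for the HD 5870, and the function names are made up for illustration.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
 * SIMDs x lanes per SIMD x ALUs per lane x flops per ALU per clock x clock.
 * All hardware parameters are supplied by the caller (assumptions, not
 * values queried from CAL). */
double peak_gflops(int simd_count, int lanes_per_simd, int alus_per_lane,
                   int flops_per_alu_per_clock, double clock_mhz)
{
    assert(simd_count > 0 && lanes_per_simd > 0 && alus_per_lane > 0);
    assert(flops_per_alu_per_clock > 0 && clock_mhz > 0.0);

    double flops_per_clock = (double)simd_count * lanes_per_simd
                           * alus_per_lane * flops_per_alu_per_clock;
    return flops_per_clock * clock_mhz / 1000.0;  /* MFLOPS -> GFLOPS */
}

/* Fraction of the theoretical peak that a measured figure reaches. */
double efficiency(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With those assumed inputs, peak_gflops(20, 16, 5, 2, 850.0) gives 2720, so the measured 2660 GFLOPS would be roughly 98% of peak. Halving the SIMD count (the old NR_GROUPS 10) halves the peak as well.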


4. What is the warp equivalent on AMD?
Wavefront size
Actually the wavefront size is only 64 for the high-end cards (48XX, 58XX, 57XX), but 32 for the mid-range cards and 16 for the low-end cards. You can query via CAL for information on your specific card.
It is not possible to query this in OpenCL itself, although NVIDIA's OpenCL extensions do expose the warp size.

The best you can do is query CL_KERNEL_WORK_GROUP_SIZE with the clGetKernelWorkGroupInfo API call. That tells you the largest work-group size you can execute, and that value will most likely be an integer multiple of the wavefront/warp size. In some cases the value will be equal to the wavefront/warp size, but that is usually because of resource constraints.
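
To make the relation between work-group size and wavefront size concrete, here is a tiny plain C sketch (no OpenCL calls; the helper names are hypothetical). It counts how many wavefronts a work-group occupies and how many lanes stay idle when the group size is not a multiple of the wavefront size. The wavefront sizes 64, 32 and 16 are the ones quoted above.

```c
#include <assert.h>

/* Number of wavefronts needed to cover one work-group (rounds up). */
unsigned wavefronts_per_group(unsigned group_size, unsigned wavefront_size)
{
    assert(wavefront_size > 0);
    return (group_size + wavefront_size - 1) / wavefront_size;
}

/* Lanes left without a work-item in the last, partially filled wavefront. */
unsigned idle_lanes(unsigned group_size, unsigned wavefront_size)
{
    return wavefronts_per_group(group_size, wavefront_size) * wavefront_size
         - group_size;
}
```

For example, a group of 256 work-items fills exactly 4 wavefronts of 64, while a group of 100 on a card with 32-wide wavefronts needs 4 wavefronts and leaves 28 lanes idle.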


5. Number of CPU cores:
Try setting the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use.
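
If you would rather set the variable from inside the program than from the shell, something like the following sketch might do. It uses only POSIX setenv. The function name is made up, and it is my assumption (not stated in the forum) that the runtime reads the variable when it initializes, so the call would have to happen before the first OpenCL call in the process.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Ask the CPU OpenCL device to expose at most n compute units by setting
 * CPU_MAX_COMPUTE_UNITS for this process.
 * Assumption: must run before the OpenCL runtime is initialized.
 * Returns 0 on success, -1 on failure (as setenv does). */
int limit_cpu_compute_units(int n)
{
    char value[16];

    assert(n > 0);
    snprintf(value, sizeof value, "%d", n);
    return setenv("CPU_MAX_COMPUTE_UNITS", value, 1);  /* 1 = overwrite */
}
```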

6. In OpenCL kernels:
error: invalid type conversion

uint* s = (uint*)(structData + ndx * structSz);

The problem is that you are converting between address spaces, which is illegal in OpenCL. The pointer must keep the __global qualifier of the buffer it is derived from:
__kernel void test(__global char *structData, uint structSz)
{
    uint ndx = get_local_id(0);
    /* keep the __global address space when casting */
    __global uint *s = (__global uint *)(structData + ndx * structSz);
    *s = ndx;
}
Thanks, that fixes it. Interestingly, the NVIDIA compiler does not insist on the address-space qualifier, but fails later.
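
To check what such a kernel should produce, a host-side reference in plain C can help. The sketch below is my own (hypothetical function names, not from the thread) and assumes a single work-group, so that get_local_id(0) simply runs from 0 to count - 1. It writes each index at the start of its record, the same byte-stride addressing as in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-side reference: for each index ndx, store ndx as a 32-bit value at the
 * start of the ndx-th record of struct_sz bytes. */
void reference_test(unsigned char *struct_data, uint32_t struct_sz, uint32_t count)
{
    assert(struct_sz >= sizeof(uint32_t));
    for (uint32_t ndx = 0; ndx < count; ++ndx) {
        /* memcpy avoids unaligned or aliasing-unsafe pointer casts on the host */
        memcpy(struct_data + (size_t)ndx * struct_sz, &ndx, sizeof ndx);
    }
}

/* Read back the 32-bit value stored at the start of record ndx. */
uint32_t read_record(const unsigned char *struct_data, uint32_t struct_sz, uint32_t ndx)
{
    uint32_t v;
    memcpy(&v, struct_data + (size_t)ndx * struct_sz, sizeof v);
    return v;
}
```

Comparing the buffer read back from the device against this reference is one way to notice a kernel that compiles but, as in the NVIDIA case above, fails later.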

7. Binaries cannot be created yet:
Presently only clCreateProgramWithSource is supported. What you are expecting from clCreateProgramWithBinary will be available in upcoming releases.

8. Porting the NVIDIA radix sort demo:

Others are hitting the same error as me:
http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=120879&highlight_key=y

Some things learned in the Beyond3D forums:

1. OpenCL kernels do not need to specify a local work-group size, unlike compute shaders (CS):
This is something I think CS should support. In OpenCL, work group size can be decided by the implementation. For example, if you want to perform computation on one million numbers, you can just tell it to create one million work items, and the implementation should automatically decide how many work items a work group should have. In a sense this is very similar to a pixel shader.
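
In the OpenCL API this corresponds to passing NULL as the local_work_size argument of clEnqueueNDRangeKernel. How a runtime then picks the size is up to the implementation. Just to illustrate the idea, here is one possible heuristic in plain C: take the largest size that does not exceed the device limit and divides the global size evenly. This is a made-up example, not what any particular vendor's runtime does.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative heuristic only: return the largest work-group size that is
 * <= max_group_size and divides global_size without remainder. */
size_t pick_local_size(size_t global_size, size_t max_group_size)
{
    assert(global_size > 0 && max_group_size > 0);

    size_t candidate = global_size < max_group_size ? global_size
                                                    : max_group_size;
    while (global_size % candidate != 0)
        --candidate;  /* terminates at 1, which divides everything */
    return candidate;
}
```

For the one million work items mentioned above and a limit of 256, this picks 250 (1,000,000 = 250 x 4000). A prime global size degenerates to groups of one work item, which is one reason a real runtime would likely do something smarter.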

2. Compute shaders can write to the back buffer directly.

RWTexture3D issues:
I have another version running twice as fast by storing the volumes as 4-channel floats instead of single-channel.
It gets a little messy then to update the sources, as you cannot write to single channels of an unordered access texture resource, sigh ...
Also, the rendering then cannot make use of the hardware texture filtering.
There seems to be no way to cast a 4-channel view to a 1-channel view.
Reading from an RWTexture3D is also impossible if it is RGBA; it works only for single-channel formats.
