GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information


Saturday, 28 November 2009

Interesting AMD Stream forums posts! (old posts)

Posted on 02:55 by Unknown
Over time I have collected links to interesting posts in the AMD Stream forums: some explain things that were not documented at the time, others contain interesting code or describe projects built with it, and so on.

Here they are:

Released GPUwareC with GWSDK 0.5.1 test

GPUware C 0.5 test release

The GPUware C compiler allows one to code in a C-like language to construct AMD GPU kernel calls. It allows one to program with AMD's CAL, to produce high performance software, and still code kernels in a high level language. The AMD IL produced by the compiler is very readable, and can be easily modified by hand.

Can LDS reads be "broadcast" within a wavefront?


LDS : More info requested

Features of Stream SDK 1.2?


Double memory copy in CAL ? What about calCtxResCreate ?

GPU memory architecture?

Measuring HD 4850 performance 1tflop shader

You can have a maximum of 128 registers per thread. The number of wavefronts that can execute on a single SIMD is determined by the register usage of your shader (the total number of registers per SIMD is 64*256).

bursting global reads and global memory bandwidth?

global GPR vs. global data store (another)


As for PV/PS, you cannot turn them off and you really would not want to turn them off as they provide a performance bonus over normal register usage.
The maximum supported 2D stream dimensions are 8192x8192, and the maximum supported 1D dimension is 2^26.

You can either rearrange the data to fit these dimensions or change the algorithm to process the data tile by tile on the GPU (take a look at the out-of-core MMM in samples/CPP/apps). The 4870 has the same limitation.
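
The tile-by-tile idea can be sketched in a few lines. This is only an illustration of how a large 2D domain could be cut into rectangles that each fit within the 8192x8192 limit quoted above; the function name `tiles` and the example domain size are made up for the sketch, and it is not taken from the SDK sample.

```python
MAX_DIM = 8192  # maximum 2D stream dimension quoted in the forum post


def tiles(width, height, max_dim=MAX_DIM):
    """Yield (x, y, w, h) rectangles that cover a width x height domain.

    Hypothetical helper: every rectangle is at most max_dim on each side,
    so each one could be processed as a separate stream.
    """
    for y in range(0, height, max_dim):
        for x in range(0, width, max_dim):
            yield x, y, min(max_dim, width - x), min(max_dim, height - y)


if __name__ == "__main__":
    # Example domain larger than the limit in both directions.
    for tile in tiles(20000, 10000):
        print(tile)
```

Each rectangle would then be uploaded, processed and read back on its own, which is the general shape of an out-of-core approach.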

About r7xx arch
OK,
So there are 163840 registers on the RV770. There are 10 SIMDs, which gives us 16384 registers per SIMD, or 16K x 128-bit as specified in the "Registers per SIMD Core" row.
Now, the article states right above the table that there are 64 threads per wavefront. So 16384 / 64 gives you 256 registers per thread.
If you run a problem domain of 1026 * 1026, assuming 1 thread per location, that gives you 1,052,676 threads that need to be executed.
Dividing that by the wavefront size gives you 16449 (rounded up) wavefronts that will be spawned by the GPU for this domain.
Now, let's assume that you have 5 registers per thread (which can be determined from the KSA disassembly). This lets you run a maximum of (256 / 5) = 51 wavefronts in parallel per SIMD, or 510 at a time on the GPU.
So this means that you have enough wavefronts to fill up the GPU at least 32 times (16449 / 510 is roughly 32.3).
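
The arithmetic above can be reproduced with a small script. It is a sketch of the same simplified model only: the constants are the figures quoted in the post, the function name `occupancy` is made up, and other hardware limits (such as the 128 registers per thread maximum mentioned earlier) are ignored.

```python
# Figures quoted in the forum post for RV770.
TOTAL_REGISTERS = 163840
NUM_SIMDS = 10
WAVEFRONT_SIZE = 64


def occupancy(domain_w, domain_h, regs_per_thread):
    """Return the quantities derived in the post for a given domain.

    Hypothetical helper: registers are the only resource considered.
    """
    regs_per_simd = TOTAL_REGISTERS // NUM_SIMDS           # 16384
    reg_slots_per_thread = regs_per_simd // WAVEFRONT_SIZE  # 256
    threads = domain_w * domain_h
    # Round up: a partially filled wavefront still has to be launched.
    wavefronts = -(-threads // WAVEFRONT_SIZE)
    per_simd = reg_slots_per_thread // regs_per_thread
    resident = per_simd * NUM_SIMDS
    return {
        "threads": threads,
        "wavefronts": wavefronts,
        "wavefronts_per_simd": per_simd,
        "resident_wavefronts": resident,
        "gpu_fills": wavefronts / resident,
    }


if __name__ == "__main__":
    for key, value in occupancy(1026, 1026, 5).items():
        print(key, value)
```

Changing `regs_per_thread` shows how quickly the number of wavefronts in flight drops as a kernel uses more registers.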

So, assuming that your application gets all of the resources on the chip, this is what you should expect. However, because of other constraints this is the best case scenario and not the average case. So this should give you some idea about what you can do.

Hope this helps. That review article is fairly well done, and if you analyze it with a compute mindset you can figure out a lot of things that our docs don't currently specify.
http://www.anandtech.com/printarticle.aspx?i=3341


Using 4870x2
============

Many wasted hours later, I think I've found my problem. I needed to call calCtxIsEventDone() after calling the kernel for each GPU to allow the concurrency to occur. Seems like a messy trick - perhaps the multiGPU paragraph in the user guide could be expanded to mention this?

Take a look at section 2.16.3 of the Stream Computing User Guide to see how to use multiple GPUs from a single thread. I would suggest creating separate threads for multiple GPUs, as leveraging asynchronous kernel calls requires a lot of tuning and the call might not be asynchronous in some cases. Take a look at the Brook+ sample MonteCarlo_MultiGPU and the tutorial MultiGPU.
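
The "one host thread per GPU" structure suggested above can be outlined as follows. This is only a structural sketch in Python, not CAL or Brook+ code: `run_on_all_devices` and the `work` callable are hypothetical stand-ins for whatever opens a device, runs the kernel and waits for it to finish.

```python
import threading


def run_on_all_devices(device_ids, work):
    """Run work(device_id) on one host thread per device.

    Hypothetical helper: 'work' stands in for the per-GPU code
    (open device, create context, run kernel, wait for the event).
    """
    results = {}
    lock = threading.Lock()

    def worker(dev):
        out = work(dev)  # blocking per-device work
        with lock:
            results[dev] = out

    threads = [threading.Thread(target=worker, args=(d,)) for d in device_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every device has finished
    return results


if __name__ == "__main__":
    # Two devices, as on a 4870x2; the lambda is a placeholder workload.
    print(run_on_all_devices([0, 1], lambda dev: "done on GPU %d" % dev))
```

With this layout each thread can block on its own device, so concurrency does not depend on the kernel call being asynchronous.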

Measuring CAL time
==================

The correct way to time a CAL kernel is to follow this pattern:

flush
start timer
execute kernel
wait on event
stop timer
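
As a sketch, the pattern looks like this. The three callables `flush`, `launch` and `wait` are hypothetical placeholders for the real API calls and are not CAL functions; in CAL the wait step is presumably a poll on calCtxIsEventDone(), the call mentioned in the 4870x2 post above.

```python
import time


def time_kernel(flush, launch, wait):
    """Time one kernel run using the flush/start/execute/wait/stop order.

    flush, launch and wait are hypothetical callables supplied by the caller.
    """
    flush()                       # drain previously queued work first
    start = time.perf_counter()   # start timer
    event = launch()              # execute kernel, returns an event handle
    wait(event)                   # block until the event has completed
    return time.perf_counter() - start  # stop timer


if __name__ == "__main__":
    # Stand-ins that only simulate a kernel taking about 10 ms.
    elapsed = time_kernel(
        flush=lambda: None,
        launch=lambda: "event",
        wait=lambda event: time.sleep(0.01),
    )
    print("elapsed: %.4f s" % elapsed)
```

The flush before the timer keeps earlier queued work out of the measurement, and the wait keeps the timer from stopping as soon as the launch call returns.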
