GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Thursday, 5 November 2009

Exploring Nvidia OpenCL 195.39 drivers: Bugs, performance issues, lacking extensions, suggestions

Posted on 12:17 by Unknown
Hi,
I have checked Nvidia's first consumer OpenCL drivers (195), and they are very stable, generally fast and, as some of you probably know, they include:

*Double precision
*OpenGL interop

Anyway, I have found some performance issues, some possible bugs, and a few other things I want to mention.

These were uncovered while testing the AMD samples (more details):
http://oscarbg.blogspot.com/2009/11/amd-op...195-opencl.html

Before anything else, note that performance is by now equivalent to CUDA and DX Compute:
the Nvidia N-body demos for CUDA, OCL and DX Compute are roughly on par, at close to 500 GFLOPS on an overclocked GTX 275.
The OCL demo is about 5-10% slower than the others, but that is probably because it isn't using any graphics interop, while the CUDA demo does and the DX Compute one probably does too.

Also, the oclBandwidthTest and CUDA bandwidthTest reports are very similar.

Bugs
====
Note that all of the following works in the AMD implementation:

1.
I have found that a kernel without parameters (__kernel void main()) does not compile.
OK, it is a toy example, but AMD uses it in the HelloCL sample. The compiler returns:

:5: error: a __kernel function cannot have varargs or stdargs
__kernel void

2. Related to uint4 to float4 conversion:

A kernel containing
temp1 = ((float4)(temp[i])) * one / intMax;
fails.
To make it work, we have to build the float4 component by component:
((float4)(temp[i].x,temp[i].y,temp[i].z,temp[i].w))

3. Related to math functions: passing int arguments doesn't
find the correct overload:

:35: error: no matching overload found for arguments of type 'int, int'
int mask = pow(2, k);
^~~
:45: error: no matching overload found for arguments of type 'int, int'
output[global_id] = temp / pow(2, 32);

:35: error: no matching overload found for arguments of type 'int, int'
outputImage[x + y * width] = hypot(Gx,Gy)/2;

FIX: change your parameters to float by adding a (float) cast:
pow((float)2,(float)k)
hypot((float)Gx,(float)Gy)/2;
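
For the two power-of-two cases specifically, the pow call can be avoided altogether. A minimal sketch in plain C, assuming k is in the range 0..31 and that temp is an unsigned 32-bit value (neither is stated in the sample output above); the helper names pow2_u32 and scale_by_2_pow_32 are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Integer replacement for pow(2, k): a plain shift, no math overload needed.
   Assumes 0 <= k < 32. */
static uint32_t pow2_u32(unsigned k) {
    assert(k < 32);
    return (uint32_t)1 << k;
}

/* Replacement for temp / pow(2, 32): divide by the float constant 2^32. */
static float scale_by_2_pow_32(uint32_t temp) {
    return (float)temp / 4294967296.0f;
}
```

The same shift and constant should carry over to kernel code, with uint in place of uint32_t. The hypot case still needs the (float) casts.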

I have also seen this warning:
:10: warning: unknown '#pragma OPENCL EXTENSION' - ignored
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
Is this correct?

Performance issues:
================

I have not fully explored the 3D volume texture sample, but it is one of the remaining samples that runs very slowly compared to CUDA.
From memory, it was something like 150 fps vs 14 fps.

I hoped that this was related to OpenGL interop not working in previous drivers.
I have enabled GL_INTEROP and created an OpenGL-enabled context:
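cl_context_properties akProperties[] = {
    CL_GL_CONTEXT_KHR,
    (cl_context_properties)wglGetCurrentContext(),  /* current OpenGL rendering context */
    CL_WGL_HDC_KHR,
    (cl_context_properties)wglGetCurrentDC(),       /* device context of that GL context */
    0                                               /* terminates the property list */
};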

But the performance remains the same.
Can someone at Nvidia explain where that enormous difference in fps comes from?
Tested on Windows 7. Is the cause in the WDDM driver model?

Suggestions and questions
=======================
I hope I'm not misunderstanding something.

Two OpenCL examples fail with an out-of-resources error on Nvidia.

Both work on AMD GPUs.

1. Mandelbrot does the crazy thing of launching a 1D global group of 65536 threads with local workgroups of one element.
I worked around it by reducing the resolution so that it launches 16K threads, and that works.

Nvidia could support this if it were arranged as a 2D global group of 256x256. Correct?
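
If the failure really is a per-dimension limit on the number of workgroups (an assumption, not something confirmed here), the 256x256 layout needs the kernel to rebuild the 1D index from its two global ids. A sketch of that index arithmetic in plain C, with the values a kernel would get from get_global_id(0), get_global_id(1) and get_global_size(0) passed in as ordinary parameters; linear_index is a made-up name:

```c
#include <assert.h>
#include <stddef.h>

/* Rebuild the 1D element index from a 2D launch.
   gid_x  : what get_global_id(0) would return
   gid_y  : what get_global_id(1) would return
   size_x : what get_global_size(0) would return (256 in the case above) */
static size_t linear_index(size_t gid_x, size_t gid_y, size_t size_x) {
    assert(gid_x < size_x);
    return gid_y * size_x + gid_x;
}
```

With size_x = 256 the ids (0,0) through (255,255) cover indices 0 through 65535, the same range as the original 1D launch.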

I think the limitation is hardware dependent and is also exposed in CUDA, but can't Nvidia implement a loop within the driver that executes as many local workgroups as the hardware allows in each step?
In theory, doesn't the relaxed CUDA execution model permit this? Threads in different local workgroups have no way to communicate other than across kernel launches or via atomics.
Also, is Fermi going to support 1D global groups that large?
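
Until something like that exists in the driver, the same loop can be written by hand on the host side. A rough sketch in plain C of how a large 1D range could be split into several smaller launches; chunk_t, launches_needed and chunk_for_step are hypothetical names, and the per-launch limit is a plain parameter, not a value read from any driver. Each launch would need to be told its offset, for example through an extra kernel argument that the kernel adds to get_global_id(0):

```c
#include <assert.h>
#include <stddef.h>

/* One launch: process `count` work-items starting at `offset`. */
typedef struct {
    size_t offset;
    size_t count;
} chunk_t;

/* Number of launches needed to cover `total` work-items. */
static size_t launches_needed(size_t total, size_t max_per_launch) {
    assert(max_per_launch > 0);
    return (total + max_per_launch - 1) / max_per_launch;
}

/* Chunk for launch number `step`; count == 0 means there is nothing left. */
static chunk_t chunk_for_step(size_t total, size_t max_per_launch, size_t step) {
    chunk_t c = {0, 0};
    size_t offset;
    assert(max_per_launch > 0);
    offset = step * max_per_launch;
    if (offset >= total) {
        return c;
    }
    c.offset = offset;
    c.count = (total - offset < max_per_launch) ? (total - offset) : max_per_launch;
    return c;
}
```

With 65536 work-items and a limit of 16384 per launch this gives four launches, at offsets 0, 16384, 32768 and 49152.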


2. About shared memory

I have two questions (they apply to CUDA as well):

AMD emulates local memory on the 4xxx series via global memory.
That gives a very big slowdown in performance, but could Nvidia do the same, so that programs which compile and run well on the AMD backend also run on Nvidia without changes?

The OpenCL driver knows how much shared memory the executable requires and how much the GPU has, so if the requirement is greater it could fall back to global memory.
I know the emulation can get complicated, for example if the code uses shared memory atomics or memory fences on shared memory.
Is it at least possible?

Does Fermi help here with its unified address space?

About lacking extensions
====================
Also, Nvidia, I'm waiting for the 3D image writes extension for better performance in 3D lattice codes, but I think it is all Fermi related, since CUDA has no support for it and DirectCompute 5.0 exposes RWTexture3D only to new cards.
Correct?
Also, what about 64-bit atomics? They are supported on GT2xx cards in CUDA, aren't they?