GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Wednesday, 4 November 2009

AMD OpenCL samples on Nvidia 195 OpenCL drivers!!

Posted on 14:42 by Unknown
It all began last week, when Nvidia released the 195 drivers with support for the OpenCL ICD and, most importantly, binary compatibility with the AMD OpenCL implementation on this platform.
However, Geeks3D reported that HelloCL.exe was not working.

Let's begin the investigation.

On Windows x64 and Linux (either 32-bit or amd64) it was already working.

NOTE: (for use with the Nvidia OpenCL SDK)
===============================
Some people on the Nvidia forums don't know how to do this: just link against the libraries (.lib files), and in cl_platform.h define CL_API_CALL as __stdcall (see the AMD header). Also, if you want OpenCL/OpenGL interop, use cl_gl.h from the AMD SDK or the current cl_gl.h from Khronos.


First, I created some simple Windows scripts that check all the samples for errors.
Download them!
(Only a.bat, b.bat and test.bat are needed; the others are for the fixes in the Nvidia OpenCL SDK post.)
Extract them into the AMD executables directory and run test.bat from a shell.

This works because most samples support the -e flag, which verifies the results and prints whether the sample passed or failed.
If you also want to get rid of all the other output, use -q.
My scripts call every executable with these flags and redirect the console output to a file.

Now test this with the Nvidia OpenCL implementation.

log
===
Running AESEncryptDecrypt
Failed
Running BinarySearch
Passed!
Running BinomialOption
Passed!
Running BitonicSort
Passed!
Running BlackScholes
Passed!
Running DCT
Passed!
Running DwtHaar1D
Passed!
Running EigenValue
Passed!
Running FastWalshTransform
Passed!
Running FloydWarshall
Passed!
Running HelloCL
HelloCL!
Creating a context
Running Histogram
Error: clEnqueueNDRangeKernel failed. Error code : CL_OUT_OF_RESOURCES
Running Mandelbrot
Error: clEnqueueNDRangeKernel failed. Error code : CL_OUT_OF_RESOURCES
Running MatrixMultiplication
Passed!
Running MatrixTranspose
Passed!
Running MersenneTwister
Error: clBuildProgram failed. Error code : unknown error code
Running MonteCarloAsian
Error: clBuildProgram failed. Error code : unknown error code
Running NBody
Passed!
Running PrefixSum
Passed!
Running QuasiRandomSequence
Error: clBuildProgram failed. Error code : unknown error code
Running RadixSort
Failed
Running RecursiveGaussian
Passed!
Running Reduction
Passed!
Running ScanLargeArrays
Passed!
Running SimpleConvolution
Passed!
Running SobelFilter
Error: clBuildProgram failed. Error code : unknown error code
Running Template

Input:
0 1 2 3 4 5 6 7 8 9 ..
Error: Creating Context. (clCreateContextFromType)

Running TemplateC

Input:
0 1 2 3 4 5 6 7 8 9 ..
Error: Creating Context. (clCreateContextFromType)
Error: Setting kernel argument. (output)

Output:
3452816845 ..
Error: In clReleaseKernel

As you can see, a lot of them work, but HelloCL does not.

The failures fall into four kinds:

*Samples hardcoded to use the CPU backend, which Nvidia does not have (Error: Creating Context. (clCreateContextFromType))
*Kernel source (.cl) that doesn't compile, due to bugs in the implementations or cases not defined in the OpenCL spec (Error: clBuildProgram failed. Error code : unknown error code)
*Not enough resources on Nvidia GPUs or in the OpenCL implementation (clEnqueueNDRangeKernel failed. Error code : CL_OUT_OF_RESOURCES)
*Incorrect computation (Failed instead of Passed!)

Grouped by kind:


CPU: HelloCL, Template, TemplateC
Kernel source: HelloCL, MersenneTwister, MonteCarloAsian, QuasiRandomSequence, SobelFilter
Resources: Histogram, Mandelbrot
Results fail: AESEncryptDecrypt, RadixSort, SobelFilter (SobelFilter once its kernel compiles)
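
As a rough illustration of how the log messages map onto the four kinds, here is a minimal C++17 sketch. The names Outcome, contains and classify are hypothetical (they are not part of the test scripts or of any SDK); the matched strings are the ones that appear in the log above.

```cpp
#include <string_view>

// The four failure kinds from the list above, plus Passed and Unknown.
enum class Outcome { Passed, NoCpuBackend, KernelBuild, OutOfResources, WrongResult, Unknown };

// Hypothetical helper: true if `line` contains `needle`.
constexpr bool contains(std::string_view line, std::string_view needle) {
    return line.find(needle) != std::string_view::npos;
}

// Classify one log line by the message text it carries.
// Lines such as "Running BinarySearch" carry no verdict and map to Unknown.
constexpr Outcome classify(std::string_view line) {
    if (contains(line, "clCreateContextFromType")) return Outcome::NoCpuBackend;
    if (contains(line, "clBuildProgram failed"))   return Outcome::KernelBuild;
    if (contains(line, "CL_OUT_OF_RESOURCES"))     return Outcome::OutOfResources;
    if (contains(line, "Failed"))                  return Outcome::WrongResult;
    if (contains(line, "Passed"))                  return Outcome::Passed;
    return Outcome::Unknown;
}
```

Feeding each line of the redirected log through classify and counting the results per sample would reproduce the grouping above. The matching is case-sensitive on purpose, so "clEnqueueNDRangeKernel failed" is not mistaken for a "Failed" verdict.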


With trivial fixes you can get to this state:

CPU: errors fixed
Kernel source: errors fixed
Resources: errors fixed
Results fail: I have not fixed them



To fix the CPU errors, change the device type passed to clCreateContextFromType from CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU.

Kernel source errors:
==============

All of these are related to int-to-float conversions not working:
a. uint4 to float4
b. math functions not being resolved when passed ints

1. HelloCL

The Nvidia compiler returns:

:5: error: a __kernel function cannot have varargs or stdargs
__kernel void

The kernel has no parameters:
__kernel void main()

This is likely a bug in the Nvidia compiler, although it is no problem for practical purposes. To work around it, give the kernel a parameter:
__kernel void hello(uint width)
Now you have to set that parameter; if you don't, executing the kernel (enqueueNDRangeKernel) fails with error -52 (CL_INVALID_KERNEL_ARGS), which is related to setting parameters. So add
kernel.setArg(0, 0);

before executing, and the example is fixed.



2. and 3. MonteCarloAsian and MersenneTwister samples

This is likely a bug in the Nvidia implementation, related to converting a uint4 to a float4 directly:

temp1 = ((float4)(temp[i])) * one / intMax;
To make it work, we have to build the float4 component by component:
((float4)(temp[i].x,temp[i].y,temp[i].z,temp[i].w))


4. QuasiRandomSequence sample:
We get:
:35: error: no matching overload found for arguments of type 'int, int'
                int mask = pow(2, k);
                           ^~~
:45: error: no matching overload found for arguments of type 'int, int'
                output[global_id] = temp / pow(2, 32);

Cast the arguments to float in both calls, for example:
pow((float)2,(float)k)

5. SobelFilter
We get:
:10: warning: unknown '#pragma OPENCL EXTENSION' - ignored
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
:35: error: no matching overload found for arguments of type 'int, int'
                outputImage[x + y * width] = hypot(Gx,Gy)/2;
Change it to:
hypot((float)Gx,(float)Gy)/2;

Resource errors:

Histogram:
The geniuses at ATI are using 32 KB of shared (local) memory, an amount only available on DX11 GPUs, so no Nvidia GPU can handle it. (Note also that with the fix below adjusted to use 16 KB it still doesn't work on Nvidia, so it seems you can't use exactly all of the memory, or it is my error.)
On their devices it works because local memory is backed either by the CPU (main memory, so no size problem), by 4xxx boards (emulated in global memory) or by 5xxx boards (native LDS memory, which is 32 KB).
The problem is that the host code reserves this much shared memory for the kernel:
BIN_SIZE*GROUP_SIZE*BIN_SIZE
with these values:
#define BIN_SIZE 256
#define GROUP_SIZE 128

Change that to:
#define BIN_SIZE 128
#define GROUP_SIZE 64

I think this has to be changed in both the host and the kernel code.
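
To make the sizes concrete, here is a small C++ sketch of the arithmetic. It rests on assumptions that are not confirmed by the sample source: the 32 KB figure matches BIN_SIZE * GROUP_SIZE at one byte per bin, so that is the formula used, and the 16 KB Nvidia limit is assumed for the GPUs of the time (the real value can be queried with CL_DEVICE_LOCAL_MEM_SIZE). The helper names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: bytes of local memory one work-group asks for,
// ASSUMING one byte per bin and one bin table per work-item.
constexpr std::size_t localBytes(std::size_t binSize, std::size_t groupSize) {
    return binSize * groupSize;
}

// ASSUMPTION: local memory available per work-group on the Nvidia GPUs tested.
constexpr std::size_t kAssumedNvidiaLocalMem = 16 * 1024;

// Strictly-less-than models the observation that a request of exactly
// 16 KB also fails on Nvidia.
constexpr bool fitsOnNvidia(std::size_t neededBytes) {
    return neededBytes < kAssumedNvidiaLocalMem;
}
```

Under these assumptions the original defines (256, 128) need 32768 bytes, a 16 KB variant such as (256, 64) sits exactly on the limit and is rejected, and the fixed defines (128, 64) need 8192 bytes, which fits.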

Mandelbrot

This one breaks on Nvidia because of subtle differences between the implementations.

The sample does a crazy thing: it launches 65536 threads in 1D with work-groups of one element.
I don't know whether the problem is exceeding the maximum global size, the global size being 1D, or
exceeding the available registers.

Fixed by changing the resolution from 256x256 to 128x128:
width = 256->128

It uses local groups of 1 (vs. warps of 32), and it seems to fail because of the grid size, since 128 with a local size of 1 works.
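
One possible reading of this failure, sketched in C++: with a local size of 1 every work-item is its own work-group, so 256x256 gives 65536 work-groups in a single dimension. CUDA hardware of that generation limits a grid to 65535 blocks per dimension; whether the OpenCL driver runs into that same limit here is an assumption, not something the log proves. The names below are hypothetical.

```cpp
#include <cstddef>

// Number of work-groups in a 1D launch
// (the global size is expected to be a multiple of the local size).
constexpr std::size_t workGroups(std::size_t globalSize, std::size_t localSize) {
    return globalSize / localSize;
}

// ASSUMPTION: per-dimension limit on work-groups, taken from the CUDA grid
// limit of the GPUs of the time; the OpenCL driver may behave differently.
constexpr std::size_t kAssumedMaxGroupsPerDim = 65535;

// True if a width x height image, flattened to 1D, stays under the assumed limit.
constexpr bool launchFits(std::size_t width, std::size_t height, std::size_t localSize) {
    return workGroups(width * height, localSize) <= kAssumedMaxGroupsPerDim;
}
```

Under that assumption 256x256 with local size 1 is one work-group over the limit, while 128x128 gives 16384 work-groups and fits, which matches the fix above. A larger local size at 256x256 would also bring the count down, but that variant is not something tested in this post.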

Failed errors:

I have looked at all the samples with wrong results and I don't know where the errors are.
Note that they are also the most complicated ones.
I suspect they are due either to bugs in the AMD code that only show up on different hardware (scheduling differences) or to assumptions about threads executing in groups (warps vs. wavefronts).

I have also learned that the Nvidia OpenCL implementation has no problem handling work-groups of size 4 and 1 (which are not multiples of the warp size, 32). I think CUDA supports this too.
Note also that this is very inefficient, and likewise on high-end AMD GPUs with a wavefront of 64 (see the post by Micah in the AMD Stream forums).
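
To put a number on "very inefficient", here is a simplified C++ model. It assumes that each work-group occupies whole warps or wavefronts and that the unused lanes simply idle; real hardware scheduling is more involved, and the function name is hypothetical.

```cpp
// Fraction of SIMD lanes doing useful work when a work-group of `groupSize`
// work-items runs on hardware that executes `simdWidth` lanes in lockstep.
// Simplified model: the group is padded up to a whole number of
// warps/wavefronts and the padding lanes are wasted.
constexpr double laneUtilization(unsigned groupSize, unsigned simdWidth) {
    return static_cast<double>(groupSize) /
           static_cast<double>(((groupSize + simdWidth - 1) / simdWidth) * simdWidth);
}
```

In this model a work-group of 1 uses 1/32 of a warp and 1/64 of a wavefront, and a work-group of 4 uses 1/8 of a warp, while sizes that are multiples of the SIMD width reach full utilization.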


So finally:


Failed
======
AESEncryptDecrypt
RadixSort
SobelFilter


Final results
========
Running AESEncryptDecrypt
Failed
Running BinarySearch
Passed!
Running BinomialOption
Passed!
Running BitonicSort
Passed!
Running BlackScholes
Passed!
Running DCT
Passed!
Running DwtHaar1D
Passed!
Running EigenValue
Passed!
Running FastWalshTransform
Passed!
Running FloydWarshall
Passed!
Running HelloCL
HelloCL!
Creating a context
Getting device info
Loading and compiling CL source
Running CL program
Done
Passed!
Running Histogram
Passed!
Running Mandelbrot
Passed!
Running MatrixMultiplication
Passed!
Running MatrixTranspose
Passed!
Running MersenneTwister
Passed!
Running MonteCarloAsian
Passed!
Running NBody
Passed!
Running PrefixSum
Passed!
Running QuasiRandomSequence
Passed!
Running RadixSort
Failed
Running RecursiveGaussian
Passed!
Running Reduction
Passed!
Running ScanLargeArrays
Passed!
Running SimpleConvolution
Passed!
Running SobelFilter
Failed


