GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 7 November 2009

About CUDA 3.0 (I)

Posted on 01:59 by Unknown
Here is what I have found:

Floating-point atomic adds on Fermi:
float atomicAdd(float *address, float val)

Intriguingly, there is also a mutex:

extern void CUDARTAPI __cudaMutexOperation(int lock);
#define __cudaAtomicOperation(code) \
        __cudaMutexOperation(1);    \
        code                        \
        __cudaMutexOperation(0);

which is used to implement the atomic float add:
__device_func__(float __fAtomicAdd(float *address, float val))
{
  float old;

  __cudaAtomicOperation(
    old = *address;
    *address = old + val;
  )
  return old;
}
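
For comparison, the same fetch-and-add behaviour (store old + val, return old) is often built from a compare-and-swap loop when neither a native instruction nor a lock is available. What follows is only a host-side C11 sketch of that alternative technique, not the toolkit's implementation; the names float_fetch_add, float_to_bits and bits_to_float are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (hypothetical helpers). */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Atomically adds val to the float stored as raw bits at *address and
   returns the value that was there before the addition. */
static float float_fetch_add(_Atomic uint32_t *address, float val)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t old_bits = atomic_load(address);
    for (;;) {
        float old_val = bits_to_float(old_bits);
        uint32_t new_bits = float_to_bits(old_val + val);
        /* On failure old_bits is refreshed with the current contents,
           so the next iteration retries with up-to-date data. */
        if (atomic_compare_exchange_weak(address, &old_bits, new_bits))
            return old_val;
    }
}
```

Like the mutex version above, the function returns the old value, so callers can use it the same way.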

CUDA/GL texture interop
==============

set device: cudaGLSetGLDevice(int device)
create the GL texture
register the texture: cudaGraphicsGLRegisterImage
setup
while()
{
  map: cudaGraphicsMapResources
  get array: cudaGraphicsSubResourceGetMappedArray
  run the CUDA kernel
  unmap: cudaGraphicsUnmapResources
  draw with GL
}

New supported operating systems:

Ubuntu 9.04*
Mac OS X 10.6.0*
Mac OS X 10.6.1*

New API Features
--------------------------


  - Added the BLAS1 functions:
      * cublasZaxpy()
      * cublasZcopy()
      * cublasZswap()
  - Added the BLAS2 functions:
      * cublasDtrmv()
      * cublasCtrmv()
      * cublasCgemv()
      * cublasCgeru()
      * cublasCgerc()
      * cublasZtrmv()
      * cublasZgemv()
      * cublasZgeru()
      * cublasZgerc()
  - Added the BLAS3 functions:
      * cublasCtrsm()
      * cublasCtrmm()
      * cublasCsyrk()
      * cublasCsymm()
      * cublasCherk()
      * cublasZtrsm()
      * cublasZtrmm()
      * cublasZsyrk()
      * cublasZsymm()
      * cublasZherk()

  o Float16 (half) textures are supported in the runtime
    - The cudaCreateChannelDescHalf family of functions supports them in the
      C++-style API; in the C-style API, a suitable channel descriptor can be
      created via cudaCreateChannelDesc
    - Users should be aware that halves are promoted to floats during
      computation, so texture fetch functions can only return floats
    - Users can use intrinsics in device code to convert between fp16 and
      fp32 data
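
    To make the fp16-to-fp32 promotion concrete, here is a host-side C sketch
    of what such a conversion does with the bits. It is an illustration only,
    not the device intrinsics; the function name half_bits_to_float is
    hypothetical, and it assumes IEEE 754 binary16 input and a 32-bit IEEE
    float on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Expands an IEEE 754 half (given as raw 16 bits) to a 32-bit float. */
static float half_bits_to_float(uint16_t h)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;   /* 5-bit exponent, bias 15 */
    uint32_t man  = h & 0x3FFu;          /* 10-bit mantissa */
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                 /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit leading 1 appears. */
            uint32_t shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((127 - 15 + 1 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);   /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

    Every half value is exactly representable as a float, so this direction
    never rounds; only the float-to-half direction loses precision.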

  o Double3 and double4 vector types are supported in the runtime
    - This breaks code in which users had already defined these types
      themselves.

  o One-dimensional device-to-device copies now support streams.
    - cudaMemcpyAsync now applies the stream parameter for
      cudaMemcpyDeviceToDevice as well
    - The driver API equivalent is cuMemcpyDtoDAsync

  o Support for ELF binaries
    - ELF is generated by default by nvcc. For ptxas or fatbin, the -elf option
      is required.
    - Cubins are now binary files. Do not assume that they are ASCII text.

  o Testing applications for Fermi-readiness
    - Setting the environment variable CUDA_FORCE_PTX_JIT to 1 prevents all
      non-PTX user kernels from loading. If your application then fails to
      run, it was not compiled with PTX. Please see the programming guide for
      more information about compiling for different compute capabilities.

  o OpenGL texture interoperation

  o Batched 2D & 3D transforms are now supported in CUFFT, using the new
    cufftPlanMany() API. This is defined in cufft.h, as follows:

   cufftResult CUFFTAPI cufftPlanMany(cufftHandle *plan,
                                      int rank,
                                      int *n,
                                      int *inembed,    // Unused: pass NULL
                                      int istride,     // Unused: pass 1
                                      int idist,       // Unused: pass 0
                                      int *onembed,    // Unused: pass NULL
                                      int ostride,     // Unused: pass 1
                                      int odist,       // Unused: pass 0
                                      cufftType type,
                                      int batch);

   The arguments are:
       *plan        - The plan is returned here, as for other cufft calls
       rank         - The dimensionality of the transform (1, 2 or 3)
       *n           - An array of size [rank], describing the size of each
                      dimension
       type         - Transform type (e.g. CUFFT_C2C), as per other cufft calls
       batch        - Batch size for this transform

   Return values are as for all other cufftPlanXxx functions. Thus to plan
   a batch of 1000, 2D, double-precision, complex-to-complex transforms of
   size (128, 256), you would do:

       cufftHandle myplan;
       int n[2] = { 128, 256 };
       cufftPlanMany(&myplan, 2, n, NULL, 1, 0, NULL, 1, 0, CUFFT_Z2Z, 1000);

   Note that for CUFFT 3.0, the layout of batched data must be side-by-side
   and not interleaved. The inembed, istride, idist, onembed, ostride and
   odist parameters are for enabling data windowing and interleaving in a
   future version.
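
   As an illustration of the side-by-side layout, the following C sketch
   computes where an element of a given transform sits in the batch buffer.
   It assumes each transform is stored contiguously in row-major order (last
   dimension varying fastest), one transform after another; the function
   names batched_offset_2d and batched_elements_2d are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset of (i0, i1) inside transform number b, for a batch of
   2D transforms of size n0 x n1 stored one after another. */
static size_t batched_offset_2d(size_t b, size_t i0, size_t i1,
                                size_t n0, size_t n1)
{
    assert(i0 < n0 && i1 < n1);
    return b * (n0 * n1) + i0 * n1 + i1;
}

/* Total number of elements the whole batch occupies. */
static size_t batched_elements_2d(size_t batch, size_t n0, size_t n1)
{
    return batch * n0 * n1;
}
```

   For the batch of 1000 transforms of size (128, 256) planned above, this
   gives 32,768,000 complex elements, which at 16 bytes per double-precision
   complex value is 524,288,000 bytes for the input buffer.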

New Toolkit Features
--------------------------

  o nvcc
    - The command line option --host-compilation=C is no longer supported.
      nvcc emits a warning and switches back to C++. This option will
      eventually be removed altogether

  o CUDA GDB known issues:
    - Please see the "Known Issues" section in the CUDA_GDB_v3.0.pdf User Manual.

  o Windows DLL Naming Conventions
    - Each DLL now specifies the machine type, the toolkit version number, and
      the build number in its filename.
    - For example, cudart32_30_4.dll would be the 32-bit build of 3.0 Cudart
      with a build number of 4.
    - The build number of the final release will always be greater than the
      build number of the beta release.
    - The corresponding .lib files do not have any extra naming decoration, so
      you can continue linking your applications the same way.
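
    A small C sketch of the naming scheme, generalised from the single
    cudart32_30_4.dll example above. The helper format_dll_name and its
    parameters are hypothetical, and it assumes the version is written as the
    major and minor digits run together.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<lib><bits>_<major><minor>_<build>.dll" into out.
   Returns 0 on success, -1 if the buffer is too small. */
static int format_dll_name(char *out, size_t out_size, const char *lib,
                           int machine_bits, int major, int minor, int build)
{
    assert(out != NULL && lib != NULL);
    int n = snprintf(out, out_size, "%s%d_%d%d_%d.dll",
                     lib, machine_bits, major, minor, build);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

    Calling format_dll_name(name, sizeof name, "cudart", 32, 3, 0, 4)
    reproduces the file name from the release notes.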

  o Separate Library for Runtime Device Emulation
    - Cudart has now been split up into two libraries. Link against Cudartemu
      for device emulation, in the same way that Cublasemu/Cufftemu were
      previously used.

--------------------------------------------------------------------------------
Bug Fixes
--------------------------------------------------------------------------------

  o The asynchronous memcpy routines require the user to pass pinned memory
    allocations for any host pointers. In Cuda 2.1, 2.2, and 2.3, no error was
    returned if you used non-pinned memory with the NULL stream in some
    Host-to-Device memcpy operations. This release adds back the appropriate
    error check and returns cudaErrorInvalidValue or CUDA_ERROR_INVALID_VALUE
    when an application uses non-pinned memory in such a transfer.

  o Both the cudaEventQuery() and cudaStreamQuery() functions have been altered
    such that they no longer show first-chance exceptions when
    cudaErrorNotReady would be returned. This eliminates an issue where users
    could not turn on exception debugging in Visual Studio for applications
    that used these API calls.

= Known Issues =


CUDA Visual Profiler is non-functional on MacOS in this release.  This will be resolved in the production release.


The new cuda-memcheck utility is missing from the CUDA Toolkit packages for Linux 64-bit systems.  It is included as a separate package called cuda-memcheck_3.0beta1_linux64.tar.gz and should be installed by:

1. download the archive and unpack it with tar xfz
2. copy cuda-memcheck to the location where you want it installed
3. before running it, set LD_LIBRARY_PATH.
  If you use bash:  export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64


The OpenCL Visual Profiler for 64-bit Windows and Linux systems is available as a separate package.  To install it, do the following:

  For Windows 
  1. download and unpack openclprof_3.0-beta1_win_64.zip
  2. follow steps in OpenCL_Visual_Profiler_Release_Notes.txt under 
     the top level directory.

  For Linux
  1. download and unpack openclprof_3.0-beta1_linux_64.tar.gz
  2. follow steps in OpenCL_Visual_Profiler_Release_Notes.txt under 
     the openclprof directory.


cuda-gdb does not support debugging just-in-time (JIT) compiled PTX kernels in this release.


You may notice some image corruption when using OpenGL interop in multi-GPU systems where the GPU used for computation is different than the GPU used for graphics.  This will be resolved in the production release.


The RadixSort SDK code sample does not run on Linux32.


The simpleD3D9Texture sample does not run on Windows 32/64.



Questions I have:

Using DP: is the cache in hardware?
Mac 64-bit?
OpenCL support using two command_queues for:
  • Multiple Copy Engine support
  • Concurrent Kernel Execution
Also, what about selecting the shared memory size in CUDA and OpenCL?
For Fermi usage..
OpenCL and ECC reporting..
No updated docs

PTX 1.5 and 2.0 docs
Fermi how-to?
predication
use host calls
alloc mem in kernel
recursion