GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Thursday, 25 February 2010

News 25/2!

Posted on 07:46 by Unknown
*GPU-Z 0.3.9 fixes OpenCL reporting on ATI cards!
*The Wind Top desktop has a 24-inch 3D 120 Hz Full HD multitouch monitor! It seems to be the first!
Together with the Dell U2711 you get everything I want from monitors in just two monitors
(well, the Dell only adds 10-bit color, 27 inches and 2560x1440 resolution)!

The Wind Top desktops have HD (1080p resolution) displays ranging from 19 to 24 inches. At the top of the list are the Wind Tops AE2280 and AE2420, 22- and 24-inch multi-touch displays respectively, equipped with processors up to Intel Core i7. The 24-inch model features a 120Hz LED display that pairs with 3D shutter glasses. (That 3D trend isn’t dying off so fast.)

Nexus and C#
Yes, you can use Parallel Nsight/Nexus to debug CUDA C kernels launched from applications written in C# or other CPU languages, but Nsight doesn't directly support the C# project type yet.
So to use CUDA.NET with Nsight, you'll need to create a dummy C++ project whose 'command' in your Nexus User Properties points to your C# executable.
Then choose Start CUDA Debugging from the Nexus menu in Visual Studio, and you should be off and running. AFAIK, you'll still need to program the actual GPU code in CUDA C.

Pages with GPU computing stuff!
Have you seen the new page? http://developer.nvidia.com/object/gpucomputing.html
It has 3 guides with Fermi material!
The programming guide doesn't mention that GF100 is capable of simultaneous cuMemcpyDtoHAsync and cuMemcpyHtoDAsync transfers. I've added this to my good ol' concurrent bandwidth test and will be updating that in the near future.
Search for concurrent bandwidth test 1.1 for Fermi!

Still missing: the CUDA Developer Guide for Optimus Platforms.

__global__ function parameters are passed to the device:
* via shared memory and are limited to 256 bytes on devices of compute capability 1.x,
* via constant memory and are limited to 4 KB on devices of compute capability 2.0.
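
One way to catch an oversized argument list early is a size check on the host. This is only a sketch in plain C++, assuming the kernel arguments are packed into a single struct. `KernelArgs`, `BigKernelArgs`, `fits_param_space` and `require_param_space` are made-up names for illustration; only the 256-byte and 4 KB limits come from the list above, and the real marshalled size of separate arguments may differ from `sizeof` of a struct.

```cpp
#include <cassert>
#include <cstddef>

// Limits quoted above for __global__ function parameters.
constexpr std::size_t kParamLimitCC1x = 256;       // compute capability 1.x (shared memory)
constexpr std::size_t kParamLimitCC20 = 4 * 1024;  // compute capability 2.0 (constant memory)

// Hypothetical argument block for a small kernel.
struct KernelArgs {
    float* data;
    int count;
    float coeffs[16];
};

// Hypothetical larger block: too big for 1.x, fine for 2.0.
struct BigKernelArgs {
    float* data;
    float coeffs[128];
};

// Compile-time check: does this argument block fit under the given limit?
template <typename Args>
constexpr bool fits_param_space(std::size_t limit) {
    return sizeof(Args) <= limit;
}

static_assert(fits_param_space<KernelArgs>(kParamLimitCC1x), "KernelArgs exceeds the 1.x limit");
static_assert(fits_param_space<BigKernelArgs>(kParamLimitCC20), "BigKernelArgs exceeds the 2.0 limit");

// Runtime variant of the same check, for sizes only known at launch time.
inline void require_param_space(std::size_t bytes, std::size_t limit) {
    assert(bytes <= limit && "kernel arguments exceed the parameter space");
}
```

Arguments that do not fit would have to go through a buffer in device memory instead, with only the pointer passed as a parameter.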

others:
http://www.directx11tutorials.com/
[JumpToDX11-11] DirectCompute
http://vsts2010.net/220
http://www.opengpu.org/bbs/archiver/


Ivan Golubev's blog is the one to follow for crypto and integer ops on GPUs!
http://www.golubev.com/blog/
He says he has added the AMD IL v2 bitalign instruction for MD5 and SHA1 cracking on 5xxx GPUs, and he has a post estimating the performance of even Fermi GPUs.
Search for ighashgpu 0.70, which has this support. To test MD5 and SHA1 performance:
ighashgpu.exe /h:96b13dbbc9f3bc569ddad9745f64b9cdb43ea9ae /t:sha1 /c:sd /max:7
ighashgpu.exe /h:cbe1d6d5800ec1e03a5f2a64882a0d41 /t:md5 /c:sd /max:7
In a post from around the end of January you can also find the SSE code used in his program.
Versus CUDA:
You should be able to implement bit rotations using the bit-align instruction introduced with Direct3D 11 and supported on both Fermi and Cypress (computes ((a:b) >> c) & 0xffffffff, where a:b is the concatenation of two 32-bit operands).
This adds nothing to the "NVIDIA vs. AMD" debate, but should provide a nice further improvement compared to the previous generation.
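
The rotation trick can be modelled on the CPU. The sketch below is plain C++ and not GPU code: it emulates the formula quoted above, ((a:b) >> c) & 0xffffffff, and builds 32-bit rotations from it. Treating `a` as the upper word and masking the shift to 0..31 are assumptions of this model, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Software model of bit-align: concatenate hi:lo into 64 bits, shift right,
// keep the low 32 bits. Masking the shift to 0..31 is an assumption here.
inline std::uint32_t bitalign(std::uint32_t hi, std::uint32_t lo, std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>(wide >> (shift & 31u));
}

// Rotate right by n: feed the same word in as both halves.
inline std::uint32_t rotr32(std::uint32_t x, std::uint32_t n) {
    return bitalign(x, x, n);
}

// Rotate left by n is a rotate right by (32 - n) mod 32.
inline std::uint32_t rotl32(std::uint32_t x, std::uint32_t n) {
    return bitalign(x, x, (32u - (n & 31u)) & 31u);
}

// Quick sanity checks for the model.
inline void check_rotations() {
    assert(rotl32(0x80000001u, 1u) == 0x00000003u);
    assert(rotr32(0x00000003u, 1u) == 0x80000001u);
}
```

MD5 and SHA1 rotate in almost every step, so replacing a shift, shift, or sequence with one instruction is where the gain would come from.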

Maybe some other tricks are possible...
For instance both G80 and Fermi support free binary negation of operands to logic instructions (allowing NOR, NAND, NXOR, ANDN...), and Fermi supports a left shift followed by an addition as a single instruction.
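
To see where negated operands show up in hashing code, here is a small plain C++ sketch of two MD5 round functions (as defined in RFC 1321) plus a shift-and-add helper. Whether a particular compiler really folds the negation or the shift into a single instruction is an assumption, not something this code proves; `shift_add` and `check_logic_helpers` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// MD5 round function F: (x AND y) OR ((NOT x) AND z).
// The (~x & z) term is an ANDN, a logic op with one negated operand.
inline std::uint32_t md5_f(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return (x & y) | (~x & z);
}

// MD5 round function I: y XOR (x OR (NOT z)), an OR with a negated operand.
inline std::uint32_t md5_i(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return y ^ (x | ~z);
}

// Left shift followed by an addition, the pattern Fermi is said to fuse.
inline std::uint32_t shift_add(std::uint32_t a, std::uint32_t shift, std::uint32_t b) {
    return (a << (shift & 31u)) + b;
}

// Quick sanity checks for the helpers.
inline void check_logic_helpers() {
    assert(md5_f(0xFFFFFFFFu, 0x12345678u, 0x9ABCDEF0u) == 0x12345678u);
    assert(md5_i(0u, 0u, 0u) == 0xFFFFFFFFu);
    assert(shift_add(3u, 2u, 1u) == 13u);
}
```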

Edit: also, there is always the MAD24 instruction for computations such as 5*i+1 (much faster than adds).
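
For reference, this is a plain C++ model of what a 24-bit multiply-add computes, following the semantics of CUDA's `__umul24` (multiply the low 24 bits of each operand, keep the low 32 bits of the product). The names `umul24`, `umad24` and `index_5i_plus_1` are made up for this sketch, and it says nothing about speed; it only shows when the result matches a full 32-bit multiply.

```cpp
#include <cassert>
#include <cstdint>

// Multiply the low 24 bits of a and b, keep the low 32 bits of the product.
inline std::uint32_t umul24(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t product =
        static_cast<std::uint64_t>(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
    return static_cast<std::uint32_t>(product);
}

// 24-bit multiply followed by a 32-bit add.
inline std::uint32_t umad24(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return umul24(a, b) + c;
}

// The 5*i+1 example from the text. The result only equals the full 32-bit
// expression while i fits in 24 bits, hence the precondition.
inline std::uint32_t index_5i_plus_1(std::uint32_t i) {
    assert(i <= 0xFFFFFFu);
    return umad24(5u, i, 1u);
}
```

Bits above the 24th are silently ignored, so this is only safe for indices and counters known to stay small.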

Benchmark Reviews has NVIDIA nTeresting: 22 February 2010!

Limitations in OpenCL
1. Can I include C inline assembly code in my OpenCL code?
2. Does OpenCL support addition and subtraction with carry?
AMD currently also has these limitations:
No pinned memory!
It uses one UAV for all allocations, so maximum usage is 256 MB!

Nvidia does not have these two limitations, not even through DirectCompute!
Regarding the two OCL limitations: the CAL++ author has the first on his TODO list, and the second is an assembly instruction on 5xxx, so the author can add it once it is available in AMD IL!
Also, for Nvidia through CUDA there is an ADDC-enabled compiler referenced in previous posts, and
inline assembly is unofficially supported in CUDA!
With Nvidia OCL you can modify the PTX code on the fly, add addc instructions and feed the result back in!
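
Where no add-with-carry instruction is exposed, the carry can be recovered by comparison. The sketch below is plain C++ (the same unsigned arithmetic should carry over to OpenCL C, though that is an assumption I have not tested on a device). It is the portable fallback, not the hardware ADDC path, and `AddResult`, `addc` and `add_limbs` are names made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct AddResult {
    std::uint32_t sum;
    std::uint32_t carry;  // 0 or 1
};

// 32-bit add with carry-in and carry-out, using only wrapping unsigned adds.
// An unsigned sum wrapped around exactly when it is smaller than an operand.
inline AddResult addc(std::uint32_t a, std::uint32_t b, std::uint32_t carry_in) {
    const std::uint32_t partial = a + b;
    const std::uint32_t c1 = partial < a ? 1u : 0u;
    const std::uint32_t sum = partial + carry_in;
    const std::uint32_t c2 = sum < partial ? 1u : 0u;
    return AddResult{sum, c1 | c2};  // both carries cannot be set at once
}

// Multi-word addition, least significant limb first.
template <std::size_t N>
std::array<std::uint32_t, N> add_limbs(const std::array<std::uint32_t, N>& x,
                                       const std::array<std::uint32_t, N>& y) {
    std::array<std::uint32_t, N> out{};
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const AddResult r = addc(x[i], y[i], carry);
        out[i] = r.sum;
        carry = r.carry;
    }
    return out;  // the final carry is dropped (result wraps modulo 2^(32*N))
}

// Quick sanity check: 0xFFFFFFFF + 1 wraps to 0 with a carry out.
inline void check_addc() {
    const AddResult r = addc(0xFFFFFFFFu, 1u, 0u);
    assert(r.sum == 0u && r.carry == 1u);
}
```

Each limb costs two compares on top of the adds, which is exactly the overhead a native carry flag would remove.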


How to wait for kernel completion without CPU usage (from Golubev's blog):
CUDA: create the context with CU_CTX_BLOCKING_SYNC.
CAL: there is an undocumented feature, calCtxWaitForEvent.
True, ATI has played a dirty trick again: GPU kernels compiled with Catalyst 9.12 are 10% slower on RV8x0 and somewhere around 2-3 times slower on RV7x0. This happens because the ATI CAL compiler now aggressively unrolls absolutely everything, so a kernel grows to a few hundred KB and no longer fits in the cache... and performance collapses.

OpenCL for FreeBASIC: http://shiny3d.de/libs/fbOpenCL.zip
Remember there are also bindings for FreePascal and Delphi!



5 Questions -- Implementing a bunch of OpenCL tools

Texture sharing
I think you must use the D3D interop in OpenCL.

http://msdn.microsoft.com/en-us/library/ee418929%28VS.85%29.aspx

ID3D10Device::OpenSharedResource
To share a resource between a Direct3D 9 device and a Direct3D 10 device the texture must have been created using the pSharedHandle argument of CreateTexture. The shared Direct3D 9 handle is then passed to OpenSharedResource in the hResource argument.

The following code illustrates the method calls involved.

HANDLE sharedHandle = NULL; // must be NULL to create; pass a valid handle here to open in D3D9
pDevice9->CreateTexture(..., pTex2D_9, &sharedHandle);
...
// open the shared D3D9 texture on the D3D10 device, then fetch its texture interface
pDevice10->OpenSharedResource(sharedHandle, __uuidof(ID3D10Resource), (void**)(&tempResource10));
tempResource10->QueryInterface(__uuidof(ID3D10Texture2D), (void**)(&pTex2D_10));
tempResource10->Release(); // pTex2D_10 keeps its own reference
// now use pTex2D_10 with pDevice10

Textures being shared from D3D9 to D3D10 have the following restrictions.

    * Textures must be 2D
    * Only 1 mip level is allowed
    * Texture must have default usage
    * Texture must be write only
    * MSAA textures are not allowed
    * Bind flags must have SHADER_RESOURCE and RENDER_TARGET set
    * Only R10G10B10A2_UNORM, R16G16B16A16_FLOAT and R8G8B8A8_UNORM formats are allowed
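
The restrictions above are easy to forget, so a pre-flight check before creating the texture can help. This is a self-contained sketch in plain C++ that does not use any D3D headers: `SharedTextureDesc`, `Format` and `sharing_violations` are hypothetical stand-ins for the real texture description, and the rules are copied from the list above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the formats named in the restriction list.
enum class Format { R10G10B10A2_UNORM, R16G16B16A16_FLOAT, R8G8B8A8_UNORM, Other };

// Hypothetical description of a texture that is about to be shared.
struct SharedTextureDesc {
    int dimensions = 2;
    int mipLevels = 1;
    bool defaultUsage = true;
    bool writeOnly = true;
    int sampleCount = 1;  // more than 1 means MSAA
    bool bindShaderResource = true;
    bool bindRenderTarget = true;
    Format format = Format::R8G8B8A8_UNORM;
};

// Returns one message per violated restriction; empty means shareable.
inline std::vector<std::string> sharing_violations(const SharedTextureDesc& d) {
    std::vector<std::string> v;
    if (d.dimensions != 2) v.push_back("texture must be 2D");
    if (d.mipLevels != 1) v.push_back("only 1 mip level is allowed");
    if (!d.defaultUsage) v.push_back("texture must have default usage");
    if (!d.writeOnly) v.push_back("texture must be write only");
    if (d.sampleCount > 1) v.push_back("MSAA textures are not allowed");
    if (!d.bindShaderResource || !d.bindRenderTarget)
        v.push_back("bind flags must include SHADER_RESOURCE and RENDER_TARGET");
    if (d.format == Format::Other) v.push_back("format is not shareable");
    return v;
}

// Quick sanity check: the default description passes every rule.
inline void check_sharing_rules() {
    assert(sharing_violations(SharedTextureDesc{}).empty());
}
```

In real code the same checks would read the fields of the actual D3D texture description instead of this stand-in struct.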

Interesting post: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/
VLC 1.1 is using that approach, I think, and so is MPC Home Cinema, it seems!

Final round of Tesla Compute Cluster driver testing:
*CUDA H264 GPU video encoding works through MediaCoder
*vReveal works (clean video, sharpness)
issues:
stabilization: gray uniform colors
contrast: I get a pink color
*Badaboom fails with:
.GPU 0: ATI Radeon HD 5800 Series
FATAL:There is no GPU device supporting CUDA.
(Although the TCC driver supports CUDA)


Currently the global memory available is the value returned by CL_DEVICE_GLOBAL_MEM_SIZE in the device query. Full physical memory is expected to be available in one of the upcoming releases.
The global buffer uses 128-bit aligned addresses, while UAVs are byte aligned, and on the 5XXX series of cards you can have up to 9 UAVs per kernel. Through UAVs you can also do byte-addressable writes with the UAV arena, as well as atomic operations. None of these can be done on the global buffer path.
On the other hand, it is easier to burst using global memory, as it is an implicit 128-bit write versus an implicit 32-bit write on a UAV.