GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Friday, 19 February 2010

More news!

Posted on 08:08 by Unknown
I have some leftover news and some fresh news:
UPDATE:
1. AMD SKA lets you get the AMD IL without having an AMD GPU, and also shows the tex:ALU ratio and other info for all AMD GPUs at the same time.
2. The AMD SDK ships the utils source, so now both the Nvidia and AMD OCL SDKs can be compiled in VS2010!
3. gDEBugger 5.5 doesn't detect the AMD perf counters with 10.2 (I think it worked with the 9.12 hotfix). It is not working with the 10.3 beta either.
*See next post http://oscarbg.blogspot.com/2010/02/parallel-algorithms-avaiable-on.html
*Fermi X2 is on track; the possible launch date is May!
*The 5830 (high-perf, low-budget card) arrives in 2-3 weeks, and the 2GB 5870 card with 6 miniDP outputs on 11 March!
*A Catalyst 10.3 beta leak is available, go search for it! (8.71.3, CAL 556)
*GPU Computing Gems call
*matmul by hazeman:
It's an assembly->C port similar to the 1 TFLOPS matmul CAL example. The bad part is that it uses its own C->IR compiler, but would it be an easy port to OCL? And what about perf?
*bernaclejunior is doing a good job on sort and sparse matvec on OCL, DC..
He claims nearly 400 Mkeys/s, versus less than 200 Mkeys/s for state-of-the-art Nvidia sorting on a GTX285.
Also, Lee Hows reportedly has fast code working!
Some intermediate code has been posted on the XNA and AMD forums, but it's still not the best..
*High-perf matvec mul OCL code from Bealto (tested on AMD and Nvidia).
*I have tested the cubin-optimized matmul code and I get 480 GFLOP/s, not bad, up from 380 GFLOP/s.
I have also seen that the Tesla computing driver doesn't support overclocking in EVGA Precision.
GPU-Z and EVGA Precision also can't read the core speed, mem speed, GPU usage or mem info; anyway,
temperature and fan speed are OK..
The full test run takes very long, so (tested on VC2010 RC1)
change this in autoprofile:
profile_sgemm_square("../method1/decuda_ldsb32_cudasm.cubin", "method1_variant_sgemmNN", &method1_DrvWrapper, cat(OUTPUT_DIR,"method1/variant_threads320.txt") );
profile_general_sgemm_square("../method6/decuda_ldsb32_cudasm.cubin", "method6_variant_sgemmNN", &method6_DrvWrapper, cat(OUTPUT_DIR,"method6/variant_threads320.txt") );
profile_general_sgemm_square("../method7/decuda_ldsb32_cudasm.cubin", "method7_variant_sgemmNN",
&method7_DrvWrapper, cat(OUTPUT_DIR,"method7/variant_threads320.txt") );
profile_sgemm_square("../method8/decuda_ldsb32_cudasm.cubin", "method8_variant_sgemmNN",
&method8_DrvWrapper, cat(OUTPUT_DIR,"method8/variant_threads256.txt") );
These variants are the fastest, and method 1 is the best (480 GFLOP/s). Also set:
for( n1 = 32 ; n1 <= 4096 ; n1+=96)
in place of:
for( n1 = 5 ; n1 <= 4096 ; n1++)


The result is a roughly 100x faster test run. The loops are in:
->profile_general_sgemm_square (profile_general_sgemm_suqare.cpp, profile_sgemm_suqare.cpp)
->profile_CUBLAS_overN
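
To see where the ~100x comes from, here is a minimal C sketch. It assumes the change replaces the n1++ sweep starting at 5 with the n1+=96 sweep starting at 32, and that an n x n SGEMM costs about 2*n^3 flops. The function names are mine, not from the profiling code.

```c
#include <assert.h>
#include <stdio.h>

/* Number of matrix sizes visited by: for (n = first; n <= last; n += step). */
int count_sizes(int first, int last, int step)
{
    assert(step > 0);
    int count = 0;
    for (int n = first; n <= last; n += step)
        count++;
    return count;
}

/* Total work of the sweep, taking one n x n SGEMM as 2*n^3 flops. */
double total_flops(int first, int last, int step)
{
    assert(step > 0);
    double flops = 0.0;
    for (int n = first; n <= last; n += step)
        flops += 2.0 * (double)n * (double)n * (double)n;
    return flops;
}

/* Hypothetical helper: prints how much smaller the coarse sweep is. */
void print_sweep_comparison(void)
{
    int dense = count_sizes(5, 4096, 1);
    int coarse = count_sizes(32, 4096, 96);
    double ratio = total_flops(5, 4096, 1) / total_flops(32, 4096, 96);
    printf("sizes: %d -> %d, flop ratio: %.1f\n", dense, coarse, ratio);
}
```

The dense sweep visits 4092 sizes and the coarse one 43, so both the number of runs and the summed flops drop by a bit less than 100x.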

* I have also tested the voxel sparse demo and fixed it for TCC, but building the 1 GB samples crashes on
the ball example: the x32 release exe runs out of memory, and the x64 one crashes anyway, so I have to fix that..
I also found the Sibenik and Fairy scenes, but I don't know how to build the sibenik-d example, which is displacement
mapped using bump map textures(?)
I have to test it..



Antialiasing in Deferred shading GL code



new GL multivendor SM5.0 info found in 10.3:
*GL_EXT_tessellation_shader
gl_TessCoord gl_TessLevelOuter gl_TessLevelInner
*GL_EXT_shader_subroutine
*GL_EXT_gpu_shader5
memoryBarrier bitCount findLSB findMSB bitfieldReverse bitfieldInsert bitfieldExtract floatBitsToInt floatBitsToUint intBitsToFloat uintBitsToFloat
*GL_EXT_gpu_shader_fp64
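
If you don't have an SM5.0 driver at hand, the GL_EXT_gpu_shader5 bit built-ins listed above can be mimicked on the CPU. This is a minimal C sketch, assuming the EXT functions behave like the same-named GLSL 4.00 built-ins on 32-bit unsigned values. The snake_case names are mine and are not part of any GL header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* bitCount: number of bits set to 1. */
int bit_count(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }   /* clears the lowest set bit each pass */
    return n;
}

/* findLSB: index of the lowest set bit, -1 if x is 0. */
int find_lsb(uint32_t x)
{
    if (x == 0) return -1;
    int i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}

/* findMSB (unsigned flavour): index of the highest set bit, -1 if x is 0. */
int find_msb(uint32_t x)
{
    int i = -1;
    while (x) { x >>= 1; i++; }
    return i;
}

/* bitfieldReverse: bit 0 swaps with bit 31, bit 1 with bit 30, and so on. */
uint32_t bitfield_reverse(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) { r = (r << 1) | (x & 1u); x >>= 1; }
    return r;
}

/* Mask with the low `bits` bits set; handles bits == 32 without a bad shift. */
static uint32_t field_mask(int bits)
{
    return bits >= 32 ? 0xFFFFFFFFu : ((1u << bits) - 1u);
}

/* bitfieldExtract: bits [offset, offset+bits-1] moved down to bit 0. */
uint32_t bitfield_extract(uint32_t value, int offset, int bits)
{
    assert(offset >= 0 && bits >= 0 && offset + bits <= 32);
    if (bits == 0) return 0;
    return (value >> offset) & field_mask(bits);
}

/* bitfieldInsert: low `bits` bits of insert replace that field of base. */
uint32_t bitfield_insert(uint32_t base, uint32_t insert, int offset, int bits)
{
    assert(offset >= 0 && bits >= 0 && offset + bits <= 32);
    if (bits == 0) return base;
    uint32_t mask = field_mask(bits) << offset;
    return (base & ~mask) | ((insert << offset) & mask);
}

/* floatBitsToUint: reinterpret the float's bit pattern (assumes IEEE-754). */
uint32_t float_bits_to_uint(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

These are handy as a reference when checking shader output read back from the GPU.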
new in 10.3:
*GL_EXT_shader_atomic_counters
GL_MAX_ATOMIC_COUNTERS_EXT
glResetAtomicCounter
check fail: index must be a constant in atomic counter functions
gl_MaxAtomicCountersEXT
atomicCounterIncrementEXT atomicCounterDecrementEXT atomicCounterEXT
imageAtomicAdd imageAtomicSub imageAtomicMin imageAtomicMax imageAtomicIncWrap imageAtomicDecWrap imageAtomicAnd imageAtomicOr imageAtomicXor imageAtomicExchange imageAtomicCompSwap
GL_EXT_texture_compression_bptc (replaces amd extensions)
GL_AMD_conservative_depth

OpenCL:
1. Pinned mem can be enabled on Nvidia like this:
I use

/* Ask the driver to allocate the buffer in host-accessible memory. */
host_mem = clCreateBuffer(context,
                          CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                          size, NULL, &ocl_err);
/* A blocking map returns a host pointer to that buffer. */
*ptr = (void*)clEnqueueMapBuffer(cmd_queue, host_mem,
                                 CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                 0, size, 0, NULL, &evt, &ocl_err);

to create page-locked memory using the NVIDIA driver, where it works fine. However, on my AMD card this makes no difference compared to malloc'ed memory.
The AMD guys confirm it is still not working.
2. The CVS with the ICD code is for Khronos members; the spec has been updated.
3. The new CL headers at Khronos list the function addresses used by the ICD, and include cl_ext.h and cl_gl_ext.h:
http://www.khronos.org/registry/cl/ headers
4. OpenCL (Stream SDK 2.01) on Ubuntu 9.10:
You do have to boot with "nopat", or use Catalyst 10.2 when it becomes available. You need CAL version >= 1.4.553 to get this working without the "nopat" option.

XvBA and other linux video decoding updates:
*vaapi guy working on Crystal HD support?
For Crystal HD demos:
Crystal HD SDK from GIT, as of 2010/02/15.

*at least supported in basic samples
*xvba now working for vlc 1.1git and gnash via updates for xvba-vaapi and gnash

Status of XvBA:
Works with MPlayer (ASS subtitles included), VLC and Gnash!
Issues:
1. First, decode is broken on 5xxx..
2. Deinterlacing is broken in XvBA. It's the second most critical bug that has to be fixed by the end of April.
(Only bob deinterlacing at this time. More elaborate deinterlacers are not, and won't be, exposed to the public builds of xvba-video.)

Changelog:



Version 0.6.5 - 08.Feb.2010
* Add brightness/contrast/hue/saturation display attributes
* Fix vaPutSurface() window resize. e.g. when switching to full-screen mode
* Allow vaPutSurface() to render to multiple drawables from a single surface

Notes:
- My ProcAmp adjustments are probably not fully correct, e.g. hue doesn't preserve luminance yet. Besides, this uses an extra FBO.
- The last change works around a bug in the driver and now makes it possible to use VA-API acceleration with Gnash with the AGG renderer. However, this exposes another performance problem (flickering in windowed mode) of the driver. You can work around that with XVBA_VIDEO_PUTSURFACE_FAST set to "yes" or "1". The semantics are not fully equivalent and can cause problems, hence it's disabled by default, though it's designed to work with Gnash and MPlayer.
There is already native VA-API support for G45. At this time, it only does MPEG-2 VLD, i.e. full video decode. Intel is working on H.264 support and this should be available by Q2. I don't think there is any H.264 video decoding at Gallium3D level yet, so VDPAU / VA-API support would be useless at this time.

Version 0.6.6 - 11.Feb.2010
* Fix XvBA objects destruction for fglrx >= 8.70.3
* Fix vaPutImage() to a surface used for decoding
* Fix vaGetImage()/vaPutSurface() with surface dimensions not a multiple of 16
* Fix rendering of VA subpictures that were previously deassociated

The third change is actually two different workarounds for a single and major flaw in XvBA. I have not fully regression tested but this looks OK for MPlayer, Gnash and VLC. This should fix Kano problems.
The fourth change is a fix for MPlayer/VA-API with ASS support, and that I will probably upload tomorrow. I have to check against the latest Intel drivers first. NVIDIA is already fine.
With this mplayer-vaapi snapshot and xvba 0.6.6 ASS works!
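
Why do dimensions that are not a multiple of 16 keep showing up in these fixes? MPEG-2 and H.264 code pictures in 16x16 macroblocks, so a decoder surface is typically padded up to the next multiple of 16 (1080 rows become 1088). This is a minimal C sketch of that rounding only. It is not the actual xvba-video workaround, and the function names are mine.

```c
#include <assert.h>

/* Round value up to the next multiple of alignment (a power of two). */
unsigned align_up(unsigned value, unsigned alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Rows of padding below the visible picture for a 16-row macroblock grid. */
unsigned padding_rows(unsigned height)
{
    return align_up(height, 16) - height;
}
```

A 1920x1080 video therefore carries 8 extra rows that copy and render paths have to skip, while 1280x720 needs no padding at all.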
Version 0.6.7 - 18.Feb.2010
* Use fail-safe values for H.264 videos encoded over HP@L4.1
* Fix hue rotation to preserve luminance
* Fix internal contrast range to [ 0.0f .. 10.0f ]
* Fix rendering of multiple subpictures per surface
* Fix vaCopySurfaceGLX() for surfaces with dimensions not a multiple of 16

- The first change ensures that we don't crash or do weird things if we throw unsupported H.264 contents at the decoder. Well, it tries to get things on the safer side, without really fixing it.

- The ProcAmp changes are probably still not correct but this looks better for contrast and hue rotation.

- The fourth change fixes rendering of multiple subpictures per surface. In particular, you can now have OSD + EOSD + ProcAmp adjustment bars (3 subpictures) in MPlayer without crashing the application.

- The last change is a workaround for a serious XvBA flaw, now implemented in vaCopySurfaceGLX(), e.g. for mplayer -vo vaapi:gl -va vaapi. As a side effect, this would also work around another limitation in the future iteration (0.6.8), whereby only GL_BGRA textures are supported at this time.
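
On the "hue rotation that preserves luminance" item: the usual way to get that property is to work in a luma/chroma space and rotate only the chroma pair. This is a minimal C sketch of the idea, assuming U and V are centered on 0. It is not the xvba-video code, and the type and function names are mine. The caller passes the cosine and sine of the hue angle, so the sketch needs no math library.

```c
#include <assert.h>

typedef struct { float y, u, v; } yuv_pixel;   /* u, v centered on 0 */

/* Rotate the chroma vector (u, v) by the hue angle whose cosine and sine
   are cos_h and sin_h. Luma is copied through untouched. */
yuv_pixel rotate_hue(yuv_pixel in, float cos_h, float sin_h)
{
    /* cos^2 + sin^2 must be ~1, otherwise this would also scale saturation. */
    float norm = cos_h * cos_h + sin_h * sin_h;
    assert(norm > 0.999f && norm < 1.001f);

    yuv_pixel out;
    out.y = in.y;                               /* luminance preserved */
    out.u = in.u * cos_h - in.v * sin_h;
    out.v = in.u * sin_h + in.v * cos_h;
    return out;
}
```

Doing the same rotation on RGB values directly is what tends to shift brightness, which matches the earlier note that hue "doesn't preserve luminance yet".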

Mplayer vaapi
Version 2010.02.12
* Fix YV12 rendering for SW codecs
* Add EOSD support (ASS subtitles)
* Add compatibility with original VA-API 0.29
* Add support for -geometry +xxx+yyy (Adam Strzelecki)

For EOSD & AMD, you need xvba-video >= 0.6.6.