More news! ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

I have some leftover news and some new items:
UPDATE:

1. AMD SKA (Stream KernelAnalyzer) lets you get AMD IL without having an AMD GPU, and also shows the tex:ALU ratio and other info for all AMD GPUs at the same time.
2. The AMD SDK ships the utils source, so now both the Nvidia and AMD OCL SDKs can be compiled in VS2010!
3. gDEBugger 5.5 doesn't detect the AMD perf counters with Catalyst 10.2; I think it worked with the 9.12 hotfix. It is not working with the 10.3 beta either.

*See next post http://oscarbg.blogspot.com/2010/02/parallel-algorithms-avaiable-on.html
*Fermi X2 on track, possible launch date is May!
*In 2-3 weeks we will have the 5830 (a high-perf, low-budget card), and the 2GB, 6 mini-DP 5870 card arrives on 11 March!
*A Catalyst 10.3 beta leak is available, go search for it! (8.71.3, CAL 556)
*GPU Computing Gems call
*matmul by hazeman:
It's an assembly->C port similar to the 1 TFLOPS matmul CAL example. Too bad it uses its own C->IR compiler, but would it be an easy port to OCL? And what about perf?
*bernaclejunior is doing a good job on sort and sparse matvec in OCL, DC..
He claims nearly 400 Mkeys/s, versus less than 200 Mkeys/s for state-of-the-art Nvidia sorting on a GTX285.
Reportedly Lee Hows also has fast code working!
Some intermediate code has been posted on the XNA and AMD forums, but it is still not the best..
*High-perf matvec mul OCL code from Bealto (tested on AMD and Nvidia).
*I have tested the cubin-optimized matmul code and I get 480 GFLOP/s, not bad, up from 380 GFLOP/s.
I have also seen that the Tesla computing driver does not support overclocking in EVGA Precision.
Also, GPU-Z and EVGA Precision do not read the core and memory speeds, nor GPU usage and memory info; anyway,
temperature and fan speed are OK.
The profiling run is very long, so (tested on VC2010 RC1)
change this in autoprofile:

profile_sgemm_square("../method1/decuda_ldsb32_cudasm.cubin", "method1_variant_sgemmNN",
    &method1_DrvWrapper, cat(OUTPUT_DIR,"method1/variant_threads320.txt") );
profile_general_sgemm_square("../method6/decuda_ldsb32_cudasm.cubin", "method6_variant_sgemmNN",
    &method6_DrvWrapper, cat(OUTPUT_DIR,"method6/variant_threads320.txt") );
profile_general_sgemm_square("../method7/decuda_ldsb32_cudasm.cubin", "method7_variant_sgemmNN",
    &method7_DrvWrapper, cat(OUTPUT_DIR,"method7/variant_threads320.txt") );
profile_sgemm_square("../method8/decuda_ldsb32_cudasm.cubin", "method8_variant_sgemmNN",
    &method8_DrvWrapper, cat(OUTPUT_DIR,"method8/variant_threads256.txt") );

These variants are the fastest, and method 1 is the best (480 GFLOP/s). Also set:

for( n1 = 32 ; n1 <= 4096 ; n1+=96)

in place of

for( n1 = 5 ; n1 <= 4096 ; n1++)

The result is roughly 100x faster testing in:

->profile_general_sgemm_square (profile_general_sgemm_suqare.cpp, profile_sgemm_suqare.cpp)
->profile_CUBLAS_overN

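As a rough sanity check on the "100x" figure, the sketch below counts how many matrix sizes each sweep visits. The helper name `count_sizes` is hypothetical and not part of the profiling code; it only mirrors the two loop headers quoted above.

```c
#include <stdio.h>

/* Hypothetical helper: counts the sizes visited by
 *   for (n = start; n <= stop; n += step)
 * It does not run any SGEMM, it only counts loop iterations. */
long count_sizes(long start, long stop, long step)
{
    long count = 0;
    for (long n = start; n <= stop; n += step)
        count++;
    return count;
}

/* Prints the number of sizes tested by the coarse and the fine sweep. */
void print_sweep_comparison(void)
{
    long coarse = count_sizes(32, 4096, 96); /* n1 = 32, 128, ..., step 96 */
    long fine   = count_sizes(5, 4096, 1);   /* n1 = 5, 6, ..., 4096      */
    printf("coarse sweep: %ld sizes, fine sweep: %ld sizes\n", coarse, fine);
}
```

The coarse sweep visits 43 sizes against 4092 for the fine one, about 95x fewer, which is in line with the speedup quoted above. The time per size grows with the matrix dimension, so the wall-clock saving need not match the count ratio exactly.
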
* I have also tested the sparse voxel demo and fixed it for TCC, but building the 1 GB samples crashes on the
ball example: the x32 release exe runs out of memory, and the x64 build crashes anyway, so I have to fix it.
I also found the Sibenik and Fairy scenes, but I don't know how to build the sibenik-d example, which is displacement
mapped using bump map textures(?)
I have to test it..

Antialiasing in Deferred shading GL code

New GL multivendor SM5.0 info found in Catalyst 10.3:
*GL_EXT_tessellation_shader

gl_TessCoord gl_TessLevelOuter gl_TessLevelInner

*GL_EXT_shader_subroutine
*GL_EXT_gpu_shader5

memoryBarrier bitCount findLSB findMSB bitfieldReverse bitfieldInsert bitfieldExtract floatBitsToInt floatBitsToUint intBitsToFloat uintBitsToFloat

*GL_EXT_gpu_shader_fp64
new in 10.3:
*GL_EXT_shader_atomic_counters
GL_MAX_ATOMIC_COUNTERS_EXT

glResetAtomicCounter
check fail: index must be a constant in atomic counter functions
gl_MaxAtomicCountersEXT
atomicCounterIncrementEXT atomicCounterDecrementEXT atomicCounterEXT
imageAtomicAdd imageAtomicSub imageAtomicMin imageAtomicMax imageAtomicIncWrap imageAtomicDecWrap imageAtomicAnd imageAtomicOr imageAtomicXor imageAtomicExchange imageAtomicCompSwap

GL_EXT_texture_compression_bptc (replaces amd extensions)
GL_AMD_conservative_depth
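
For readers who have not met the integer built-ins listed under GL_EXT_gpu_shader5 above, here is a minimal C sketch of what a few of them compute on 32-bit unsigned values. It follows the semantics these functions have in the GLSL specification as far as I understand them (for example, findLSB and findMSB returning -1 for a zero input); the C function names are hypothetical and this is not driver code.

```c
#include <stdint.h>

/* bitCount: number of bits set to 1. */
int bit_count(uint32_t v)
{
    int n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* findLSB: index of the least significant set bit, -1 if v is 0. */
int find_lsb(uint32_t v)
{
    if (v == 0)
        return -1;
    int i = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        i++;
    }
    return i;
}

/* findMSB (unsigned case): index of the most significant set bit, -1 if v is 0. */
int find_msb(uint32_t v)
{
    int i = -1;
    while (v) {
        v >>= 1;
        i++;
    }
    return i;
}

/* bitfieldReverse: bit 0 becomes bit 31, bit 1 becomes bit 30, and so on. */
uint32_t bitfield_reverse(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* bitfieldExtract (unsigned case): bits [offset, offset + bits - 1] of v.
 * This sketch assumes offset >= 0, bits >= 0 and offset + bits <= 32. */
uint32_t bitfield_extract(uint32_t v, int offset, int bits)
{
    if (bits == 0)
        return 0;
    uint32_t mask = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (v >> offset) & mask;
}
```

For example, `bitfield_extract(0xABCDu, 4, 8)` picks out the byte 0xBC, and `find_msb(0x18u)` returns 4.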

OpenCL:
1. Pinned memory can be enabled on Nvidia via:

I use

host_mem = clCreateBuffer(context,
                          CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                          size, NULL, &ocl_err);
*ptr = (void*)clEnqueueMapBuffer(cmd_queue, host_mem,
                                 CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                 0, size, 0, NULL, &evt, &ocl_err);

to create page-locked memory using the NVIDIA driver, where it works fine. However, on my AMD card this makes no difference compared to malloc'ed memory.

The AMD guys confirm it is still not working.
2. The CVS with the ICD code is for Khronos members only; the spec has been updated.
3. The new CL headers at Khronos list the function addresses used by the ICD, and include cl_ext.h and cl_gl_ext.h:
http://www.khronos.org/registry/cl/
4. OpenCL (ATI Stream SDK 2.01) on Ubuntu 9.10:

You indeed have to boot with "nopat" or use Catalyst 10.2, when it becomes available. You need CAL version >= 1.4.553 to get this working without the "nopat" option.

XvBA and other linux video decoding updates:
*vaapi guy working on Crystal HD support?
For Crystal HD demos:
Crystal HD SDK from GIT, as of 2010/02/15.

*at least supported in basic samples
*xvba now working for vlc 1.1git and gnash via updates for xvba-vaapi and gnash

Status of XvBA:
Works with MPlayer (ASS subtitles included), VLC and Gnash!
Issues:
1. First, decoding is broken on 5xxx cards.
2. Deinterlacing is broken in XvBA. It's the second most critical bug that has to be fixed by the end of April.
(Only bob deinterlacing at this time. More elaborate deinterlacers are not, and won't be, exposed to the public builds of xvba-video.)

Changelog:

Version 0.6.5 - 08.Feb.2010
* Add brightness/contrast/hue/saturation display attributes
* Fix vaPutSurface() window resize. e.g. when switching to full-screen mode
* Allow vaPutSurface() to render to multiple drawables from a single surface

Notes:
- My ProcAmp adjustments are probably not fully correct. e.g. hue doesn't preserve luminance yet. Besides, this uses an extra FBO.
- The last change works around a bug in the driver and now makes it possible to use VA-API acceleration with Gnash with the AGG renderer. However, this exposes another performance problem (flickering in windowed mode) of the driver. You can work around that with XVBA_VIDEO_PUTSURFACE_FAST set to "yes" or "1". The semantics are not fully equivalent and can cause problems, hence it's disabled by default though it's designed to work with Gnash and MPlayer.
There is already native VA-API support for G45. At this time, it only does MPEG-2 VLD, i.e. full video decode. Intel is working on H.264 support and this should be available by Q2. I don't think there is any H.264 video decoding at Gallium3D level yet, so VDPAU / VA-API support would be useless at this time.

Version 0.6.6 - 11.Feb.2010
* Fix XvBA objects destruction for fglrx >= 8.70.3
* Fix vaPutImage() to a surface used for decoding
* Fix vaGetImage()/vaPutSurface() with surface dimensions not a multiple of 16
* Fix rendering of VA subpictures that were previously deassociated

The third change is actually two different workarounds for a single and major flaw in XvBA. I have not fully regression tested but this looks OK for MPlayer, Gnash and VLC. This should fix Kano problems.
The fourth change is a fix for MPlayer/VA-API with ASS support, and that I will probably upload tomorrow. I have to check against the latest Intel drivers first. NVIDIA is already fine.
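
The "not a multiple of 16" fixes above concern surfaces such as 1920x1080, where the height does not divide evenly by 16. A plausible reason, stated here as an assumption and not taken from the xvba-video source, is that the video codecs involved work on 16x16 macroblocks, so decoded surfaces are padded up to the next multiple of 16. A minimal sketch of that rounding, with hypothetical helper names:

```c
/* Round a dimension up to the next multiple of 16.
 * Assumption for illustration: surfaces are padded to whole 16x16 macroblocks. */
unsigned align16(unsigned x)
{
    return (x + 15u) & ~15u;
}

/* Number of 16x16 macroblocks needed to cover a width x height picture. */
unsigned macroblock_count(unsigned width, unsigned height)
{
    return (align16(width) / 16u) * (align16(height) / 16u);
}
```

With these helpers, a 1920x1080 picture pads to 1920x1088, so any copy or render path that confuses the visible height with the padded height is working with the wrong surface size.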

With this mplayer-vaapi snapshot and xvba 0.6.6 ASS works!

Version 0.6.7 - 18.Feb.2010
* Use fail-safe values for H.264 videos encoded over HP@L4.1
* Fix hue rotation to preserve luminance
* Fix internal contrast range to [ 0.0f .. 10.0f ]
* Fix rendering of multiple subpictures per surface
* Fix vaCopySurfaceGLX() for surfaces with dimensions not a multiple of 16

- The first change ensures that we don't crash or do weird things if we throw unsupported H.264 contents to the decoder. Well, it tries to get things on a safer side, without really fixing it.

- The ProcAmp changes are probably still not correct but this looks better for contrast and hue rotation.

- The fourth change fixes rendering of multiple subpictures per surface. In particular, you can now have OSD + EOSD + ProcAmp adjustment bars (3 subpictures) in MPlayer without crashing the application.

- The last change is a workaround for a serious XvBA flaw, now implemented in vaCopySurfaceGLX(). e.g. for mplayer -vo vaapi:gl -va vaapi. As a side effect, this would also work around another limitation in the future iteration (0.6.8) whereby only GL_BGRA textures are supported at this time.

Mplayer vaapi

Version 2010.02.12
* Fix YV12 rendering for SW codecs
* Add EOSD support (ASS subtitles)
* Add compatibility with original VA-API 0.29
* Add support for -geometry +xxx+yyy (Adam Strzelecki)

For EOSD & AMD, you need xvba-video >= 0.6.6.

Friday, 19 February 2010
