January 2010 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Saturday, 16 January 2010

GLES 2.0 (and 1.x) emulators..

Posted on 13:27 by Unknown

AMD: GLES 2.0 emulator for Windows
Nvidia: Tegra 2 has a GL ES 2.0 emulator for Windows (Linux?)
OpenKODE
PowerVR: ES 2.0 for Mac, Windows and Linux (ES 1.1 for Windows and Linux?)
iPhone emulator: GL ES 2.0 and 1.x, Mac
gDEBugger: GL ES 1.x emulator (ES?)

Posted in | No comments

OpenCL Nvidia DirectX (up to 11) extensions published..

Posted on 13:26 by Unknown

The week started with checking the AMD assembly of D3D11 compute shaders..
Now there is gDEBugger 5.4 with OpenGL 3.2 and Windows 7 support, plus Nvidia PerfSDK updates..
Nvidia has also published its OpenCL DirectX interop extensions.
Now I have to compare them with AMD's DirectX OpenCL extensions.


Some suggestions, questions and problems I have..

Posted on 13:23 by Unknown

Please fix these issues to make an almost perfect OpenCL SDK..

These are the things I most wish to see fixed:

improvements:
0. Support kernels with a loop containing a lot of MADs for testing peak flops: these currently get long compile times, while the same kernel compiles fast in CUDA..
1. Ship an up-to-date ICD compatible with the AMD one, i.e. fix the ICD so it also detects the AMD backend (or AMD should ship a fixed OpenCL ICD DLL)..
2. Expose
clGetGLContextInfoKHR(cl_context_properties *properties,
cl_gl_context_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret)

It is not in the headers or the .lib, and it is also not exported by the Khronos .dlls.

3. Publish the OpenCL port of the DirectCompute ocean demo shown at GTC09, i.e. are there plans to publish the OpenCL port of the DirectCompute ocean demo shown in the GTC OpenCL course?
4. Ship a driver compatible with the new Nvidia DirectX interop extensions
5. cl_khr_fp16 and cl_khr_3d_image_writes extensions?

ocl compiler bugs:

1. A bug in the ATI AES sample, see:

Thanks. Also, I've found a way to fix AESEncryptDecrypt sample to pass test on nvidia: just replace

CODE
unsigned char hiBitSet = (a & 0x80);
with
unsigned char hiBitSet = ((a>127)?128:0);
in AESEncryptDecrypt_Kernels.cl
It looks weird, but it works
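
To see why the two forms ought to be interchangeable, here is a minimal plain-C sketch. It is not taken from the AMD sample: the function names are made up for illustration, and the assumption that the high-bit test feeds a GF(2^8) doubling step (the usual "xtime" of AES) is mine. For every unsigned 8-bit value both expressions yield the same result, so a compiler that behaves differently for them is mishandling one of the forms.

```c
/* High-bit test written with a mask, as in the original sample line. */
unsigned char hi_bit_mask(unsigned char a) {
    return (unsigned char)(a & 0x80);
}

/* High-bit test written with a comparison, as in the workaround. */
unsigned char hi_bit_cmp(unsigned char a) {
    return (unsigned char)((a > 127) ? 128 : 0);
}

/* Doubling in GF(2^8) with the AES reduction polynomial (xtime).
   Assumption: this is the kind of step the high-bit test is used for. */
unsigned char xtime(unsigned char a, unsigned char hi_bit_set) {
    unsigned char r = (unsigned char)(a << 1);
    if (hi_bit_set) {
        r ^= 0x1b;
    }
    return r;
}

/* Returns 1 if both forms agree for all 256 byte values, 0 otherwise. */
int hi_bit_forms_agree(void) {
    for (int a = 0; a < 256; ++a) {
        if (hi_bit_mask((unsigned char)a) != hi_bit_cmp((unsigned char)a)) {
            return 0;
        }
    }
    return 1;
}
```

Calling hi_bit_forms_agree() on the host is a quick way to confirm that the replacement does not change the math, only what the device compiler has to digest.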

2. Apple FFT library, see: http://forums.nvidia.com/index.php?showtopic=153544
Take a look at fft_base_kernels.h, see line 4 of "baseKernels", the complexMul line.
The define seems to be too complicated for the Nvidia OpenCL compiler; I replaced the define with a function and it's now working:

CODE
float2 complexMul(float2 a,float2 B) { return (float2)(mad(-(a).y, (B).y, (a).x * (B).x), mad((a).y, (B).x, (a).x * (B).y));}
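
As a sanity check of that replacement function, here is a plain-C sketch of the same complex multiply. The struct cfloat2 is a stand-in for OpenCL's float2, and mad is written as an ordinary multiply-add; both are my assumptions so that the snippet can be compiled on the host without an OpenCL runtime.

```c
/* Stand-in for OpenCL float2 (assumption, host-side only). */
typedef struct {
    float x; /* real part */
    float y; /* imaginary part */
} cfloat2;

/* (a.x + i*a.y) * (b.x + i*b.y), with the same term order as the
   mad-based OpenCL function: real = -a.y*b.y + a.x*b.x,
   imag = a.y*b.x + a.x*b.y. */
cfloat2 complex_mul(cfloat2 a, cfloat2 b) {
    cfloat2 r;
    r.x = -a.y * b.y + a.x * b.x;
    r.y = a.y * b.x + a.x * b.y;
    return r;
}
```

For example, (1 + 2i) * (3 + 4i) gives -5 + 10i, which is easy to compare against the device result when tracking down a compiler problem like the one described.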

3. kernels without parameters don't compile

bugs in SDK:
1. The samples get a platform ID, but the parameter has to be set to NULL for them to work on non-Nvidia implementations (AMD's)..
or fix the function to set it to NULL first..
2. oclUtils: getdevice(i) checks the number of devices but returns wrong data if i == numdevices, because the check is if(i>numdevices) instead of if(i>=numdevices)..
3. shrUtils: findfilepath fails if you pass an absolute path "c:\.." because it prepends ".\"; as a workaround you have to add "" to the search paths..
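
The off-by-one in item 2 can be sketched as follows. get_device is a hypothetical stand-in, not the SDK helper itself, and the device table is modelled as a plain int array purely for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SDK helper: returns the i-th entry of a
   device table, or -1 when i is out of range. Valid indices are
   0 .. num_devices - 1, so the rejection test must be >=, not >. */
int get_device(const int *devices, size_t num_devices, size_t i) {
    if (i >= num_devices) { /* the reported bug used (i > num_devices) */
        return -1;
    }
    return devices[i];
}
```

With two devices, index 2 is rejected here, whereas a check written as i > num_devices would let it through and read past the end of the table.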

About the DX11 OIT demo: it crashes for me..
Hi, I have seen the DX11 OIT demo on the AMD forums..
Well, the demo crashes with:
DXGI_ERROR_INVALID_CALL
Failed to resize swap chaing..
I have

Windows 7 x64
AMD 5850
Catalyst 9.12 hotfix and 10.1
DirectX runtime august 2009

What's the problem..

update: answer from author:

Hi rtfss!

Yes, there was a serious bug around BGRA/RGBA formats. I don't know why
it works on Windows 7 32-bit.

Here is the fixed and slightly optimized demo (actually I am still
optimizing it as much as possible):

http://rapidshare.de/files/49006316/oit_dx11.zip.html

If it works, please let me know, and I re-upload it for public
community as soon as possible.

Some questions about VAAPI, VDPAU, XvBA?

I want to learn a lot about GPU video decoding on Linux, so I'm asking some questions..
Basically I am unsure whether to use/learn VDPAU or VAAPI, depending on these features:
*OpenGL interop overhead..
*Dual HD stream decode support..
*(this is your opinion) possible future support in the API for H.264 MVC (Multiview Video Coding)..

First, I read some time ago that VAAPI added GL interop, so it can now return all frames as OpenGL textures, right? I think I have read that
the AMD backend only works through OpenGL but has lower CPU usage, while the Nvidia OpenGL backend has some CPU overhead. Is that still right with current drivers?
Some perf figures would help..
I also tested the XvBA backend, first with 9.10 and then with fglrx 9.11: it didn't work with my 5850 but was OK with 4850 cards..
So does the XvBA backend work with the 5850 on the Catalyst 9.12 hotfix or the upcoming 10.1? And if not, is it an AMD driver issue or an issue in the AMD VAAPI backend?
More questions:
New cards like the AMD 5850, Nvidia GT 240 and Intel HD Graphics support dual stream decode. Is this exposed/supported in VAAPI, i.e. is the API capable of exposing such a hardware feature?
If not, is someone working on adding that support to VAAPI?
Does VDPAU also expose and accelerate dual HD streams on supported GPUs? If yes, it would reach parity with DXVA HD.
Also, assuming it is not under NDA, can someone tell me whether I can decode dual HD streams using VAAPI over XvBA, i.e. whether XvBA exposes that capability?

Also more "futuristic" things:
I have seen that an H.264 MVC (Multiview Video Coding) reference encoder/decoder already exists..
Nokia also ships an encoder/decoder..
I would like to encode some samples to MVC..
Anyway, I expect that VDPAU, with all of Nvidia's motivation around 3D Vision, will add support for it sometime this year..
Does someone plan to patch/improve VAAPI to expose that support, i.e. exposing VDPAU MVC support via VAAPI?
Also, does someone know whether FFmpeg has this support in trunk, or of any effort/patches for playing this codec?

The last question is more about expectations for GPU video encoding:
Nvidia ships CUVENC.DLL for Windows, providing GPU H.264 video encoding..
Now with Windows 7 it seems you get cross-vendor GPU H.264 video encoding via MFT, supported at least on Nvidia, I think..
Does someone know whether Nvidia is working on GPU video encoding for VDPAU, or whether VDPAU already exposes it?
If not, I think the latest VAAPI at least exposes the interfaces, right? I think it's hard to add, say, a backend that uses x264 or the H.264 reference encoder as an example..

It seems Broadcom has provided open source Linux drivers for Crystal HD that decode all HD formats, so do you plan to add a VAAPI backend for these cards?
It also seems they provide open source drivers for Mac, so the last question is..
how hard is it to get a Mac or Windows port of VAAPI?
Basically, from the same source using VAAPI, I would like to have GPU decoding on Windows via a DXVA VAAPI backend, and on Mac (at least on Snow Leopard) via its GPU decoding backend..


Why I want a tablet more than a netbook..

Posted on 05:31 by Unknown

Well, I'm going to argue why, for me, a tablet is better than a netbook..
For starters, screen size and resolution are the same, i.e. you can have at least 1024x600 at 9-10 inches..
There are also tablets using a panel that can act in an e-ink-like mode, so
there is no backlight hurting your eyes..
On top of this you have less weight and less space occupied, as there is no keyboard, and also less power -> smaller battery, etc..
Yeah, I know there are 10-11-12 inch netbooks with 13xx x 7xx pixels, but I think there will be tablets with such resolutions too..
First, with tablets you have touch support, which is more comfortable than a netbook; some netbooks have it too, but not all..
A lot of them also have all the phone gadgets: GPS, accelerometers, digital compass, WiFi, 3G, webcam, etc..
Some netbooks have some of these, but it's hard to find touch+GPS+3G etc. together, for example..
Some netbooks do have integrated TDT receivers, but you could add a USB stick to a tablet and port the Linux kernel support for it to the ARM OS?..
Well, I don't know if tablets have USB inputs, but I think so..

One area that was lacking was HD outputs and Full HD video decode/encode support..
Another area is performance in things like web surfing, i.e. CPU perf..

Also, if netbooks last 8 hours, tablets should last at least that long or more, as their CPUs consume 0.5 watts
vs 2-10 watts in netbooks..
CES has been hot on mobile stuff:

This, with Tegra 2, has all you want.. and no, I'm not the Nvidia CEO..

First, a dual-core ARM Cortex-A9 (out of order) with more cache and at more than 1 GHz..
In theory some 40-50% more perf from the Cortex-A9 core, 40-50% more frequency per core and
2x the cores, so you get up to about 4x perf (1.4 x 1.4 x 2 is about 3.9; 1.5 x 1.5 x 2 is 4.5), assuming web browsers load things multithreaded..

There is a video on YouTube showing a Cortex-A9 (dual core?) at 500 MHz vs an Atom, and web page load times are similar..

Tegra 2 also has realtime HD 1080p video decode and encode, a low-power MP3 chip and a camera processor (up to 12 MP).. Tegra 2 has Adobe AIR for magazines and Flash 10.1, I think for Full HD videos..

Also the GPU is 2x faster than the first Tegra

GLES 2.0 exts:

• GL_ARB_draw_buffers
• GL_ARB_half_float_pixel
• GL_EXT_packed_float
• GL_EXT_texture_array
• GL_EXT_texture_compression_latc
• GL_EXT_texture_filter_anisotropic
• GL_OES_compressed_ETC1_RGB8_texture
• GL_OES_EGL_image
• GL_OES_fbo_render_mipmap
• GL_OES_shader_binary
o (Indicated by the value of GL_NUM_SHADER_BINARY_FORMATS being nonzero).
Indicates that the implementation supports precompiled binary shaders. All of the
demos use this capability on Tegra. See the nv_shader helper library for details
• GL_OES_texture_float
• GL_OES_vertex_half_float
• GL_EXT_texture_compression_dxt1
o The implementation supports specifying textures with the
GL_COMPRESSED_RGB[A]_S3TC_DXT1_EXT formats. Not exported on Tegra, but
supported
• GL_EXT_texture_compression_s3tc
o The implementation supports specifying textures with the
GL_COMPRESSED_RGBA_S3TC_DXT[1,3,5]_EXT formats.
• GL_OES_framebuffer_object
o (Required extension.) Framebuffer objects are supported. Not exported as an extension
string on Tegra, but supported
• GL_OES_mapbuffer
o The implementation supports the glMapBufferOES and glUnmapBufferOES
functions. These are exported directly and do not need to be queried.
• GL_OES_rgb8_rgba8
o The implementation supports GL_RGBA8_OES and GL_RGB8_OES as FBO color
buffer formats.
• GL_OES_stencil8
o Indicates that the implementation can support an 8-bit stencil buffer for render targets.
Not exported as an extension string on Tegra, but supported.
• GL_OES_texture_half_float

An OpenGL document for Tegra says it has these nice NV OpenGL extensions:
NV_shader_framebuffer_fetch
NV_coverage_sample
NV_depth_nonlinear
NV_draw_path
NV_system_time

also says:
GL_NV_fbo_color_attachments
GL_NV_read_buffer

but no info on them:
GL_NV_read_buffer
seems:
NV_shader_framebuffer_fetch

Briefly:
Binary shaders
Non-power-of-two textures..
Texture arrays (DX10 feature)
Vector path rendering (font rendering)
Reading the framebuffer in shaders
Antialiasing extension

Some of these are not supported in OpenGL 3.x, not even as extensions, so hopefully Fermi brings them as extensions:
I would especially want binary shaders similar to D3D11, a vector extension inside OpenGL to avoid using OpenVG, and reading the current rendering framebuffer in shaders

It also seems Tegra 3 is coming next year (a 1-year tick-tock, as Tegra 2 is Tegra 1 stuff made faster?). I think it will have at least the G9x architecture, so at least
full CUDA/OpenCL support with atomics, concurrent kernel execution and memory copies, async memory copies, etc., and also OpenGL 3.2. So it seems OpenGL ES 3.0 is due sometime this year to expose these graphics chips.. I hope it has OpenGL 3.2 plus ARB extensions as its base..
Also 2x perf at least

Some showed Unreal Engine 3 on Tegra 2, similar to the Unreal demo on the iPhone's SGX535 covered by AnandTech..
All GL ES 2.0..

PowerVR also announced the SGX545, which has full OpenGL 3.2, D3D 10.1 and OpenCL.. Nvidia seems to claim similar perf thanks to better drivers, since the PowerVR GPU itself seems better..
It seems it's for the PSP 2..
Also note that for PowerVR it takes two years or more from press release to chip, as the SGX535 was announced in Jan-Feb 2007 and released in June 2009?
I hope that time is now reduced to one year

More CES Processors:
Laptops:
Intel Core i5 Nehalems (32nm, AES and binary field multiplies,etc..)
integrated Intel graphics HD (2x better,OpenGL 2.1 at least) but dual cores..
8 cores Nehalem Becton prototype laptop..
Quadcore AMD mobile chip Q1?

Netbooks:

Quad core ARM (ARM11? i.e. low end) by Marvell, but at 1 GHz at least (40nm)
Dual core Cortex-A9 at 1.5 GHz
Snapdragon 40nm dual core (Cortex-A8 class) at 1.5 GHz
Moorestown preview

Broadcom Crystal HD as hardware video decoder: driver source code for Mac OS and Linux
So a tablet provides:
touch support
phone gadgets: GPS, WiFi, 3G phone, integrated video cam, photo cam?, etc..
larger size than phones: allows use as an e-reader, viewing HD videos, etc..
large on for mp3,..
high perf with Tegra 2

So it is like an iPhone 3GS (phone + iPod (music)) + bigger screen (for photos + videos + e-reader) + faster CPU & GPU + Full HD video encode/decode

Still, a netbook has two advantages:
* (This is big) an x86 CPU, so it also runs Windows, Mac and classic software..
So some ARM virtualization extension + some binary translation to run
Windows via some Wine for ARM would be good..
* ION has OpenCL, CUDA, OpenGL 3.2, D3D10, etc.. (G9x)
Still missing is ION2, which is supposed to be a GT200 core with 3x more cores (48 cores); at least I think 24 cores is low..
This would bring more registers, fewer coalescing issues, local atomics, etc..
Doubles and Fermi stuff (DX11, etc.) are still missing, of course..
This will be solved with Tegra 3..

Of course in desktop we have:
great cpus (x86(-64),virt):soon AVX 8 cores?
great gpus: d3d11 based near 2xx gbytes/s ram
great ram (24gb kits)
great disk (tb hard drives and fast ssd)

Coming is the Apple tablet:
iPhone SDK 4.0 (has WebGL? and OpenCL?)
Is it based on Snow Leopard or iPhone OS? And which touch API: Cocoa Touch ported to Snow Leopard?
Which GPU, is it Tegra 2? Some news say Tegra 2 is for the 2011 iPhone; this is bad..
as Tegra 3 would be out that year (OpenCL for sure..)
Also, with the SGX545 announcement, which has OpenGL 3.x, D3D 10.1 and FULL OpenCL support, it seems a GL ES 3.0 encapsulating OpenGL 3.x support is needed, and
an iPhone with that chip would be good, assuming PowerVR has good OpenCL support


Thursday, 14 January 2010

More news: Found a nice blog with DirectCompute stuff..

Posted on 11:19 by Unknown

1. vReveal HD in Q1 2010
2. Bullet has some of Sony's Physics Effects SDK
3. Free Pascal OpenCL support
4. Erlang OpenCL support
5. http://cudpp.googlecode.com/svn/trunk/cudpp/doc/CUDPP_slides.pdf
(last update 10/12)
In progress: Parallel reduction, more sorts, graphs, trees
remember hash also

6.http://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-5/
now optimized..

7. Regarding the OIT D3D11 demo
http://rapidshare.de/files/48979592/oit_dx11.zip.html
its author has an interesting blog:
http://joescg.blogspot.com/2010/01/compute-shader-application.html
in
http://joescg.blogspot.com

(in Russian)
UAV vs RT
http://joescg.blogspot.com/2010/01/uav-vs-rt.html
OIT / A-buffer demo
http://joescg.blogspot.com/2010/01/oita-buffer-demo.html
Compute Shader Application
http://joescg.blogspot.com/2010/01/compute-shader-application.html
A-buffer
http://joescg.blogspot.com/2010/01/buffer.html
2010
http://joescg.blogspot.com/2009/12/oit11-dx-sdk-d3d11.html
Parallel Prefix Sum (Scan)
http://joescg.blogspot.com/2009/12/parallel-scan.html
Radeon HD 5770
http://joescg.blogspot.com/2009/12/radeon-hd-5770.html
A-buffer/OIT through SM 5.0
http://joescg.blogspot.com/2009/12/bufferoit-through-sm-50.html
Direct3D 11 and Text Drawing
http://joescg.blogspot.com/2009/12/direct3d-11.html
Direct3D 11 Inspection
http://joescg.blogspot.com/2009/12/direct3d-11-inspection.html
Inverse in PS
http://joescg.blogspot.com/2009/12/inverse-in-ps.html
Paper dragon
http://joescg.blogspot.com/2009/11/fermi.html
Ray-Tracing Super Sampling
http://joescg.blogspot.com/2009/11/ray-tracing-super-sampling.html
Column-major vs Row-major
http://joescg.blogspot.com/2009/10/column-major.html
more:
Ray-BV Intersection
http://joescg.blogspot.com/2009/10/ray-sphere-intersection.html
http://joescg.blogspot.com/2009/10/hd-2400-is-slow.html
Raycaster and MIP-filtering
http://joescg.blogspot.com/2009/10/raycaster-vs-rasterizer.html
http://joescg.blogspot.com/2009/10/how-to-pack-normal-into-just-2-bytes.html
http://aras-p.info/texts/CompactNormalStorage.html
http://joescg.blogspot.com/2009/10/woops-unit-triangle-intersection-test.html
http://joescg.blogspot.com/2009/10/gpu-ray-triangle-intersection.html
http://joescg.blogspot.com/2009/10/hybrid-approach-for-refractions.html
http://joescg.blogspot.com/2009/10/how-antialiasing-works-part-4.html
http://joescg.blogspot.com/2009/10/conservative-binary-search.html
http://joescg.blogspot.com/2009/10/fresnel-reflection.html
http://joescg.blogspot.com/2009/10/math-terms.html


Integer GPU computing apps..

Posted on 10:25 by Unknown

Integer programs are now being ported to the GPU en masse..

In one year RSA, mulmod, elliptic curve operations, parts of factoring in ECM and Mersenne (GIMPS) programs, and a discrete logarithm problem solver have been ported:

First see Bernstein's GPU work:
(E)ECM on GPU, January 09: Edwards curves, 48G mulmod/s on a 280-bit modulus on a GTX295
called gpu-ecm
software available (with source) (1-phase) on chung meng cheng's research page..
cuda-eecm: September 09, best optimized Edwards curves on Cell, CPU and GPU
now 500G mulmod/s on a 192-bit modulus (scales as pow(280/192,2)), so 6-7 times faster than the previous record..
The CPU implementation is now ported from GMP to MPFQ, with better EECM usage:
GMP-ECM -> EECM-MPFQ, software with source available at:
I think the GPU version will soon be available on the CPU page..
nearsha gpu and cpu client

For RSA see the Dublin research group (also the best AES implementation and good mulmod on Zp or ZN)

Factoring code:
Msieve 1.44 GPU, win32 binary download:
with a c160 the GPU load is 99%
The SVN source has VC2008 projects by Brian Gladman..
says 27x for a 9800GT vs an Intel Core Duo
examples:
9370548739750343689742077059611741296688413458087068027338328923603585147935698143105876573510157864118212297131774808193943011745511363829026508600700379919701

3414023265048252827894893895448283501597256998523545196425280040055849104721167589947328246556695586532677342768160211760950557294071424000

Mersenne programs:
MacLucasFFTW_cuda (now using CUFFT instead of FFTW): the computations seem validated although it is a direct port.. it uses doubles, so GT2xx I think, and speed is low;
the developers are waiting for Fermi and expect at least a 5x improvement..
Right now a GTX275 with 2048K and 4096K FFTs seems to get 2x the perf of a highly optimized single thread on a 3 GHz Core 2, so Fermi with 5x would at least be better than Nehalem or K10 (?)
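
For readers who have not seen what these Mersenne programs compute, here is a toy sketch of the Lucas-Lehmer test in plain C. It is not taken from MacLucasFFTW: real programs square numbers with millions of digits using FFT multiplication, while this version is limited to exponents up to 31 so that everything fits in 64-bit integers. The function name is made up for illustration.

```c
#include <stdint.h>

/* Lucas-Lehmer test for the Mersenne number M_p = 2^p - 1.
   Toy version: only valid for prime exponents 2 <= p <= 31, so that
   s * s always fits in a 64-bit unsigned integer.
   Returns 1 if M_p is prime, 0 otherwise. */
int lucas_lehmer(unsigned p) {
    if (p == 2) {
        return 1; /* M_2 = 3 is prime; the recurrence needs odd p */
    }
    uint64_t m = (UINT64_C(1) << p) - 1;
    uint64_t s = 4;
    for (unsigned i = 0; i < p - 2; ++i) {
        /* s <- s^2 - 2 (mod M_p); adding m first avoids underflow */
        s = (s * s + m - 2) % m;
    }
    return s == 0;
}
```

The squaring inside the loop is the step the GPU ports replace with an FFT-based multiply, which is why FFT length and double-precision throughput dominate the performance figures quoted above.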

Discrete logarithm solver 0.3: version 0.1 in 2007 was 16x better than the previous state-of-the-art code, all on x86; 0.2 in 2008 added 64-bit support and better scalability; and now, in Sept 2009, there is CUDA code with a Python interface..
The Python interface is promising, as it has a DLL for the CUDA version, so you can see how to call it, and it has cubins.. for testing with decuda..
No sources..

CPU implementations are getting faster:

GMP 5 released with asymptotically better, very fast mul, div, etc., and also mingw64
support, so it is probably better than the previous best (4.3 with Gladman's VS2008 port using yasm) and probably better than MPIR, which is Gladman's Windows stuff with yasm

MPIR 1.3(4) in SVN with Nehalem assembler and tuned mp_param; it seems some of its code, like FFT mul, was already very good before.. so I have to test gmpbench 0.2 with MPIR trunk and GMP 5, x86 and x64, on Windows and Linux at least..

MPFQ 1.0rc2 released in October (Windows support? or fixes..)

There also exist MPFR and a library for transcendentals on Google Code..
Also two breakthrough news items:
pi world record on Nehalem
RSA-768 factored, Zimmermann stuff..

The latest AMD GPU has integer SAD and new integer instructions, see the
SA2009 course..
The Parboil benchmark has a SAD (H.264) test; it would be good to port it to OCL to get SAD optimized with the ATI OCL SAD instructions. What speedup vs Fermi?
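
As a reminder of what a SAD instruction accelerates, here is a minimal scalar reference in plain C. The function name and the flat-array layout are my own choices for illustration; the Parboil test and the AMD instruction work on packed pixel data, which this sketch does not try to reproduce.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of 8-bit pixels,
   the basic cost function of H.264 motion estimation. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    }
    return sum;
}
```

A hardware SAD instruction typically folds the compare, subtract and accumulate for several packed bytes into one operation, which is where the expected speedup would come from.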


MSc project ideas!

Posted on 10:22 by Unknown

These are things I would like to work on if I had to do an MSc:

1. Do a PTX to AMD IL 2.0 converter: use libptx from exoto or the PTX parser of Ocelot, as Barra uses cubin and GPGPU-Sim's input is not known..
Then build an AMD IL codegen from that.. now, with the 5xxx specs, it is good stuff..
Also add PTX 2.0 with the Fermi instructions (ballot, etc.), and use bit insert, SAD, etc. of the AMD 5xxx
if you want to execute
It still lacks the PTX v1.5 of OpenCL, but perhaps the CUDA backend supports it, as cusurf errors show OpenCL uses the CUDA runtime somewhat
Use a cudart or nvcuda library wrapper and send all that to the AMD OCL implementation, or better, trace OpenCL's use of CAL to use the CAL OpenCL special functions and do a CAL wrapper; that's the best..
cubin decoding with decuda..
PhysX, OptiX, etc.

2. Include asm support in OpenCL for AMD and Nvidia, mapping to PTX and AMD IL, by intercepting the OpenCL builtin that gets the binary
include some magic instruction and use an asm("...") builtin function as the magic
then post-merge with the PTX, which seems to be SSA, or do liveness analysis over a CFG
and proper register allocation
Instructions can be universal (addc, a clock instruction for AMD and Nvidia) or special, such as the SAD instruction, etc..
AMD is going to introduce them as intrinsics
also include it in the CUDA compiler, like the addc guy: say nativesadamd(), and intercept it in the OCL wrapper

3. Port and redesign matmul, FFT, sort and other *good* Nvidia implementations to be efficient on ATI

4. Try to fix optimized CUDA codes that don't work on ATI (say, check implicit warp-size-32 assumptions) from 3. Also try to learn general rules of thumb for on-the-fly optimization of PTX or CUDA programs to kernels..


3d stereo: news

Posted on 10:21 by Unknown

nvidia 3d vision
firefox shows nvidia plugin installed
3d surround (3 monitors or projectors)
3d vision dx11 demo so d3d11 supported
fullhd mon laptops integrated emitter
3d bluray
3d youtube adobe

standards
Blu-ray: 33.4 GB per layer (i-MLSE)
DisplayPort 1.2: 3D at 120 Hz, 2560x1600
HDMI 1.4: it seems the PS3 gets a firmware upgrade to 1.4 for 3D compatibility (no higher bandwidth
required, but only 720p?)
no, but 1080p only for Blu-ray (24-30 fps)
WirelessHD 2.0
Intel wireless protocol drivers: still no 3D

glasses
new glasses for amd gpus amd driver improvements for that (behardware)
monitors

sony
62inch, led, 200hz, fullhd, integrated emitter and hdmi 1.4(?) for general

3d compatibility

Now the question is whether the HDMI outputs of Fermi and the 5xxx series are 1.4 already or firmware upgradable, and whether
DisplayPort is 1.2 or firmware upgradable, for 3D output to a Sony TV, for example.
As said, with the PS3, games, Blu-ray and possibly 3D photos can be seen..
On PC, all 3D Vision games, photos, videos.. OpenGL with Quadros on Linux also..
It would be good to have a driver working on Windows, Linux and Mac for all of this, in fullscreen and also in windowed OpenGL..

Web
Next3DTV
Next3D has Stereoscopic 3D Full high definition encoding and player technology for nearly every platform: PCs, Mac, Xbox 360, PS3, Blu-ray disc and television set top boxes. Content encoded with Next3D’s enabling technology delivers full stereoscopic 3D in 1080P high-definition to the home over a broadband, cable or satellite TV connection. Next3DTV will be initially available for PC’s, with support for Mac, game consoles and select set top boxes to follow. For PC users, Next3DTV will support 3D Laptops, most NetTop PCs, and modern PC’s with NVIDIA or ATI video cards.

Sony Imax alliance


Thinking about renderants and bsgp..

Posted on 10:07 by Unknown

See renderants.org
The webpage has BSGP 2.0.2, which says it fixes large code generation and obj issues.
It removes the info in the 2.0.1 release notes about multi-GPU support, the RenderAnts code, etc..
Also, the 2.0.1 release has the RenderAnts code..
In 2.0.2 RenderAnts was removed, so you have to request it at renderants.org; it has the scenes of the paper, but they don't work with the exe I have..
I have to send bug reports..

Also, I want to congratulate you and your students on BSGP; it seems a very good tool for easing CUDA programming..
Being so interested in the work, I now have a bunch of questions..
I have been lucky enough to find BSGP 2.0 released on Microsoft Research, which includes the RenderAnts source code..
I have been reading the manual and the release notes, and I have some questions:

It says it has "Debug support (uses bsgpdbg)". Is this related to the GPU debugging paper (which I have read), i.e. GPU debugging of kernels?
In the paper it's named CUDAdb; is it the same? Also, I can't find how to use it, i.e. in the bin folder I only have the editor and the compiler (ctc)..
If it's not present, are they going to release CUDAdb?

It also says Multi-GPU support (-DmultiGPU), and I think this has to be supported in the source (RenderAnts only?), i.e. it's not a compiler feature, right?..

I also find BSGP good to use, but I see some limitations..

1. It compiles directly to PTX; there is no option for "translating" to CUDA kernel files, with the host source translated to C source..
I say that because it would allow using BSGP on Windows x64, Linux and Mac, where I have the CUDA stack..
Relatedly, I saw that in the TopCoder CUDA contest the second-place winner said he programmed in BSGP and translated to CUDA.. Is he one of the BSGP workers, and did he have such a translator?..
If not, is it really easy to do by hand, for example to translate RenderAnts to CUDA?..
Reading the BSGP paper, I remember getting the idea that the motivation was taking care of global synchronization points and automatically creating multiple kernels to achieve that with maximal data reuse between them..
2. Also, now with OpenCL coming, I see BSGP could get cross-vendor support by targeting OpenCL instead of CUDA. Is BSGP moving to that? The concepts are applicable to other APIs. And if that's not in their interest, would it be possible to get access to the code to try adding it as a backend, i.e. are they going to post BSGP on Google Code, for example, at least the core parts?

In case someone cares, it would be interesting to add this to a GPU bench, if someone does a scientific GPGPU benchmark (mainly OpenCL) for testing GPUs by implementing key building blocks such as:

*LU
*FFTs
*Matmul
*Sparse math
*Sorting
*Peak flops

etc..
mainly leveraging research codes published on the web, and translating them to OpenCL if someone wants.. for example the RenderAnts work..
I have successfully compiled the RenderAnts binary, but I'm restricted to Win32, and a benchmark would want to be Windows, Linux and Mac compatible..
Also, having BSGP code translated to CUDA would allow trying to port it to OpenCL, more or less..
Another possibility is using a raytracer in BSGP as a benchmark, but it has the same OS limitations..

Assuming I use the binary I got, I have tried to get, for example, the open source Big Buck Bunny Blender content, but it doesn't work with RenderAnts with the current Blender and Mosaic versions..
So it would be really useful if someone from BSGP sent me the key steps for obtaining the Elephant RIB scenes used in the paper, or pointed me to some way to obtain them..

Finally, if they don't really have plans on whether to release BSGP as open source or closed source, would there be any problem with using it if I don't want to release the code that uses it?


Thinking about direct3d 11 vs OCL..

Posted on 09:57 by Unknown

update:
Another D3D11 advantage is that it has cross-vendor binary shaders, avoiding the need to ship source code. OpenCL also has binary support, but it is device dependent and perhaps not future proof even among the same devices/platforms..
so it's of no use: you must ship source code, as with GLSL..
Note that the Tegra 2 SDK has an extension (which OpenGL ES 2.0 has)
for compiling GL shader sources to binaries in GLES 2.0 and using them, although it has some issues with fixed state setup such as alpha blending, and you must follow rules; but perhaps this indicates Nvidia will add binary shaders to the desktop GL 3.x drivers?..
A D3D11 disadvantage is that explicit multi-GPU management seems more intricate.. and what about events, command buffers, etc.? Also no profiling info?..
Also there is no concept of platforms and devices in D3D11; everything is a graphics device, and due to that multi-device management seems more complex. I need to test a compute shader program load balancing ATI and Nvidia GPUs. Note this is already achieved on ATI+Nvidia GPUs by the DirectCompute benchmark in OCL mode, and the smallpt 2.0 beta OCL also uses a CPU+GPU (multi-device) load balancer..
so good..

Also found today that at least since CUDA 2.3 we have:
CURTEXPORT cudaError_t CUDARTAPI cudaSetValidDevices(int *device_arr, int len);
which seems similar to compute exclusive mode..
Also cudaMemGetInfo is new in CUDA 3.0..
Couldn't you get that by other means? I must check cudevicequery..
Remember also ATI SKA 1.53 with D3D11 support, which reports whether kernels are ALU limited, TEX limited, etc..
There is no similar info for OpenCL in the profiler in VS..
Also SKA has less info for GLSL shaders.. only HLSL.


AMD news..

Posted on 09:51 by Unknown

0. AMD 56xx series released, with a lot of benchmarks
1. Catalyst 10.1 final on MSI (Windows XP, Vista, 7):
seems to be 8.69 instead of the 8.70 beta, but with the same CAL drivers (for OCL)
OpenGL is 92xx based instead of 93xx, so I don't know if ext_histogram and gl_amd_stencil_write are exposed.. but it seems so..

works in ocl 2.0?

2.DirectX 11 threading support
Catalyst 9.12 has no support for driver command lists nor concurrent creates

There are display lists and command lists; I think one is supported and the other is not,
at least it reports so..

"Is there an ETA on full hardware acceleration for command lists and concurrent creates? Catalyst 9.12 offers just software emulation which is really disappointing ..."

That's correct. I'm going to publish code for detecting these features very soon, I hope, here as a comment..

3. Found some OIT info on ATI Mecha vs the Direct3D 11 SDK, and a demo:

"Mecha/A-buffer implementation "
demo here!

AMD also releases GPU ShaderAnalyzer

What's New in Version 1.53

* Support for Microsoft DirectX 11.
* Support for Catalyst™ driver 9.9-9.12.
* Support for ATI Radeon™ HD 5870 graphics cards.
* Support for ATI Radeon™ HD 5770 graphics cards.
* Fixed support for IL disassembly.
* Fixed issue with simple GLSL shaders.

I have posted this on AMD OCL forums:
Questions #1: about getting peak flops on the AMD OpenCL SDK: getting ISA MAD instructions, and max kernel length..

Hi, I have written some kernels to get near the max theoretical perf on the 5xxx series (5850)

I have written codes for FP single precision, FP double precision, integer and 24-bit integer..

I mainly write kernels using the OCL native mad instructions where appropriate:

mad: for floating point and for doubles

mad24: uses 24-bit integer multiplies

For integers, as there is no OpenCL imad instruction, I write a*b+c

The problem is that all programs compile, but, as inspecting the AMD IL v2 and the 5xxx assembly reveals, I can't get hardware mad instructions used, except for single precision..

Well, for double precision it crashes, so I have to use the a*b+c form..

Although double precision is experimental, I hope you can add mad and fma instructions as fast as you can.. this would let some n-body example attack the Fermi double-precision n-body perf from GTC09 :-)

So briefly:

Integer mad: no OCL instruction exists; I get this ISA:

9 t: MULLO_INT ____, PV8.w, R0.x
10 y: ADD_INT T0.y, T0.w, PS9

Single FP: correct

MULADD_e x,w,z,y

Double precision: using native double mad or fma crashes, and using a*b+c I get (IL):

dmul r177.xy__, r178.xyxy, r177.xyxy
dadd r177.xy__, r177.xyxy, r178.xyxy

Integer mad24:

imul + iadd + ishl + ishr (in AMD IL, but the assembly is the same horrible situation)

Note that the 5850 supports a native MULADD_UINT24 ISA instruction

So note that I can't get better than half the theoretical ops/s in DPFP, integer and 24-bit integer..

In fact the last case is 4x slower (assuming similar time for each instruction)
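
The "half" and "4x" figures follow from simple instruction counting, which can be sketched in plain C. The function below is my own illustration; the 1440 ALUs and 0.725 GHz used in the usage note are the commonly quoted HD 5850 figures, not numbers taken from this post, and the model assumes one instruction per ALU per clock.

```c
/* Peak throughput in G-ops/s when every ALU issues one instruction per
   clock and a multiply-add (2 useful operations) takes `instrs`
   instructions. Simplified model: assumes all instructions cost the same. */
double peak_gops(unsigned alus, double clock_ghz, unsigned ops, unsigned instrs) {
    return alus * clock_ghz * ops / instrs;
}
```

Under those assumptions, peak_gops(1440, 0.725, 2, 1) is about 2088 for a single fused MAD, peak_gops(1440, 0.725, 2, 2) is half of that for a separate mul + add, and peak_gops(1440, 0.725, 2, 4) is a quarter for the four-instruction mad24 sequence.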

One problem I see for mad24 is that AMD IL 2.0 seems not to expose a mul24 instruction, so, as OpenCL seems to generate AMD IL first, how is this going to be solved? In the ISA, MULADD_UINT24 exists

Also, I can't believe AMD is at such an early stage for that special instruction, as OpenCL and DirectCompute can use it to accelerate thread ID index calculations for blocks/grids of fewer than 16M elements.. CUDA programs do it a lot.. I think it's one reason CUDPP limits some functions to 16M elements..
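
To make the mad24 discussion concrete, here is a plain-C sketch of what an emulated signed mad24 has to do when no native 24-bit multiply is used: narrow both operands to 24 bits (the job of the shift pair), multiply, then add. The function names are made up, and reading the imul + iadd + ishl + ishr sequence this way is my interpretation, not something confirmed by AMD. The 16M limit mentioned above is simply 2^24 = 16777216.

```c
#include <stdint.h>

/* Sign-extend the low 24 bits of v. Hardware typically does this with a
   shift-left / arithmetic-shift-right pair; the xor/subtract form used
   here is the portable C equivalent. */
int32_t sext24(int32_t v) {
    int32_t low = v & 0xFFFFFF;
    return (low ^ 0x800000) - 0x800000;
}

/* Emulated mad24: 24-bit multiply, 32-bit add, keeping the low 32 bits.
   The final conversion assumes the usual two's-complement wraparound. */
int32_t mad24_emulated(int32_t x, int32_t y, int32_t z) {
    uint32_t r = (uint32_t)sext24(x) * (uint32_t)sext24(y) + (uint32_t)z;
    return (int32_t)r;
}

/* Typical use: a flat work-item index. Only valid while every operand
   and the result stay below 2^24 (16777216). */
int32_t global_index(int32_t block, int32_t block_dim, int32_t thread) {
    return mad24_emulated(block, block_dim, thread);
}
```

A native instruction such as MULADD_UINT24 would do all of this in one step, which is why the four-instruction sequence costs so much in index-heavy kernels.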

Also, the problem with integers and general code using a*b+c instead of a special mad instruction could be solved if the AMD OpenCL compiler understood "-cl-mad-enable",

but it says:

Warning: invalid option: -cl-mad-enable

Note that I have tested the kernels on Nvidia OCL using a*b+c for all supported data types and they use two instructions (mul+add), but if I use -cl-mad-enable it uses native hardware mad instructions.

One more note: I put a lot of mad instructions inside a loop and the AMD OpenCL compiler crashes, and before crashing it starts to take a long time compiling. As far as I remember, the CUDA compiler handles this test perfectly with a moderate number of mad instructions.

Is there some argument to tell the compiler not to optimize at all? A block of mad instructions can't be optimized anyway.

Or is the problem the compiler expanding the loop? How can I control loop expansion? I think the Nvidia OpenCL compiler recognizes #pragma unroll.
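The test kernel itself is not shown in the post. As a rough sketch of the kind of kernel described, the C helper below generates OpenCL C source with a chosen number of mad() calls inside a loop, so the loop body length can be varied to find where compile time blows up. The kernel name, its arguments and both helper names are assumptions; a real peak test would probably use several independent accumulators rather than one dependent chain.

```c
#include <stdio.h>
#include <string.h>

/* Writes OpenCL C source for a kernel whose loop body holds `mads`
 * dependent mad() calls. Returns the source length, or -1 if `buf`
 * is too small. */
int make_mad_kernel(char *buf, size_t cap, int mads, int iters)
{
    size_t used;
    int n = snprintf(buf, cap,
        "__kernel void mad_chain(__global float *out, float a, float b)\n"
        "{\n"
        "    float x = (float)get_global_id(0);\n"
        "    for (int i = 0; i < %d; ++i) {\n", iters);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (int k = 0; k < mads; ++k) {
        n = snprintf(buf + used, cap - used, "        x = mad(x, a, b);\n");
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, cap - used,
        "    }\n"
        "    out[get_global_id(0)] = x;\n"
        "}\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return (int)(used + (size_t)n);
}

/* Counts non-overlapping occurrences of `needle` in `text`. */
int count_occurrences(const char *text, const char *needle)
{
    int count = 0;
    size_t len = strlen(needle);
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + len, needle))
        ++count;
    return count;
}
```

The generated string would be handed to clCreateProgramWithSource, and the options string of clBuildProgram is where standard flags such as -cl-mad-enable or -cl-opt-disable go, on implementations that accept them.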

If I publish this code as a benchmark, AMD cards will come off badly..

More questions coming..

Thanks

Questions #2: from the SIGGRAPH Asia course..

Hi, I have seen the SIGGRAPH Asia OCL course and I can't believe nobody at AMD has published a link to it:

http://sa09.idav.ucdavis.edu/

Now I have some questions.

First, a good presentation from AMD is not online:

Generic OpenCL Optimizations (Jason Yang, AMD)
Can someone at AMD publish the presentation somewhere? It seems good to learn from.

Also, from:

OpenCL C++ Bindings (Jayanth Gummaraju, AMD)
it seems AMD has a nice OpenCL Ocean demo using FFTs.

Is AMD going to publish the code in the SDK so users can learn from a complete app using OGL interop? Or better: here and now, as a gift for forum readers :-)

Also from

OpenGL Interop Examples (Timo Stich, Nvidia)
although this is from an Nvidia guy :-))

Regarding OGL interop: clGetGLContextInfoKHR seems very useful.

For example, say I want to use OCL/OGL interop and I have both an Nvidia and an AMD card (Windows 7).

I set, for example, the Nvidia monitor as default, and the AMD OpenCL ICD is loaded. If I create an OGL context, by default it will use the Nvidia OGL driver. If I then create an OCL context with OGL interop, it will use the AMD OCL driver, and in fact AMD is fooling us because it will work (so native interop in device memory is not working and it is going through system memory). The other way round, an AMD OGL context and an Nvidia OCL context, returns an error in clCreateContext.

I can get info about these situations and others using:

clGetGLContextInfoKHR

The problem is that it is not in the .lib files and not exported by the Khronos ICD DLL,

but it is in the cl_gl.h file shipped by AMD.

Although this is the AMD forum, the Nvidia situation is worse:

in fact they don't ship a cl_gl.h with the clGetGLContextInfoKHR definition and their older Khronos ICD doesn't expose it.

So when is an SDK with the clGetGLContextInfoKHR function going to be released?

The last one is from:

AMD IHV Talk - Hardware and Optimizations (Jason Yang, AMD)
Questions:

How many of the hardware integer instructions on slide 13 are currently exposed?

Is AMD working on enabling them through extensions?

I'm interested in these (as at least some of them aren't supported by DirectCompute):

*Reverse bits
*Integer Add with carry
*1bit prefix sum on 64b mask. (useful for compaction)
*Shader Accessible 64 bit counter

At least in the ISA docs I can find info about the first two, but I'm interested in

1bit prefix sum on 64b mask. (useful for compaction)
How do I use it? Is there some CAL example? More info please..
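While waiting for details on the hardware instruction, the idea behind a 1-bit prefix sum on a 64-bit mask can be sketched on the CPU. This is plain C showing the concept only, not the AMD instruction or any CAL API; both function names are made up. Each of 64 lanes owns one bit of the mask, and the number of set bits below a lane's position is the output slot that lane writes to during compaction.

```c
#include <stdint.h>

/* Number of set bits in `mask` strictly below bit position `lane` (0..63). */
unsigned prefix_count(uint64_t mask, unsigned lane)
{
    uint64_t below = mask & ((UINT64_C(1) << lane) - 1);
    unsigned count = 0;
    while (below) {
        below &= below - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Compaction over one 64-lane group: lanes whose bit is set in `keep`
 * copy their input to consecutive output slots. Returns the number of
 * elements written. */
unsigned compact64(const int *in, uint64_t keep, int *out)
{
    unsigned written = 0;
    for (unsigned lane = 0; lane < 64; ++lane) {
        if ((keep >> lane) & 1u) {
            out[prefix_count(keep, lane)] = in[lane];
            ++written;
        }
    }
    return written;
}
```

On the GPU the loop over lanes would disappear: every lane would compute its own slot in one step, which is what makes a native instruction attractive.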

Also, more info on the "Shader Accessible 64 bit counter", please..

Which ISA instruction is it?

Searching the ISA docs I find:
ALU_SRC_TIME_HI: Upper 32 bits of 64-bit clock counter.
228 ALU_SRC_TIME_LO: Lower 32 bits of 64-bit clock counter.
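If the counter really is exposed as two 32-bit halves like this, using it comes down to gluing the halves together and subtracting two readings. A minimal C sketch, with hypothetical function names, that says nothing about how the two halves are actually read on the hardware (for instance whether both are sampled atomically):

```c
#include <stdint.h>

/* Builds the 64-bit counter value from its upper and lower 32-bit halves. */
uint64_t combine_counter(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Ticks elapsed between two readings; unsigned subtraction stays correct
 * even if the 64-bit counter wrapped once in between. */
uint64_t elapsed_ticks(uint64_t start, uint64_t end)
{
    return end - start;
}
```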

At least for Integer add with carry we have a CUDA enabled compiler:

http://www.mpi-inf.mpg.de/~emeliyan/cuda-compiler/
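For reference, this is what an add with carry buys you, sketched in portable C rather than any GPU ISA (the function names are illustrative). With a carry output, wide integer additions can be chained limb by limb.

```c
#include <stdint.h>

/* 32-bit add with carry in and carry out, emulated with a 64-bit sum. */
uint32_t add_with_carry(uint32_t a, uint32_t b, uint32_t carry_in,
                        uint32_t *carry_out)
{
    uint64_t wide = (uint64_t)a + b + (carry_in & 1u);
    *carry_out = (uint32_t)(wide >> 32);
    return (uint32_t)wide;
}

/* 64-bit addition built from two 32-bit limbs (index 0 = low limb). */
void add64_limbs(const uint32_t a[2], const uint32_t b[2], uint32_t sum[2])
{
    uint32_t carry = 0;
    sum[0] = add_with_carry(a[0], b[0], 0, &carry);
    sum[1] = add_with_carry(a[1], b[1], carry, &carry);
}
```

Without a hardware carry flag, each limb needs extra compare instructions to recover the carry, which is where a native instruction would help multi-precision integer code.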

And what about the DX11-based ones: find first bit, etc.?

Also, as said in an earlier post: 24-bit integer MUL, MULADD.
Well, this isn't generated even when using OCL mad24,

so I don't know if:

- Heavy use for Integer thread group address calculation
is correct on the slide..


News learned these days..

Posted on 09:50 by Unknown

First, Snow Leopard is getting OGL 3.0 in 10.6.3.. holy cow..
http://netkas.org/?p=362

Also Mesa 8.0 will get OpenGL 3.x (sometime this year I hope).. see Phoronix..

So it seems hardware and emulated (?) OGL drivers are getting OGL 3.x (on Apple and Linux)

Second, AMD has also released a new GPU ShaderAnalyzer

What's New in Version 1.53

* Support for Microsoft DirectX 11.
* Support for Catalyst™ driver 9.9-9.12.
* Support for ATI Radeon™ HD 5870 graphics cards.
* Support for ATI Radeon™ HD 5770 graphics cards.
* Fixed support for IL disassembly.
* Fixed issue with simple GLSL shaders.

Does it include support for compute shaders? Yes..

ATI Catalyst 10.1 BETA Adds New OpenGL Extensions

smallpt OpenCL, now multi-GPU enabled:
http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha2.tgz
(found in the Beyond3D forums; a GTX 295 scales nearly 2x)

OpenGL Extensions Viewer 3.16

3d vision linux on lowend quadros? (follow)
http://70.87.46.147/vbulletin/showthread.php?t=143514

http://forum.beyond3d.com/showthread.php?t=56105

N-Queen Solver for OpenCL
I rewrote my old n-Queen solver from CUDA into OpenCL. It's not a port, because I rewrote everything from scratch (only the idea remains the same). I was hoping to use OpenCL for comparison between AMD and NVIDIA hardware.

Unfortunately, for some reason, AMD's OpenCL compiler crashed when compiling my kernel for GPU (it's ok for CPU version though). So right now it doesn't work on AMD's GPU at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.

Both source and executable are provided in the attachment.

The arguments are:

nqueen_cl -cpu -clcpu -p -thread # -platform # -local N

-cpu: use CPU implementation (single threaded)
-clcpu: use OpenCL CPU devices
-p: enable profiling
-thread #: set the number of threads. (default: max work group item number * number of devices)
-platform #: select a platform (default: select platform 0)
-local: use local memory (shared memory) for arrays. This runs faster on NVIDIA's GPUs because NVIDIA GPUs have no indexed registers.
N: the board size (1 ~ 32)

Note: large board size (> 20) takes forever to run (I estimate that 20 queen would take more than 2 hours to run on 9800GT).

The N-queen algorithm is a straightforward one. There are no special considerations to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions). Only a simple mirror reduction is used.
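The solver's kernel is not reproduced here, so the following is only a generic CPU sketch in C of the idea described: a straightforward backtracking count with a simple mirror reduction. The bitmask representation and the function names are assumptions, not the poster's code. Only the left half of the first row is searched and the count is doubled, and for odd N the middle column is added once.

```c
#include <stdint.h>

/* Counts completions of a partial board. `cols`, `d1`, `d2` mark the
 * columns and the two diagonal directions attacked in the current row. */
uint64_t place_rows(uint32_t all, uint32_t cols, uint32_t d1, uint32_t d2)
{
    if (cols == all)
        return 1;                             /* every column holds a queen */
    uint64_t count = 0;
    uint32_t free_sq = all & ~(cols | d1 | d2);
    while (free_sq) {
        uint32_t bit = free_sq & (0u - free_sq);   /* lowest free square */
        free_sq ^= bit;
        count += place_rows(all, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Number of N-queen solutions for 1 <= n <= 31, using mirror reduction. */
uint64_t count_nqueens(int n)
{
    if (n < 1 || n > 31)
        return 0;
    uint32_t all = (1u << n) - 1u;
    uint64_t total = 0;
    for (int c = 0; c < n / 2; ++c) {         /* left half of the first row */
        uint32_t bit = 1u << c;
        total += place_rows(all, bit, bit << 1, bit >> 1);
    }
    total *= 2;                               /* mirrored boards */
    if (n & 1) {                              /* middle column, counted once */
        uint32_t bit = 1u << (n / 2);
        total += place_rows(all, bit, bit << 1, bit >> 1);
    }
    return total;
}
```

count_nqueens(8) returns 92, the well-known count for the 8x8 board. A GPU version cannot use recursion like this, which is why the post talks about arrays that simulate a stack.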

Running on my computer (Core 2 Duo E8400 3.0GHz):

16 queen -cpu: 7.27 s
16 queen -clcpu: 3.79 s

On GPU (GeForce 9800GT):

16 queen: 2.36 s
16 queen -local: 1.23 s

About the crash problem: the original kernel I've developed (i.e. the kernel used by the clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queen on my 4850, and > 7 s on 9800GT) because it uses four arrays to simulate a stack for recursion. These arrays are good for the CPU version because they reduce the amount of computation, but for GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore data for each step. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower). I'm using Cat 9.12 hotfix right now.
Attached files:
nqueen_cl.rar (11.4 KB)
nqueen_cl_src.rar (17.9 KB)

rumors

NVIDIA GeForce GTX 395 Specs :

- Codename "GF104".
- Dual Core GPU Design (Two GF100 "Fermi" Cores).
- 6.4 Billion Transistors In Total (TSMC 40nm Process).
- 32 Streaming Multiprocessors (SM).
- Each SM has 2x16-wide groups of Scalar ALUs (IEEE754-2008; FP32 and FP64 FMA).
- The 32 SMs Have 1536KB Shared L2 Cache.
- 1024 Stream Processors (1-way Scalar ALUs) at 1350MHz.
- 1024 ALUs In Total.
- 1024 FP32 FMA Ops/Clock.
- 512 FP64 FMA Ops/Clock.
- Single Precision (SP; FP32) FMA Rate : 2.76 Tflops.
- Double Precision (DP; FP64) FMA Rate : 1.38 Tflops.
- 256 Texture Address Units (TA).
- 256 Texture Filtering Units (TF).
- INT8 Bilinear Texel Rate : 153.6 Gtexels/s
- FP16 Bilinear Texel Rate : 76.8 Gtexels/s
- 80 Raster Operation Units (ROPs).
- ROP Rate : 48 Gpixels/s
- 600MHz Core.
- 640bit (2x320bit) Memory Subsystem.
- 4200 MHz Memory Clock.
- 336 GB/s Memory Bandwidth.
- 2560MB (1280MB Effective) GDDR5 Memory.
- New Cooling Design.
- High Power Consumption.
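The rate figures in this rumored list are at least internally consistent, which a few lines of C can check. The sketch assumes the usual conventions: one FMA counts as 2 flops, texel and pixel rates are units times the 600 MHz core clock, and the 4200 MHz memory clock is the effective data rate. The function names are made up.

```c
#include <stdint.h>

/* FMA rate in MFLOPS: each ALU retires one FMA (2 flops) per clock. */
uint64_t fma_mflops(uint64_t fma_units, uint64_t clock_mhz)
{
    return fma_units * 2u * clock_mhz;
}

/* Fill rate in millions of texels or pixels per second. */
uint64_t fill_rate_m(uint64_t units, uint64_t clock_mhz)
{
    return units * clock_mhz;
}

/* Memory bandwidth in MB/s from bus width (bits) and effective data rate. */
uint64_t bandwidth_mb_s(uint64_t bus_bits, uint64_t effective_mhz)
{
    return bus_bits * effective_mhz / 8u;
}
```

fma_mflops(1024, 1350) gives 2764800 MFLOPS (about 2.76 Tflops), fma_mflops(512, 1350) gives 1382400 (about 1.38 Tflops), fill_rate_m(256, 600) gives 153600 (153.6 Gtexels/s), fill_rate_m(80, 600) gives 48000 (48 Gpixels/s), and bandwidth_mb_s(640, 4200) gives 336000 MB/s (336 GB/s).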

GTX395 is 60% faster than GTX380
GTX395 is 70% faster than HD5970
Release Date : May 2010, Price : 499-549 USD.

Testing Snow Leopard OpenCL on a PC:
*on VMware Workstation 7 with Fusion 3 gadgets
*hackintosh distro: Mac OS X Snow Leopard Universal v3.6 (10.6.2) _Reup


Monday, 4 January 2010

GPU Computing calendar for Feb March 10!

Posted on 01:48 by Unknown

If you follow GPU computing news, there has been an OpenCL update at least every
month since May 2009 (except July 2009..):
(April 2009): Nvidia private beta..
May 2009: Nvidia release to registered GPU computing developers (185 based)
June: new Nvidia update fixing issues on WinXP
Aug: AMD ships its OpenCL CPU implementation. Snow Leopard ships with OpenCL for AMD and Nvidia GPUs.
Sep:
(11) Nvidia release to registered GPU computing developers (190 based)
(28) Nvidia first public developer driver and OpenCL SDK (same as 11)
Oct: Nvidia OpenCL shipping in leaked beta 195 driver.
AMD shipping GPU OpenCL in beta4.
Nov: Nvidia public driver and OpenCL SDK (195, CUDA 3.0)
Dec: AMD ships 2.0 production

DirectX has had a lot fewer:

March 2009 (the Nvidia DirectCompute SDK was initially compiled with this one, so it supported hardware drivers..)
August (actually released in September..)
Well, I was expecting a new SDK for December but that's over..

So I don't expect new drivers until late Feb, and surely by March..
AMD has updated roughly every two and a half months (early August, mid Oct, end Dec), so the next update should come in March.. also because of the Christmas holidays..
From Nvidia, don't expect big things until Fermi and the 200 drivers..

Expect a lot in March, also due to Fermi and GDC 2010:
Nvidia

Fermi launch drivers (200 drivers?)
CUDA 3.1 preview?
new OpenCL ext (3d image write,fp16)
OpenCL 1.1 preview?

Fermi:
OpenGL extensions for Fermi
D3D 11 DC5.0 for Fermi
3D Vision: windowed support, youtube jps in browser,support for D3D11 games..
OpenGL Fermi extensions
Optix 1.1 fermi optimized
Physx 3.0 GDC with APEX
Cg 3.0
VDPAU with H.264 MVC (3D Bluray)

AMD
New OpenCL SDK with image support, 3d image writes, byte addressable, fixed icd,
opengl interop
opengl 5xxx ext.

Microsoft
March sdk with doubles fixed

Apple
10.6.3 final or preview to developers
opencl fixes fft lib
opencl image support for 4xxx
doubles for gt200

preview of caustic asic card perf.


Blog 2009 posts in PDF!

Posted on 01:29 by Unknown

Get it!
in total 125 posts and 190 pages..
Expect a lot of written english mistakes..
Not bad for a blog started on Windows 7 launch day more or less (22 October 2009)..

