February 2010 ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Friday, 26 February 2010

Reading Fermi CUDA stuff!

Posted on 11:30 by Unknown

Fermi compatibility guide:
CUBINs are only forward compatible within a major revision, so 1.x cubins only work on the Tesla arch, not on Fermi..
Nvidia drivers 195 and up support forcing JIT compilation of the PTX embedded in an executable by setting the CUDA_FORCE_PTX_JIT environment variable, which is a way of checking whether a CUDA program supports Fermi: if the executable doesn't contain PTX code, it will fail..
When optimizing for the Fermi arch (since CUDA 3.0) it is better to explicitly add code=compute_20 alongside code=compute_10 so it generates better code (?). Also add sm_10 and sm_20 for cubins.
It seems that cubins are cached by the driver, so they are generated once and survive reboots, crashes, etc.. (where are they stored?)..
For the CUDA driver API use nvcc -ptx and load the result using cuModuleLoadDataEx..
Since CUDA 2.1, compiling with arch=sm_xx (default standard (?)) evaluates to code=sm_xx (cubin) plus code=compute_xx (PTX), so PTX code is embedded..
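
The compatibility rules above can be sketched as two small checks. This is only an illustration in plain C++: the struct and function names are made up for this note, and the rules encoded are just the ones stated above (cubins stay within a major revision, PTX can be JIT-compiled for newer devices).

```cpp
#include <cassert>

// Compute capability of a device or of a compile target, e.g. {1, 3} or {2, 0}.
// Hypothetical helper type, not part of any CUDA header.
struct ComputeCapability {
    int major;
    int minor;
};

// A cubin built for sm_XY only runs on devices with the same major
// revision and an equal or higher minor revision.
bool cubinRunsOn(ComputeCapability cubin, ComputeCapability device) {
    assert(cubin.major >= 1 && device.major >= 1);
    return cubin.major == device.major && cubin.minor <= device.minor;
}

// PTX built for compute_XY can be JIT-compiled by the driver for any
// device whose compute capability is equal or higher.
bool ptxRunsOn(ComputeCapability ptx, ComputeCapability device) {
    assert(ptx.major >= 1 && device.major >= 1);
    if (device.major != ptx.major) return device.major > ptx.major;
    return device.minor >= ptx.minor;
}
```

So an executable carrying only sm_10 and sm_13 cubins fails both checks on a 2.0 device, while one that also embeds compute_10 PTX passes the second.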

Fermi tuning guide & Programming guide 3.0:
*New graphics interop API, with texture interop and DX11 support (page 37 for cudart, page 63 for the driver API)
*Interop between cudart and the driver API (page 72):
->allocate mem with whichever API
->if the context was created with the driver API, the first CUDA runtime call doesn't create a new context, so CUBLAS and CUFFT work from the CUDA driver API
->doesn't work with emulation or with the cuCtx{push,pop} functions..
*Use concurrent kernels:
check concurrentKernels in the properties returned by cudaGetDeviceProperties()
and use multiple streams..

(Up to 4 ONLY, and only from one context? I expected up to 16, as that is the number of SMs, so it seems to be one kernel per Graphics Processing Cluster and not per SM.) The same-context requirement also rules out running multiple CUDA executables in parallel to extract more perf, so it is not like using CPU cores to run multiple single-threaded apps. This is a shame: the hardware has fast context switching, but for badly coded CUDA programs running side by side that only reduces the switching overhead, they still don't run in parallel..
I suppose it's a software implementation issue that will be fixed in CUDA 3.x or, if not, in Fermi 2, so that we can run as many kernels as SMs and any number of contexts in parallel, although possibly each SM can run only one context..
*Arithmetic instruction perf table (page 90):
Remember Tesla has 8 cores per SM and the fast ops execute one warp in 4 clocks, so 8 instructions per clock per SM.
Fermi has 32 cores per SM, and 16 SMs.
Note 32-bit integer is as good as floating point on Fermi, so imad=mad in perf.. must check.
*All global mem and shared mem accesses are now processed per warp, not per half-warp as before, so check that everything still goes well.
Shared mem is expanded to 32 banks.
Now, with the cache, global mem coalescing seems less of a requirement, and shared mem is also much better, as the only bank conflicts are "when 2 or more threads request data in different words of the same bank", i.e. it has multi-word broadcast, etc.. So 8-bit, 16-bit, 32-bit (always fast), 64-bit (doubles) and even 96-bit (was always) sequential accesses are now as good as 32-bit ones, and so are, for example, 32-bit accesses at an 8-bit offset.. Well, I don't know whether an 8-bit offset for 32-bit words is bad or not, as that would require splitting every word across two banks and I don't know if those are served jointly, but I presume it runs fast without bank conflicts!
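
To play with the bank conflict rule quoted above, here is a rough CPU-side model in plain C++. It assumes 32 banks that are each 32 bits wide, with successive 32-bit words mapped to successive banks, and it only looks at the word each thread's starting byte address falls in; any special handling of 64-bit or wider accesses is not modelled. All names are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Assumed layout: 32 banks, each 4 bytes wide.
constexpr unsigned kBanks = 32;
constexpr unsigned kBankWidthBytes = 4;

// Returns the largest number of DIFFERENT words that the threads of one
// warp request from a single bank. 1 means conflict free; threads reading
// the same word count once, which models the broadcast behaviour.
unsigned bankConflictDegree(const std::vector<std::uint32_t>& byteAddresses) {
    assert(byteAddresses.size() <= 32);  // at most one address per thread of a warp
    std::map<unsigned, std::set<std::uint32_t>> wordsPerBank;
    for (std::uint32_t addr : byteAddresses) {
        std::uint32_t word = addr / kBankWidthBytes;
        wordsPerBank[word % kBanks].insert(word);
    }
    unsigned degree = 0;
    for (const auto& entry : wordsPerBank) {
        unsigned distinctWords = static_cast<unsigned>(entry.second.size());
        if (distinctWords > degree) degree = distinctWords;
    }
    return degree;
}

// Addresses for a warp in which thread t touches byte offset t * strideBytes.
std::vector<std::uint32_t> stridedWarp(std::uint32_t strideBytes) {
    std::vector<std::uint32_t> addresses;
    for (std::uint32_t t = 0; t < 32; ++t) addresses.push_back(t * strideBytes);
    return addresses;
}
```

In this model sequential 8-bit and 32-bit accesses (strides of 1 and 4 bytes) come out conflict free, a stride of two words gives a 2-way conflict and a stride of 32 words gives the worst case of 32.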
*Similar to DC, which seems to require knowing the group size at compile time (DispatchIndirect is only for grid size?):

and to OpenCL, which has __attribute__((reqd_work_group_size(X, Y, Z)))

The workgroup size that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.

CUDA introduces __launch_bounds__, appended to kernel declarations to specify the max workgroup (block) size and the desired min blocks per SM (occupancy), so the compiler can optimize register usage (spilling)..

*By default (i.e. compiling source programs without changes) the L1 cache size will be 16 KB, so shared mem grows 3x per SM (from 16 KB to 48 KB).. The function for setting is
*We knew global mem was cached by a hardware cache (L2) and that there is an L1 cache at least 16 KB in size.. I presumed this was used for caching global mem, but it turns out that L1 caching of global mem can be disabled (at compile time; at runtime would be better).. So what is L1 used for? Local mem, for register spilling for example, where it can't be disabled..
*Read-only data in global mem (like const variables in C++) that is used by all threads in a kernel is cached through the constant cache (it doesn't require the __constant address space)..
*Don't use 24-bit integer ops, they are slow on Fermi; check the target at compile time with __CUDA_ARCH__ (device code only)
but guide says..
*FP ops are higher precision, so results can differ from Tesla
*As Fermi supports a 64-bit address space, passing -m64 to nvcc compiles both host code and device code as 64-bit, and 64-bit device code is slower than 32-bit.. So if you don't need a 64-bit address space on the device but do compile the host as 64-bit (i.e. the GPUs your program will run on have less than 4 GB, or the program needs less than that anyway), compile the kernel code separately from the host code..
*CUDA C++:
function overloading: f(int a) f(double a)
default parameters: f(a,b=0);
namespaces: namespace nv{ int a;} namespace ati{ int a;} nv::a=2; ati::a; using namespace nv; a=3;(nv )
operator overloading uchar4 operator+() uchar4 a,b,c; c=a+b;
implicit, explicit and specialized templates: f(x) or int x; f(x) and f(x){return(2);} f ret(3);
Fermi stuff:
classes and functors.

It seems support for virtual functions and function pointers is still missing.. but coming..
Recursion and mem allocation inside kernels are still lacking and coming much later (?)..
Remember all of that is supported in hardware..

Search for the new Fermi insts in B.5, B.6, B.11 (pages 103, 104)
I don't know if B.12 is new: __prof_trigger exposes 8 counters, which are incremented per warp each time it is called and can be queried by the profiler.. it would be good if you could read them with another inst in the kernel? must think..
B.14 has the launch bounds doc (page 112).
Appendix G has the architecture feature chart (G1)
LACKING documentation in the guide:
*Launching of 3D grids! (well, in B.4 on page 102 you find gridDim is of type dim3, but in 2.2 on page 8 you see blocks are arranged in a 1D or 2D grid, and in B.13 on page 111 it says the grid is dim3 but with .z=1) (DC 5.0 has it, and the OpenCL model (the API) supports it too)
*Surface functions (I hope they are not left for CUDA 3.1 or later, as Fermi supports them, and even Tesla does, since they are used for RWTexture in DC and for image writes in the OpenCL driver)
*Info that Fermi allows simultaneous D2H and H2D transfers via the async functions (check concurrent bandwidth test 1.1)

Also some things I was unaware of:
the use of __restrict__ on CUDA pointers, and some info about CUDA, SLI and D3D graphics interop..


Thursday, 25 February 2010

Questions about OpenCL AMD d3d9 interop!

Posted on 07:47 by Unknown

Which is the correct way? Is the API stable?
See http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=128467&enterthread=y

Hi, I have coded an example to see how D3D interop works..
I see up to three APIs. One is in
KB91 - Additional Header File Required For Preview Feature: ATI Stream SDK v2.01 Support for OpenCL™ / Microsoft® DirectX® 9 & 10 Interoperability
which shows clEnqueueReleaseExternalObjects in cl_amd.hpp, similar to clEnqueueReleaseGLObjects for GL interop..
This seems to be the correct way according to the KB, but I can't find clEnqueueReleaseExternalObjects with clGetExtensionFunctionAddress, so I also looked
in cl_d3d9.h, where I see
clEnqueueAcquireD3D9ObjectsKHR, which is another way..
This one is found by clGetExtensionFunctionAddress.
Then for buffer interop we similarly find two functions:
clCreateFromD3D9BufferKHR
and then below is
"//
// Legacy AMD CL-D3D9 interop extension
//"
with
clCreateFromD3D9Buffer
function.
With clGetExtensionFunctionAddress I found clCreateFromD3D9BufferKHR, which I think is the correct one.
I'm also asking about the correct texture interop (yes, I know image support is required, but I'm talking about having correct/stable source code here, not about testing).
Must I use clCreateFromD3D9TextureKHR?
For testing I create the context with
cl_context_properties cps[6] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)platform,
        CL_CONTEXT_D3D9_DEVICE,
        (cl_context_properties)g_pd3dDevice,
        0
    };
with a D3D device already created, and then use
CL_API_ENTRY cl_mem (CL_API_CALL
*myclCreateFromD3D9TextureKHR)(
    cl_context          /* context */,
    cl_mem_flags        /* flags */,
    IDirect3DTexture9 * /* texture */,
    HANDLE              /* shared_handle */,
    UINT                /* miplevel */,
    cl_int *            /* errcode_ret */);
myclCreateFromD3D9TextureKHR=(P2)clGetExtensionFunctionAddress("clCreateFromD3D9TextureKHR");
(under another name, as clCreateFromD3D9TextureKHR is already defined in the header)
and then calling
myclCreateFromD3D9TextureKHR(context,CL_MEM_READ_WRITE,g_inputTex,g_handle,0,&status);
where g_inputTex is already created and g_handle is the last parameter returned by CreateTexture. Is this correct?
This way I get a runtime error after enabling images with "set GPU_IMAGES_SUPPORT=1".
Setting the shared handle parameter to NULL
myclCreateFromD3D9TextureKHR(context,CL_MEM_READ_WRITE,g_inputTex,NULL,0,&status);
returns CL_INVALID_D3D_OBJECT.
What's the correct API? Is it stable?
Is there any sample showing interop?
Is D3D11 interop coming, as Nvidia has a D3D11 extension published in the Khronos registry?..


News 25/2!

Posted on 07:46 by Unknown

*GPU-Z 0.3.9 fixes OpenCL reporting on ATI!
*The Wind Top desktop has a 24-inch 3D 120 Hz Full HD multitouch monitor! It seems to be the first!
Together with the Dell U2711 you have everything I want from monitors in just two monitors
(well, the Dell only adds 10-bit color, 27 inches and 2560x1440 res)!

The Wind Top desktops have HD (1080p resolution) displays ranging from 19 to 24 inches. At the top of the list are the Wind Tops AE2280 and AE2420, 22- and 24-inch multi-touch displays respectively, equipped with processors up to Intel Core i7. The 24-inch model features a 120Hz LED display that pairs with 3D shutter glasses. (That 3D trend isn’t dying off so fast.)

Nexus and C#

Yes, you can use Parallel Nsight/Nexus to debug CUDA C kernels written in C# or other CPU languages, but Nsight doesn't directly support the C# project type yet.
So to use CUDA.NET with Nsight, you'll need to create a dummy C++ project whose 'command' in your Nexus User Properties points to your C# executable.
Then choose Start CUDA Debugging from the Nexus menu in Visual Studio, and you should be off and running. AFAIK, you'll still need to program the actual GPU code in CUDA C.

Pages with GPU computing stuff!
See the new (?) http://developer.nvidia.com/object/gpucomputing.html
It has 3 guides with Fermi stuff!

The programming guide didn't mention that GF100 is capable of simultaneous cuMemcpyDtoHAsync and cuMemcpyHtoDAsync transfers. I've added this to my good ol' concurrent bandwidth test and will be updating that in the near future.

Search for concurrent bandwidth test 1.1 for Fermi!

Still missing is the CUDA Developer Guide for Optimus Platforms.

__global__ function parameters are passed to the device:
* via shared memory and are limited to 256 bytes on devices of compute capability 1.x,
* via constant memory and are limited to 4 KB on devices of compute capability 2.0.
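
A quick way to keep those two limits in mind is a host-side size check. The sketch below is plain C++ with made-up names (kernelParamLimitBytes, fitsAsKernelParams and BigArgs are hypothetical); it only encodes the two numbers quoted above and ignores any alignment padding the real ABI may add.

```cpp
#include <cassert>
#include <cstddef>

// Limits quoted above: 256 bytes on compute capability 1.x (shared memory),
// 4 KB on compute capability 2.0 (constant memory).
std::size_t kernelParamLimitBytes(int ccMajor) {
    assert(ccMajor == 1 || ccMajor == 2);
    return ccMajor == 1 ? 256 : 4096;
}

// Hypothetical argument block that a kernel would take by value.
struct BigArgs {
    float coeffs[100];
    int n;
};

// True if a by-value argument block of type Args stays within the limit.
template <typename Args>
bool fitsAsKernelParams(int ccMajor) {
    return sizeof(Args) <= kernelParamLimitBytes(ccMajor);
}
```

With these numbers BigArgs (a bit over 400 bytes) would be rejected for 1.x devices but accepted for 2.0.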

others:
http://www.directx11tutorials.com/
[JumpToDX11-11] DirectCompute
http://vsts2010.net/220
http://www.opengpu.org/bbs/archiver/

Ivan Golubev's is the blog to follow for crypto and integer ops on GPUs!
http://www.golubev.com/blog/
He says he has added the AMD IL v2 bitalign instruction for MD5 and SHA1 cracking on 5xxx GPUs, and he has a post estimating perf even for Fermi GPUs..
Search for ighashgpu 0.70, which has this support; test MD5 and SHA1 perf with:
ighashgpu.exe /h:96b13dbbc9f3bc569ddad9745f64b9cdb43ea9ae /t:sha1 /c:sd /max:7
ighashgpu.exe /h:cbe1d6d5800ec1e03a5f2a64882a0d41 /t:md5 /c:sd /max:7
In a post from around the end of January you can also find the SSE code used in his program..
VS CUDA:

You should be able to implement bit rotations using the bit-align instruction introduced with Direct3D 11 and supported on both Fermi and Cypress (computes ((a:b) >> c) & 0xffffffff, where a:b is the concatenation of two 32-bit operands).
This adds nothing to the "NVIDIA vs. AMD" debate, but should provide a nice further improvement compared to the previous generation.
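
The bit-align formula quoted above is easy to model on the CPU. The sketch below is plain C++ rather than GPU code, and the function names are made up for this note; it just shows why feeding the same operand twice turns bit-align into a rotation.

```cpp
#include <cassert>
#include <cstdint>

// Software model of bit-align: ((a:b) >> c) & 0xffffffff, where a:b is the
// 64-bit concatenation with a as the high word and b as the low word.
std::uint32_t bitalign(std::uint32_t a, std::uint32_t b, unsigned c) {
    assert(c < 32);  // the model only covers shift amounts of 0..31
    std::uint64_t wide = (static_cast<std::uint64_t>(a) << 32) | b;
    return static_cast<std::uint32_t>(wide >> c);
}

// Rotate right: the bits shifted out of the low copy of x are refilled
// from the high copy of x.
std::uint32_t rotr32(std::uint32_t x, unsigned n) {
    return bitalign(x, x, n % 32);
}

// Rotate left by n is a rotate right by 32 - n.
std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    return bitalign(x, x, (32 - n % 32) % 32);
}
```

For example rotr32(0x12345678, 8) gives 0x78123456, which is the kind of rotation MD5 and SHA1 rounds are full of.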

Maybe some other tricks are possible...
For instance both G80 and Fermi support free binary negation of operands to logic instructions (allowing NOR, NAND, NXOR, ANDN...), and Fermi supports a left shift followed by an addition as a single instruction.

Edit: also, there is always the MAD24 instruction for computations such as 5*i+1 (much faster than adds).
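
For reference, here is what a 24-bit multiply-add computes, modelled in plain C++. This is an unsigned sketch with made-up names (umad24, scaledIndex); it assumes the usual definition where only the low 24 bits of each factor are multiplied and the low 32 bits of the product are kept, and it says nothing about speed.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned model of MAD24: multiply the low 24 bits of a and b, keep the
// low 32 bits of the product, then add c (wrapping at 32 bits).
std::uint32_t umad24(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    std::uint64_t product =
        static_cast<std::uint64_t>(a & 0xFFFFFFu) * (b & 0xFFFFFFu);
    return static_cast<std::uint32_t>(product) + c;
}

// The 5*i+1 computation from the text, valid while i fits in 24 bits.
std::uint32_t scaledIndex(std::uint32_t i) {
    assert(i <= 0xFFFFFFu);
    return umad24(5, i, 1);
}
```

Note that bits above bit 23 of either factor are silently ignored, so this only matches a full 32-bit multiply-add when both factors fit in 24 bits.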

Benchmark Reviews has NVIDIA nTeresting: 22 February 2010!

Limitations in OpenCL
1. Can I include C inline assembly code in my OpenCL code?
2. Does OpenCL support addition and subtraction with carry?
On AMD there are also these current limitations:
Pinned mem is lacking!
It uses one UAV for all allocations, so max usage is 256 MB!

Nvidia does not have these two limitations, nor do they apply through DirectCompute!
Regarding the two OCL limitations: the CAL++ author has the first (inline assembly) in his TODO list, and the second is an assembly instruction on 5xxx, so once it is in AMD IL the author can add it!
Also, for Nvidia through CUDA there is an ADDC-enabled compiler referenced in previous posts, and
inline assembly is unofficially supported in CUDA!
In Nvidia OCL you can modify the PTX code on the fly, add addc and feed it back!

How to wait for kernel completion without CPU usage (from Golubev's blog):
CUDA: create the context with CU_CTX_BLOCKING_SYNC
CAL: there is an undocumented feature, calCtxWaitForEvent
True, ATI has messed up again: GPU kernels compiled with Catalyst 9.12 are 10% slower on RV8x0 and somewhere around 2-3 times slower on RV7x0. This happened because the ATI CAL compiler now aggressively unrolls absolutely everything, so the kernel grows to a few hundred KB, no longer fits in the cache ... and performance collapses

OpenCL for FreeBASIC: http://shiny3d.de/libs/fbOpenCL.zip
Remember there are also bindings for FreePascal and Delphi!

5 Questions -- Implementing a bunch of OpenCL tools

Texture sharing
I think this is what you must use in OpenCL D3D interop..

http://msdn.microsoft.com/en-us/library/ee418929%28VS.85%29.aspx

ID3D10Device::OpenSharedResource
To share a resource between a Direct3D 9 device and a Direct3D 10 device the texture must have been created using the pSharedHandle argument of CreateTexture. The shared Direct3D 9 handle is then passed to OpenSharedResource in the hResource argument.

The following code illustrates the method calls involved.

sharedHandle = NULL; // must be set to NULL to create, can use a valid handle here to open in D3D9
pDevice9->CreateTexture(..., pTex2D_9, &sharedHandle);
...
pDevice10->OpenSharedResource(sharedHandle, __uuidof(ID3D10Resource), (void**)(&tempResource10));
tempResource10->QueryInterface(__uuidof(ID3D10Texture2D), (void**)(&pTex2D_10));
tempResource10->Release();
// now use pTex2D_10 with pDevice10


Textures being shared from D3D9 to D3D10 have the following restrictions.

    * Textures must be 2D
    * Only 1 mip level is allowed
    * Texture must have default usage
    * Texture must be write only
    * MSAA textures are not allowed
    * Bind flags must have SHADER_RESOURCE and RENDER_TARGET set
    * Only R10G10B10A2_UNORM, R16G16B16A16_FLOAT and R8G8B8A8_UNORM formats are allowed

Interesting post: http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/ I think VLC 1.1 is using that approach, and MPC Home Cinema too, it seems!

Final round of Tesla Compute Cluster driver testing:
*CUDA H264 GPU video encoding works through MediaCoder
*vReveal works (clean video, sharpness)
issues:
stabilization: uniform gray colors
contrast: I get a pink color
*Badaboom fails with:
.GPU 0: ATI Radeon HD 5800 Series
FATAL:There is no GPU device supporting CUDA.
(Although TCC CUDA is supported there)

Currently the global memory available is the value returned by CL_DEVICE_GLOBAL_MEM_SIZE in device query. Full physical memory is expected to be available in one of the upcoming releases.
Global buffer is 128bit aligned addresses, UAV's are byte aligned and on 5XXX series of cards you can have up to 9 UAV's per kernel. Also through UAV's you can do byte addressable writes with the UAV arena and also atomic operations. None of these can be done on the global buffer path.

It is easier to burst using global memory, as it is an implicit 128-bit write versus an implicit 32-bit write on UAV.


3 new tools!

Posted on 07:39 by Unknown

3 New GPU tools!

Swan: A simple tool for porting CUDA kernels to OpenCL

What is it?
Swan is a small tool that aids the reversible conversion of existing CUDA codebases to OpenCL. It does several useful things:
Translates CUDA kernel source-code to OpenCL.
Provides a common API that abstracts both CUDA and OpenCL runtimes.
Preserves the convenience of the CUDA <<< grid, block >>> kernel launch syntax by generating C source-code for kernel entry-point functions.

Why might you want it?
Possible uses include:
Evaluating OpenCL performance of an existing CUDA code.
Maintaining a dual-target OpenCL and CUDA code.
Reducing dependence on NVCC when compiling host code.
Support multiple CUDA compute capabilities in a single binary

Limitations
It's not a drop-in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls re-written.
Swan does not support a few things. In particular:
CUDA C++ templating in kernel code.
OpenCL Images/Samplers (analogous to Textures).
Multiple device management in a single process.
Compiling kernels for the CPU.
CUDA device-emulation mode.

Furthermore, it's a work in progress. It works for our code, but no promises it will for yours.

Cloo 0.6.2

A new version of Cloo is out.
It introduces a tracking mechanism for kernel arguments (sampler or memory objects) which prevents them from being claimed by the GC in case the user application doesn't refer to them in later code. This behaviour has been backported to the existing Set*Argument methods since it is safer. You can override auto-tracking using the newly added overloads.
A critical bug affecting image read operations together with some other minor glitches were fixed.
As for breaking changes rename any ComputeImage.PixelSize to ElementSize and you're good to go.

Clootils have been improved, too. Now, you can take advantage of some bells and whistles which control the program building behavior.

CAL++ v. 0.8 release
announcement

C++ to IL generator/compiler with C++ bindings for CAL
http://sourceforge.net/projects/calpp/

The CAL++ library has been just released. Project homepage is located here http://sourceforge.net/projects/calpp/ .

The project consist of two main components. One is C++ binding for CAL ( it's really much easier to develop new CAL applications using bindings ) and second is C++ to IL generator/compiler.

The C++ generator/compiler has syntax very similar to OpenCL ( with few necessary exceptions ). Also it supports all devices which can run CAL kernels ( finally OpenCL like language for 3xxx ).

It has some advantages over the OpenCL compiler. To name a few:

- it's much closer to CAL - it allows to write code which is almost as good ( or as good ) as handwritten IL. Look at the matrix multiplication example - it has almost the same ISA as prunedtree original code ( it differs only where I've added some changes ).

- Advantage of using C++. I really wouldn't like to use double-double ( or quad float ) technique without C++.

- Powerful control over loop unrolling and code selection ( at IL compilation time ). The C++ language acts like a preprocessor.

- It has LDS support for 4xxx, doubles, etc. And if something is missing it can be added really easy.

But as always there are some pitfalls to this approach

- it isn't OpenCL. Having a standard is always useful.

- Only partial support for structs ( it can be much improved but never as good as OpenCL ).

- CAL++ is much closer to IL and some more knowledge about IL is required to achieve its full potential ( hmmm I think this is also the case with OpenCL ).

- optimization is only performed by CAL IL compiler ( which isn't that good ).

With the library there are some examples included. I think the fastest matrix multiplication might be a small gem here .

I hope that CAL++ will be useful to someone.

It doesn't compile under Windows with MSVC 2008!
Use 0.8a for GCC 4.4!
Q&A:

1. Have you tested on Windows?

No. But with the exception to C++ compiler problems it should work ( there is nothing platform specific in the code ).

2. Also have you added 24 bit integer instructions? they are useful for getting thread id fast for example..

CAL++ converts code to IL, so 24-bit operations need to be available in CAL IL. Unfortunately, that isn't the case.

I'm wondering how hard it would be to also add GDS?

Using anything that isn't available in IL is really hard ( or close to impossible ).

When CAL supported ISA assembler compilation ( 3xxx family ) you could generate ISA ASM. I would call it really, really hard, as you need to be aware of many architecture limits ( and that information simply isn't available ).

But for 4xxx, 5xxx family to use ISA requires to write your own driver stack ( as CAL doesn't support asm any more ) - I think it's simply impossible at the moment.

"It cannot be compiled at the time as it depends on some CAL Vector/Matrix classes which aren't available for public use." Is this AMD NDA code or your own code?

It's my own code, but it's far from being ready. For the vector quantization example it can easily be replaced by Image2D with simple functions to fill the data.

Are you using any magic in it, or can I code some wrappers?..

The Matrix/Vector code is using a little bit of magic . Any vector/matrix expression ( like vec_a = 3*vec_b + vec_c + log(vec_d) ) is converted to proper kernel ( trick with using templates for delayed execution ) and executed on gpu. It saves a lot of time with writing custom kernels .
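
The "templates for delayed execution" trick mentioned above is usually called expression templates. The sketch below is a CPU-only guess at the general idea in plain C++, not the CAL++ author's code: every name in it is invented for illustration, and where this version runs a loop on assignment, a GPU library would presumably generate and launch one kernel instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

// CRTP base: lets operators accept "any expression" without virtual calls.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, float value = 0.0f) : data_(n, value) {}
    float operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Nothing is computed until here: assignment walks the whole
    // expression tree once per element, with no temporary vectors.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<float> data_;
};

// Nodes keep references to their operands, so an expression must be
// consumed in the same statement that builds it.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    float operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

template <typename R>
struct Scaled : Expr<Scaled<R>> {
    float k;
    const R& r;
    Scaled(float k_, const R& r_) : k(k_), r(r_) {}
    float operator[](std::size_t i) const { return k * r[i]; }
    std::size_t size() const { return r.size(); }
};

template <typename R>
struct Log : Expr<Log<R>> {
    const R& r;
    explicit Log(const R& r_) : r(r_) {}
    float operator[](std::size_t i) const { return std::log(r[i]); }
    std::size_t size() const { return r.size(); }
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

template <typename R>
Scaled<R> operator*(float k, const Expr<R>& r) {
    return Scaled<R>(k, r.self());
}

template <typename R>
Log<R> log(const Expr<R>& r) {
    return Log<R>(r.self());
}

}  // namespace sketch
```

With sketch::Vec objects a, b, c and d of equal size, the statement a = 3 * b + c + log(d); builds a small tree of Sum, Scaled and Log nodes and evaluates it in a single pass on assignment.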

From TODO:

1. Add UAVs support,logical operations and more double math functions and as_typen conversion

2. Add il_asm function ( usage example: il_asm("mov %1,%2", v1, v2); would generate "mov r1,r2" )

3. Add documentation and more examples

4. Easier to use local cal arrays, and more user friendly code for IL creation functions
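
TODO item 2 only gives one usage example, so the following is a guess at what the substitution could look like, in plain C++. expandIlTemplate is a hypothetical helper, not a CAL++ function: it replaces each 1-based %N placeholder with the N-th register name it is given.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Replaces every %N in tmpl (N is 1-based, decimal) with registers[N - 1].
// All other characters are copied through unchanged.
std::string expandIlTemplate(const std::string& tmpl,
                             const std::vector<std::string>& registers) {
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        bool placeholder =
            tmpl[i] == '%' && i + 1 < tmpl.size() &&
            std::isdigit(static_cast<unsigned char>(tmpl[i + 1])) != 0;
        if (!placeholder) {
            out += tmpl[i];
            continue;
        }
        // Parse the decimal index that follows the '%'.
        std::size_t index = 0;
        std::size_t j = i + 1;
        while (j < tmpl.size() &&
               std::isdigit(static_cast<unsigned char>(tmpl[j])) != 0) {
            index = index * 10 + static_cast<std::size_t>(tmpl[j] - '0');
            ++j;
        }
        assert(index >= 1 && index <= registers.size());
        out += registers[index - 1];
        i = j - 1;  // continue after the last digit
    }
    return out;
}
```

Called with "mov %1,%2" and the register names r1 and r2 it returns "mov r1,r2", matching the example in the TODO item; how the real il_asm would map variables like v1 and v2 to registers is not covered here.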


Ideas for porting algos to GPU:AVX SSE and MMX ports!

Posted on 07:09 by Unknown

Hi, this may seem crazy, but some research from 1996 can be useful for seeing which heavy, useful algos Intel thought
could get improved perf from SSE, MMX and AVX!

For AVX there is an AVX site containing a lot of posts:
some new ones from January show general CRC perf speedups using pclmulqdq on Westmere!
and there are also some AVX reports with numbers from Sandy Bridge silicon!
For SSE see:
http://www.datasheetarchive.com/datasheet-pdf/1070.html
especially Intel reports 802-833; here you can see
"Increasing the Accuracy of the Results from the Reciprocal and Reciprocal Square Root
Instructions using the Newton-Raphson Method"..
which in fact is referenced in the GPU Gems 3 n-body chapter
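
The Newton-Raphson refinement named in that title boils down to two one-line formulas. The sketch below is plain C++ with made-up function names; crudeEstimate is only a stand-in for a low-precision hardware estimate, built by perturbing the exact value by a chosen relative error.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for 1/a:  x1 = x0 * (2 - a * x0).
// The relative error of x1 is roughly the square of that of x0.
float refineReciprocal(float a, float x0) {
    return x0 * (2.0f - a * x0);
}

// One Newton-Raphson step for 1/sqrt(a):  y1 = y0 * (1.5 - 0.5 * a * y0 * y0).
float refineRsqrt(float a, float y0) {
    assert(a > 0.0f);
    return y0 * (1.5f - 0.5f * a * y0 * y0);
}

// Stand-in for a fast, low-precision estimate instruction (assumption:
// the estimate is the exact value off by a small relative error).
float crudeEstimate(float exact, float relativeError) {
    return exact * (1.0f + relativeError);
}
```

Starting from an estimate that is off by 0.1%, a single step of either formula brings the error down to around one part in a million, which is why one or two steps are usually enough after a fast estimate.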

MMX manuals here:
http://www.tommesani.com/IntelAppNotes.html
http://software.intel.com/en-us/articles/mmxt-technology-manuals-and-application-notes/


About ATI and Nvidia drivers (OCL included)!

Posted on 07:08 by Unknown

Hi
I have been investigating the AMD and Nvidia drivers..

For 10.3 there are 3D hooks to support 120 Hz monitors, but is it enabled for D3D9, D3D10 or D3D11? Or all of them? What's the API? And is it fullscreen only, like 3D Vision's D3D support, or is windowed supported too? 3D Vision windowed is coming soon..

I have also checked 10.3 and it has bugs in gDEBugger 5.5 (8.68: no perf counters found), GDI (still slow on Aero in W7) and Heaven OGL (half-screen issues as in early drivers, though I think hotfix 9.12 worked fine..)

NVAPI should be coming with the 200 series, with GPU usage APIs?

First, for AMD you can use some components without installing the complete driver, for example
the OpenGL drivers and the ATI CAL drivers I think..
Note this is a no go on Nvidia, where every component has to be from the same version..
Well, I don't remember whether the same goes for OpenCL, since the Tesla computing driver doesn't include nvcompiler.dll,
but I have found you can enable OpenCL with the Tesla computing driver: just use the opencl.dll from AMD SDK 2.01. The reason it didn't work before is that I had an opencl.dll from an older Nvidia driver; installing the TCC driver over it didn't remove it, and the AMD installer doesn't overwrite it (but it should!)..
To fix it, delete opencl.dll and reinstall, or:
go to C:\ATI\Support\streamsdk_2-0-1_win764\Packages\Apps and, in the dev dir, drop the msi onto
http://dl.dropbox.com/u/1416327/extractmsi.bat and search for opencl.dll in temp3..
Now I have learned some things: OpenCL works with TCC, and it is clever enough to disable cl_khr_sharing (or is it some weird issue that just looks super intelligent?). It also introduces D3D9 interop and reports the ICD extension, and the extension list includes some unroll extensions that I don't know were in the first 195 drivers.. Also
note D3D9 interop won't work in TCC, so if it were really intelligent it would be disabled on TCC too..
Finally, I have found that the AMD AES sample now works, as do some other demos that didn't work at first (Mandelbrot I think).. I don't know if it's due to 2.01 source improvements, Nvidia improvements or both!
Note Nvidia OpenCL is super good now, with some weird issues fixed (AMD AES sample, volume 3D demo fast on W7.. still to check: are functions with no parameters, as in HelloCL, and the Apple FFT (and ocean) demos working?). It is also ICD compliant (now AMD + Nvidia work together with the 2.01 opencl.dll) and has D3D9 interop (I have not checked if it works). I think almost all of these are 196 OCL improvements.. What's left for OCL now is D3D10 and half support, and for Fermi 3d_image_writes, D3D11 interop and better perf?
Also, does OCL D3D interop enable GL-DX interop?
I have also tested whether CUVID works, and I can't get the OGL CUVID example working. I selected preferCUDA (not the default preferVP (video processor), as that returns errors on init, same as preferDXVA (but that one didn't work with normal drivers either)); with that I had to change cuGLCtxCreate to cuCtxCreate and remove all cuGL functions.. It should work.. but it returns an error in handledecodepicture with context_invalid.. Theoretically it should work if only CUDA cores are used..
What about the Win7 MFT and CUVENC? I think these should work, as they are CUDA kernels..

Overclocking and clock reporting don't work with TCC, but with Nvidia CUDA you can at least get the stream processor clocks.. anyway, fan speed works..
Lastly, looking at the ICD spec, I think all that a non-Khronos member needs in order to implement it is the ordered list of OCL functions in the dispatch struct..

Also, the 2D, D3D and OGL driver versions reported in CCC come from these files:
2d->atikmdag.sys 2d version 8.01.01.1010
3d->atiumdag.dll atiumd64 d3d 735
ogl->atiogl.. 9606

All files found in 10.3 beta ("new" marks post-9.12 files, i.e. ones not found in 9.12; almost surely all of them are included due to the CrossFire restructuring):
ati2edxx.dl_-> ati external device utility syswow64
ati2erec.dl_

AtiEDUGetThermalApiVersion
AtiEDUEnumApiSupportedDevices
AtiEDUEnumSupportedExternalDevices
AtiEDUGetExtDeviceInfo
AtiEDUOpenAdapterHandle
AtiEDUCloseAdapterHandle
AtiEDUInitializeThermal
AtiEDUSetThermalRemoteTemperatureOffset
AtiEDUSetThermalRemoteTemperatureHighSetPoint
AtiEDUSetThermalRemoteTemperatureLowSetPoint
AtiEDUSetThermalRemoteTemperatureCriticalSetPoint
AtiEDUGetThermalRemoteTemperatureOffset
AtiEDUGetThermalRemoteTemperatureHighSetPoint
AtiEDUGetThermalRemoteTemperatureLowSetPoint
AtiEDUGetThermalRemoteTemperatureCriticalSetPoint
AtiEDUGetThermalRemoteTemperature
AtiEDUThermalEnableInterrupt
AtiEDUThermalDisableInterrupt
AtiEDUGetAdapterTemperatureOffset
AtiEDUGetThermalRemoteTemperatureFP

ati adl sdk
atiadlxx.dl_ system32
atiadlxy.dl_ syswow64
ADL_Workstation_Stereo_Get
ADL_Workstation_Stereo_Set

ADL_Workstation_AdapterNumOfGLSyncConnectors_Get
ADL_Workstation_Caps
ADL_Workstation_DisplayGLSyncMode_Get
ADL_Workstation_DisplayGLSyncMode_Set
ADL_Workstation_DisplayGenlockCapable_Get
ADL_Workstation_GLSyncCounters_Get
ADL_Workstation_GLSyncGenlockConfiguration_Get
ADL_Workstation_GLSyncGenlockConfiguration_Set
ADL_Workstation_GLSyncModuleDetect_Get
ADL_Workstation_GLSyncModuleInfo_Get
ADL_Workstation_GLSyncPortState_Get
ADL_Workstation_GLSyncPortState_Set
ADL_Workstation_LoadBalancing_Caps
ADL_Workstation_LoadBalancing_Get
ADL_Workstation_LoadBalancing_Set
ADL_Workstation_Stereo_Get
ADL_Workstation_Stereo_Set

application profiles
atiapfxx.blb system32
atiapfxx.ex_ system32

unknown (old)
atibtmon.ex_ system32 ati brightness monitor

ati cal
aticalcl.dl_
aticalcl64.dl_
aticaldd.dl_ OK
aticaldd64.dl_
aticalrt.dl_
aticalrt64.dl_

crossfire (new)
aticfx32.dl_ ati radeon d3d11 driver syswow64
aticfx64.dl_ ati radeon d3d11 driver system32

old
atidemgx.dll graphics demsystem32 (catalyst control center)
atidxx32.dl_ d3d 11 driver syswow64
atidxx64.dl_ d3d 11 driver system32
atieclxx.ex_ AMD external events client module (ccc) system32
atiedu64.dl_ ati external device utility system32 ati2edxx.dl
atiesrxx.ex_ AMD external events client module (ccc) system32

unknown (new)
atig6pxx.dl_ powerxpress vista ogl (thunk) driver system32
atig6txx.dl_ powerxpress vista ogl driver syswow64
atigktxx.dl_ powerxpress vista ogl (thunk) driver syswow64
atiglpxx.dl_ powerxpress vista ogl driver system32

old
atiicdxx.da_
atikmdag.sy_
atikmpag.sy_
atimpc32.dl_ radeon pcom universal driver syswow64
atimpc64.dl_ radeon pcom universal driver sys32
atimuixx.dl_ multilanguage dppe dll
atio6axx.dl_ ati opengl driver system32
atiodcli.ex_ unknown
atiode.ex_ unknown

ogl driver syswow64
atiogl.xml
atioglxx.dl_

new
atipblag.dat contains a list (3DMark06*.exe 3DMark2001.exe 3DMark2001SE.exe 3DMark03.exe 3DMark05.exe ..)

atipdl64.dl_ --
atipdlxx.dl_ ati desktop cwddedi syswow64 old adl lib
atitmm64.dl_ tmm clone control module

new
atitmp64.dl_
atiu9p64.dl_ -
atiu9pag.dl_ powerxpress vista user mode driver (d3d9?) syswow64

old
atiumd64.dl_ radeon directx universal driver system32
atiumd6a.ca_ dat64
atiumd6a.dl_ video acceleration universal driver
atiumdag.dl_ radeon directx universal driver syswow64
atiumdva.ca_ dat32
atiumdva.dl_ video acceleration universal driver syswow64

new
atiuxp64.dl_ -
atiuxpag.dl_ powerxpress vista user mode driver (d3d10?) syswow64

I see amdpcom32 radeon pcom universal driver syswow64

ati2erec.dl_ atitmp64.dl_
atikmdag.sys ati radeon kernel model driver
atipmdag.sys ati radeon kernel model driver
atikmpag.sys mini port driver

Also, some functions of the GL_EXT_direct_state_access extension are found (is this useful?)
It is a good extension, so it's good to know..
It seems GL_ARB_compatibility is not found..

GL_EXT_direct_state_access: MISSING
---------------------------
glBindMultiTextureEXT: OK
glCheckNamedFramebufferStatusEXT: OK
glClientAttribDefaultEXT: OK
glCompressedMultiTexImage1DEXT: OK
glCompressedMultiTexImage2DEXT: OK
glCompressedMultiTexImage3DEXT: OK
glCompressedMultiTexSubImage1DEXT: OK
glCompressedMultiTexSubImage2DEXT: OK
glCompressedMultiTexSubImage3DEXT: OK
glCompressedTextureImage1DEXT: OK
glCompressedTextureImage2DEXT: OK
glCompressedTextureImage3DEXT: OK
glCompressedTextureSubImage1DEXT: OK
glCompressedTextureSubImage2DEXT: OK
glCompressedTextureSubImage3DEXT: OK
glCopyMultiTexImage1DEXT: OK
glCopyMultiTexImage2DEXT: OK
glCopyMultiTexSubImage1DEXT: OK
glCopyMultiTexSubImage2DEXT: OK
glCopyMultiTexSubImage3DEXT: OK
glCopyTextureImage1DEXT: OK
glCopyTextureImage2DEXT: OK
glCopyTextureSubImage1DEXT: OK
glCopyTextureSubImage2DEXT: OK
glCopyTextureSubImage3DEXT: OK
glDisableClientStateIndexedEXT: OK
glDisableClientStateiEXT: MISSING
glDisableVertexArrayAttribEXT: MISSING
glDisableVertexArrayEXT: MISSING
glEnableClientStateIndexedEXT: OK
glEnableClientStateiEXT: MISSING
glEnableVertexArrayAttribEXT: MISSING
glEnableVertexArrayEXT: MISSING
glFlushMappedNamedBufferRangeEXT: MISSING
glFramebufferDrawBufferEXT: OK
glFramebufferDrawBuffersEXT: OK
glFramebufferReadBufferEXT: OK
glGenerateMultiTexMipmapEXT: OK
glGenerateTextureMipmapEXT: OK
glGetCompressedMultiTexImageEXT: OK
glGetCompressedTextureImageEXT: OK
glGetDoubleIndexedvEXT: OK
glGetDoublei_vEXT: MISSING
glGetFloatIndexedvEXT: OK
glGetFloati_vEXT: MISSING
glGetFramebufferParameterivEXT: OK
glGetMultiTexEnvfvEXT: OK
glGetMultiTexEnvivEXT: OK
glGetMultiTexGendvEXT: OK
glGetMultiTexGenfvEXT: OK
glGetMultiTexGenivEXT: OK
glGetMultiTexImageEXT: OK
glGetMultiTexLevelParameterfvEXT: OK
glGetMultiTexLevelParameterivEXT: OK
glGetMultiTexParameterIivEXT: OK
glGetMultiTexParameterIuivEXT: OK
glGetMultiTexParameterfvEXT: OK
glGetMultiTexParameterivEXT: OK
glGetNamedBufferParameterivEXT: OK
glGetNamedBufferPointervEXT: OK
glGetNamedBufferSubDataEXT: OK
glGetNamedFramebufferAttachmentParameterivEXT: OK
glGetNamedProgramLocalParameterIivEXT: MISSING
glGetNamedProgramLocalParameterIuivEXT: MISSING
glGetNamedProgramLocalParameterdvEXT: OK
glGetNamedProgramLocalParameterfvEXT: OK
glGetNamedProgramStringEXT: OK
glGetNamedProgramivEXT: OK
glGetNamedRenderbufferParameterivEXT: OK
glGetPointerIndexedvEXT: OK
glGetPointeri_vEXT: MISSING
glGetTextureImageEXT: OK
glGetTextureLevelParameterfvEXT: OK
glGetTextureLevelParameterivEXT: OK
glGetTextureParameterIivEXT: OK
glGetTextureParameterIuivEXT: OK
glGetTextureParameterfvEXT: OK
glGetTextureParameterivEXT: OK
glGetVertexArrayIntegeri_vEXT: MISSING
glGetVertexArrayIntegervEXT: MISSING
glGetVertexArrayPointeri_vEXT: MISSING
glGetVertexArrayPointervEXT: MISSING
glMapNamedBufferEXT: OK
glMapNamedBufferRangeEXT: MISSING
glMatrixFrustumEXT: OK
glMatrixLoadIdentityEXT: OK
glMatrixLoadTransposedEXT: OK
glMatrixLoadTransposefEXT: OK
glMatrixLoaddEXT: OK
glMatrixLoadfEXT: OK
glMatrixMultTransposedEXT: OK
glMatrixMultTransposefEXT: OK
glMatrixMultdEXT: OK
glMatrixMultfEXT: OK
glMatrixOrthoEXT: OK
glMatrixPopEXT: OK
glMatrixPushEXT: OK
glMatrixRotatedEXT: OK
glMatrixRotatefEXT: OK
glMatrixScaledEXT: OK
glMatrixScalefEXT: OK
glMatrixTranslatedEXT: OK
glMatrixTranslatefEXT: OK
glMultiTexBufferEXT: OK
glMultiTexCoordPointerEXT: OK
glMultiTexEnvfEXT: OK
glMultiTexEnvfvEXT: OK
glMultiTexEnviEXT: OK
glMultiTexEnvivEXT: OK
glMultiTexGendEXT: OK
glMultiTexGendvEXT: OK
glMultiTexGenfEXT: OK
glMultiTexGenfvEXT: OK
glMultiTexGeniEXT: OK
glMultiTexGenivEXT: OK
glMultiTexImage1DEXT: OK
glMultiTexImage2DEXT: OK
glMultiTexImage3DEXT: OK
glMultiTexParameterIivEXT: OK
glMultiTexParameterIuivEXT: OK
glMultiTexParameterfEXT: OK
glMultiTexParameterfvEXT: OK
glMultiTexParameteriEXT: OK
glMultiTexParameterivEXT: OK
glMultiTexRenderbufferEXT: OK
glMultiTexSubImage1DEXT: OK
glMultiTexSubImage2DEXT: OK
glMultiTexSubImage3DEXT: OK
glNamedBufferDataEXT: OK
glNamedBufferSubDataEXT: OK
glNamedCopyBufferSubDataEXT: MISSING
glNamedFramebufferRenderbufferEXT: OK
glNamedFramebufferTexture1DEXT: OK
glNamedFramebufferTexture2DEXT: OK
glNamedFramebufferTexture3DEXT: OK
glNamedFramebufferTextureEXT: OK
glNamedFramebufferTextureFaceEXT: OK
glNamedFramebufferTextureLayerEXT: OK
glNamedProgramLocalParameter4dEXT: OK
glNamedProgramLocalParameter4dvEXT: OK
glNamedProgramLocalParameter4fEXT: OK
glNamedProgramLocalParameter4fvEXT: OK
glNamedProgramLocalParameterI4iEXT: MISSING
glNamedProgramLocalParameterI4ivEXT: MISSING
glNamedProgramLocalParameterI4uiEXT: MISSING
glNamedProgramLocalParameterI4uivEXT: MISSING
glNamedProgramLocalParameters4fvEXT: OK
glNamedProgramLocalParametersI4ivEXT: MISSING
glNamedProgramLocalParametersI4uivEXT: MISSING
glNamedProgramStringEXT: OK
glNamedRenderbufferStorageEXT: OK
glNamedRenderbufferStorageMultisampleCoverageEXT: MISSING
glNamedRenderbufferStorageMultisampleEXT: OK
glProgramUniform1fEXT: OK
glProgramUniform1fvEXT: OK
glProgramUniform1iEXT: OK
glProgramUniform1ivEXT: OK
glProgramUniform1uiEXT: OK
glProgramUniform1uivEXT: OK
glProgramUniform2fEXT: OK
glProgramUniform2fvEXT: OK
glProgramUniform2iEXT: OK
glProgramUniform2ivEXT: OK
glProgramUniform2uiEXT: OK
glProgramUniform2uivEXT: OK
glProgramUniform3fEXT: OK
glProgramUniform3fvEXT: OK
glProgramUniform3iEXT: OK
glProgramUniform3ivEXT: OK
glProgramUniform3uiEXT: OK
glProgramUniform3uivEXT: OK
glProgramUniform4fEXT: OK
glProgramUniform4fvEXT: OK
glProgramUniform4iEXT: OK
glProgramUniform4ivEXT: OK
glProgramUniform4uiEXT: OK
glProgramUniform4uivEXT: OK
glProgramUniformMatrix2fvEXT: OK
glProgramUniformMatrix2x3fvEXT: OK
glProgramUniformMatrix2x4fvEXT: OK
glProgramUniformMatrix3fvEXT: OK
glProgramUniformMatrix3x2fvEXT: OK
glProgramUniformMatrix3x4fvEXT: OK
glProgramUniformMatrix4fvEXT: OK
glProgramUniformMatrix4x2fvEXT: OK
glProgramUniformMatrix4x3fvEXT: OK
glPushClientAttribDefaultEXT: OK
glTextureBufferEXT: OK
glTextureImage1DEXT: OK
glTextureImage2DEXT: OK
glTextureImage3DEXT: OK
glTextureParameterIivEXT: OK
glTextureParameterIuivEXT: OK
glTextureParameterfEXT: OK
glTextureParameterfvEXT: OK
glTextureParameteriEXT: OK
glTextureParameterivEXT: OK
glTextureRenderbufferEXT: OK
glTextureSubImage1DEXT: OK
glTextureSubImage2DEXT: OK
glTextureSubImage3DEXT: OK
glUnmapNamedBufferEXT: OK
glVertexArrayColorOffsetEXT: MISSING
glVertexArrayEdgeFlagOffsetEXT: MISSING
glVertexArrayFogCoordOffsetEXT: MISSING
glVertexArrayIndexOffsetEXT: MISSING
glVertexArrayMultiTexCoordOffsetEXT: MISSING
glVertexArrayNormalOffsetEXT: MISSING
glVertexArraySecondaryColorOffsetEXT: MISSING
glVertexArrayTexCoordOffsetEXT: MISSING
glVertexArrayVertexAttribIOffsetEXT: MISSING
glVertexArrayVertexAttribOffsetEXT: MISSING
glVertexArrayVertexOffsetEXT: MISSING
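
A listing this long is easier to act on once it is summarized. The following is only a minimal sketch, assuming the `name: OK` / `name: MISSING` line format shown above; `DsaSummary` and `summarize` are hypothetical names for illustration and are not part of GLEW.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical summary of a glewinfo-style listing ("glFoo: OK" / "glFoo: MISSING").
struct DsaSummary {
    int ok = 0;                        // entry points reported as OK
    std::vector<std::string> missing;  // entry points reported as MISSING
};

// Parses the listing line by line. Lines that do not start with "gl"
// (the extension header, separators, blank lines) are ignored.
DsaSummary summarize(const std::string& listing) {
    DsaSummary s;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type sep = line.find(": ");
        if (sep == std::string::npos || line.compare(0, 2, "gl") != 0) continue;
        const std::string name = line.substr(0, sep);
        std::string status = line.substr(sep + 2);
        // Drop trailing spaces or a carriage return left by Windows line endings.
        while (!status.empty() && (status.back() == ' ' || status.back() == '\r'))
            status.pop_back();
        if (status == "OK") ++s.ok;
        else if (status == "MISSING") s.missing.push_back(name);
    }
    return s;
}
```

Feeding the whole listing to `summarize` gives the number of exported functions and the names of the absent ones, which is what you would need to decide whether a given code path can rely on the extension.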
I don't know whether I already posted this, but a helper for printing OpenCL error names is in oclutils.cpp in the Nvidia SDK:
// Helper function to get error string
// *********************************************************************
const char* oclErrorString(cl_int error)
{
   static char errorString[][64] = {
   "CL_SUCCESS",
   "CL_DEVICE_NOT_FOUND",
   "CL_DEVICE_NOT_AVAILABLE",
   "CL_COMPILER_NOT_AVAILABLE",
   "CL_MEM_OBJECT_ALLOCATION_FAILURE",
   "CL_OUT_OF_RESOURCES",
   "CL_OUT_OF_HOST_MEMORY",
   "CL_PROFILING_INFO_NOT_AVAILABLE",
   "CL_MEM_COPY_OVERLAP",
   "CL_IMAGE_FORMAT_MISMATCH",
   "CL_IMAGE_FORMAT_NOT_SUPPORTED",
   "CL_BUILD_PROGRAM_FAILURE",
   "CL_MAP_FAILURE",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "",
   "CL_INVALID_VALUE",
   "CL_INVALID_DEVICE_TYPE",
   "CL_INVALID_PLATFORM",
   "CL_INVALID_DEVICE",
   "CL_INVALID_CONTEXT",
   "CL_INVALID_QUEUE_PROPERTIES",
   "CL_INVALID_COMMAND_QUEUE",
   "CL_INVALID_HOST_PTR",
   "CL_INVALID_MEM_OBJECT",
   "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR",
   "CL_INVALID_IMAGE_SIZE",
   "CL_INVALID_SAMPLER",
   "CL_INVALID_BINARY",
   "CL_INVALID_BUILD_OPTIONS",
   "CL_INVALID_PROGRAM",
   "CL_INVALID_PROGRAM_EXECUTABLE",
   "CL_INVALID_KERNEL_NAME",
   "CL_INVALID_KERNEL_DEFINITION",
   "CL_INVALID_KERNEL",
   "CL_INVALID_ARG_INDEX",
   "CL_INVALID_ARG_VALUE",
   "CL_INVALID_ARG_SIZE",
   "CL_INVALID_KERNEL_ARGS",
   "CL_INVALID_WORK_DIMENSION",
   "CL_INVALID_WORK_GROUP_SIZE",
   "CL_INVALID_WORK_ITEM_SIZE",
   "CL_INVALID_GLOBAL_OFFSET",
   "CL_INVALID_EVENT_WAIT_LIST",
   "CL_INVALID_EVENT",
   "CL_INVALID_OPERATION",
   "CL_INVALID_GL_OBJECT",
   "CL_INVALID_BUFFER_SIZE",
   "CL_INVALID_MIP_LEVEL",
   "CL_INVALID_GLOBAL_WORK_SIZE",
   };
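   // Guard against codes outside the table (positive or not yet listed).
   const int index = -error;
   const int count = (int)(sizeof(errorString) / sizeof(errorString[0]));
   return (index >= 0 && index < count) ? errorString[index] : "Unspecified Error";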
}
Also, OpenGL quad-buffer (QBF) stereo is not enabled, judging by glewinfo.
The OGL driver depends on ADL, but it does not seem to use the workstation set-stereo, get-stereo or caps functions.
I also tried:

set OGL_FORCE_ASIC_ID=37956
set OGL_FORCE_ASIC_ID=68BE
set OGL_FORCE_ASIC_ID=0x68BE
set OGL_FORCE_ASIC_ID=26814

not seen
set OGL_FORCE_ASIC_ID=9444
"ATI FirePro V8750 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_9444
mine is
"ATI Radeon HD 5800 Series " = ati2mtag_Evergreen, PCI\VEN_1002&DEV_6899
set OGL_ENABLE_FORCE_ASIC_ID=1
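
The values tried above are the same PCI device IDs written in different bases: 37956 is 0x9444 and 26814 is 0x68BE. A small sketch for converting an INF hardware ID into the decimal form follows. `deviceIdFromHardwareId` is a hypothetical helper, and which base the driver actually expects is not known from these experiments.

```cpp
#include <cstdlib>
#include <string>

// Extracts the device ID from an INF hardware ID such as "PCI\\VEN_1002&DEV_9444".
// Returns -1 when no "DEV_" field followed by four hex digits is present.
long deviceIdFromHardwareId(const std::string& hwid) {
    const std::string key = "DEV_";
    const std::string::size_type pos = hwid.find(key);
    if (pos == std::string::npos || pos + key.size() + 4 > hwid.size()) return -1;
    const std::string hex = hwid.substr(pos + key.size(), 4);
    char* end = nullptr;
    const long id = std::strtol(hex.c_str(), &end, 16);
    // Accept the value only if all four characters were consumed as hex digits.
    return (end == hex.c_str() + 4) ? id : -1;
}
```

For the FirePro V8750 entry this yields 37956, and for the HD 5800 entry (DEV_6899) it yields 26777.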

Tested with glewinfo (for the GPU name) and visualinfo.exe.

Installing the 8.68.3 FirePro driver gives 30-bit support and stereo.
Seen in a diff of its installation files against Catalyst:

HKR,, DisableOGL10BitPixelFormats, %REG_DWORD%, 0
HKR,, Gxo30BppPanels, %REG_BINARY%, 15,C3,76,17,15,C3,78,17

Also in the installation diff versus Catalyst:

HKR,, DALNonStandardModesBCD1, %REG_BINARY%,12,80,07,68,00,00,00,00,12,80,09,60,00,00,00,00,16,00,12,00,00,00,00,70,17,92,13,44,00,00,00,00,18,00,14,40,00,00,00,00,18,56,13,92,00,00,00,00

HKR,, DALRULE_AllowNativeModeAsDefaultModes, %REG_DWORD%, 1

GCORULE_ExtTMDSReduceBlankTiming, %REG_DWORD%, 1

"ATI FireGL V3600" = ati2mtag_RV630GL, PCI\VEN_1002&DEV_958D

"ATI FireGL V5600" = ati2mtag_RV630GL, PCI\VEN_1002&DEV_958C

"ATI FireGL V7600" = ati2mtag_R600GL, PCI\VEN_1002&DEV_940F

"ATI FireGL V7700" = ati2mtag_RV630GL, PCI\VEN_1002&DEV_9511

"ATI FireGL V8600" = ati2mtag_R600GL, PCI\VEN_1002&DEV_940B

"ATI FireGL V8650" = ati2mtag_R600GL, PCI\VEN_1002&DEV_940A

"ATI FirePro 2260" = ati2mtag_RV610, PCI\VEN_1002&DEV_95CF

"ATI FirePro 2260 " = ati2mtag_RV610, PCI\VEN_1002&DEV_95CE

"ATI FirePro 2450" = ati2mtag_RV610, PCI\VEN_1002&DEV_95CD

"ATI FirePro V3700 (FireGL)" = ati2mtag_RV620GL, PCI\VEN_1002&DEV_95CC

"ATI FirePro V3750 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_949F

"ATI FirePro V5700 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_949E

"ATI FirePro V7750 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_949C

"ATI FirePro V8700 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_9456

"ATI FirePro V8750 (FireGL)" = ati2mtag_R7XGL, PCI\VEN_1002&DEV_9444

[ati2mtag_R6xGL_SoftwareDeviceSettings]

HKR,, OGL_Specific_NA, %REG_SZ%, 1

HKR,, CatalystAI_NA, %REG_SZ%, 1

HKR,, APISpecific_NA, %REG_SZ%, 1

HKR,, TemporalAAMultiplier_NA, %REG_SZ%, 1

HKR,, Main3D_NA, %REG_SZ%, 1

HKR,, VPURecover_NA, %REG_SZ%, 1

HKR,, SmartGart_NA, %REG_SZ%, 1


Shaders: measuring perf, source translation and parsing different languages!

Posted on 07:07 by Unknown

Hi,
I hope this is a fairly exhaustive list of options for parsing and translating between graphics and compute shaders (some of them open source).

For DX shaders:
*GPU Shader Analyzer (AMD only): gets DX IL, AMD IL and GPU assembly from DirectCompute or graphics shaders. It is now DX11 compatible, supports the 5xxx series, and handles compute, domain and hull shaders.
*fxc and the D3DCompile API and lib: get DX IL and bytecode from DX shaders (multivendor).
Note: I don't know how to build from DX IL, since neither D3DCompile nor CreateComputeShader accepts it, and GPU SA won't take it either.
Is there an fxc option or an API for going from DX IL to DX bytecode, so that I can optimize the DX IL, compile it to DX bytecode and feed that to a compute shader?
At least if you have the source you can see the AMD IL, DX IL and R800 assembly.

Theoretically you can get AMD IL from compute shaders using GPU SA and feed it into OpenCL once OpenCL supports getting binaries and building from them (or, for now, by intercepting llc or something like that), so in OpenCL you will soon be able to modify the generated assembly (on Nvidia you already can).
That way you can compare the quality of the generated code.
You can also feed a DX shader to GPU SA and an equivalent OCL kernel to SKA and compare the AMD IL, the assembly and even all the ALU/tex, kernels/s, etc. info.
I have tested a simple vectoradd and the quality is the same (kernels/s), although the AMD IL from OCL seems much longer.
Parsing HLSL: you have the Nvidia Cg compiler source, and Cg is 99% HLSL, so you have parser and front-end compiler code (I think it uses some flex/bison).

There was an AMD HLSL, an extension of HLSL with scatter, doubles, etc.:
http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01b.pdf
http://coachk.cs.ucf.edu/courses/CDA6938/s08/AMD_IL.pdf

All of that functionality is now included in compute shader 5.0 and pixel shader 5.0 in DX11, and I presume also in the upcoming GLSL ext_gpu_shader5 (I can't find the AMD HLSL compiler anywhere, so I think the effort migrated to Brook+ and AMD IL).

Note that for Nvidia there is a tool similar to GPU SA, but I think it is paid (ShaderPerf; I think PerfKit can't do this).
I have to see whether Nexus will show PTX code from shaders or anything like that.
Also, DX11 support is naturally still missing in all the tools (PerfKit, ShaderPerf, etc.).

HLSL<->GLSL source to source translation:
hlsl2glsl-v0.9 (OpenGL ES also source code)
babelshader

Only pixel and vertex shaders are handled.

GLSL
GPU Shader Analyzer (gets AMD IL from DirectCompute or graphics shaders).
You can translate HLSL->GLSL and use GPU SKA to compare the quality of the AMD IL or assembly generated from the GLSL against that generated from the HLSL.

Parsing GLSL:
you can write the flex and bison files almost directly from the spec (tokens and grammar).

3Dlabs GLSL validator and front-end compiler, open source
(I can't find them).

hlsl2glsl-v0.9..

Brook+ is open source.
It has AMD IL code generation source and Brook parsing (ctool based).

Another topic is measuring the performance not of shaders but of the whole application, with gDEBugger for OpenGL, the Mac OpenGL perf libs, AMD GPU 2.1, Nvidia Nexus, and the GL perf API and libs (PerfKit SDK).
For a CAL library similar to OpenCL, see CAL++.
For porting CUDA to OpenCL there are two guides, from Nvidia and AMD, and:
Experiences porting from CUDA to OpenCL
Presentation at the Daresbury Machine Evaluation Workshop, 2009
There is also a tool:
Swan: A simple tool for porting CUDA kernels to OpenCL
An OpenCL to DirectCompute driver would be good!


Friday, 19 February 2010

Enabling OpenCL Image support on AMD GPUs!

Posted on 12:20 by Unknown

Well, I have been keeping this trick in my head for over a month now.
More info coming soon on my blog: oscarbg.blogspot.com
You really can enable image support. Set:
set GPU_IMAGES_SUPPORT=1
or export GPU_IMAGES_SUPPORT=1 on Linux.
Tested on a 5870 with AMD Stream 2.0 and the 9.12 hotfix; it only works for 2D images.
Similarly, you can enable byte-addressable store support with GPU_BYTE_ADDRESSABLE_STORE (it does not seem to use raw UAVs, though), and some Nvidia samples (histogram64) then work.
Doubles extension reporting is enabled with
GPU_DOUBLE_PRECISION
and GL sharing with
CL_KHR_GL_SHARING
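
Instead of setting the variable in the shell, it can also be set from inside the program. This is only a sketch under one assumption that these tests do not confirm: that the AMD runtime reads the variable when OpenCL is first initialized in the process, so the call has to come before the first OpenCL call. `setEnvVar` and `enableAmdImageSupport` are hypothetical helper names.

```cpp
#include <cstdlib>

// Sets an environment variable for the current process; returns true on success.
// Uses _putenv_s on Windows and POSIX setenv elsewhere.
bool setEnvVar(const char* name, const char* value) {
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, 1) == 0;  // 1 = overwrite an existing value
#endif
}

// Call before the first OpenCL call (assumption: the runtime reads the
// variable once, at start-up).
bool enableAmdImageSupport() {
    return setEnvVar("GPU_IMAGES_SUPPORT", "1");
}
```

The same wrapper would work for the other variables listed above, with the same caveat about when the runtime reads them.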


Running QT everywhere!

Posted on 12:13 by Unknown

TODO: post links for everything.
I have just found a lot of platforms running Qt!
The latest Qt, 4.6.2, ships binaries for win32, Mac (32, 64) and Linux (32, 64)!
You can build for win64 yourself, but it takes long; Qt 4.6 win64 binaries are available on Google Code as of today!
If you use VS, install the latest Qt VS add-in, 1.1.4;
you can also build with Qt Creator 1.3.1.
For mobiles:
you have Symbian and Maemo, and now MeeGo (Moblin + Maemo).
I have also found a Tegra 2 board running Qt in a Qt blog post! (Android? Windows CE? Linux?)
There is also a Google NaCl port (for the Chrome browser, or IE via Chrome Frame) on the Qt Labs blog!
A port to the Amazon Kindle is online too!
And at MWC it has been shown working with the remaining mobile GPUs:
*OMAP4 (SGX 540)
*ST U8500 (Mali GPU)
which, together with
Tegra 2 (Nvidia GPU),
shows it is everywhere.
For Android there is a Qt port too:
you need a custom NDK with an STL port included if you want it.

Why would I choose the Qt GUI and not the Android one?
1. The speed: Qt is more powerful and much faster.
2. The features, just look at http://doc.trolltech.com/4.6/qtgui.html.
3. Declarative UI.
4. The API is very robust and stable.
5. IMHO Qt is written in a superior language. I don't like Java :P. I
think if you ask Java about me it will give you the same answer :P.
(OK, here I'm joking.)
6. etc.

The only ones left are the iPod and iPad, but that is in progress:
http://www.qt-iphone.com/Roadmap.html
Currently QtCore is mostly done; QtGui is hard because Cocoa Touch != Cocoa.
It would also be good to have all of Qt's multitouch support and the just-announced Mobility APIs, such as the Location, Sensors and Camera APIs,
but this is easier said than done.
Then I could program with Qt for everything.


Parallel algorithms available on CUDA, OCL, DC, CAL: status update

Posted on 08:37 by Unknown

Linear algebra status update:
Matmul:
CUDA: CUBLAS (no code), Volkov (code) and yesterday's post (assembly, fastest to date at 480 GFLOPS)
CAL: Beyond3D CAL 1 TFLOP matmul post
OCL: the hazeman post above uses a port of the CAL code to a proprietary language similar to CL
DC: bernaclejunior's test with doubles didn't work (XNA forums)
Matvec:
CUDA: CUBLAS (closed); some papers use custom code (MAGMA, paper from mid 2008) that is 20-50% faster
OCL: Bealto post above (highly efficient on AMD and Nvidia); should be easy to port to DC
Sparse matvec:
CUDA: CNC, CUSP, etc.
OCL, DC: BernacleJunior posts on the AMD and XNA forums (he is working on it).

FFT:
CUDA: CUFFT; 2 papers at SC08 with higher-performance 3D FFTs, and a 2D paper ->d3dCx
DC: has lib
OCL:
Apple's code seems 2x-3x slower than CUFFT (on Nvidia Linux; 10.6.2 is also slow, will have to see 10.6.3).
On AMD it doesn't work for sizes >512^2 in 2.0 or 2.01; it seems to be fixed internally.
The AMD 2.01 sample is hard-coded to 1024 (perf?)

Sort:
CUDA: CUDPP, CUDA sample (code)
OCL, DC: BernacleJunior posts on the AMD and XNA forums. He claims nearly 400 Mkeys/s, versus under 200 Mkeys/s for state-of-the-art Nvidia sorting on a GTX285.
Lee Howes reportedly also has fast code working!

CUDPP also has tridiagonal solvers, and soon graph algorithms and hashes.


More news!

Posted on 08:08 by Unknown

Some news I had left out, plus some new items:
UPDATE:

1. AMD SKA lets you get AMD IL without having an AMD GPU, and also shows the tex:alu ratio and other info for all AMD GPUs at the same time.
2. The AMD SDK ships the utils source, so now both the Nvidia and AMD OCL SDKs can be compiled in VS2010!
3. gDEBugger 5.5 doesn't detect AMD perf counters with 10.2 (I think it worked with the 9.12 hotfix);
it is not working with the 10.3 beta either.

*See next post http://oscarbg.blogspot.com/2010/02/parallel-algorithms-avaiable-on.html
*Fermi X2 is on track; the possible launch date is May!
*In 2-3 weeks we get the 5830 (high-performance, low-budget card), and a 2GB 5870 card with 6 mini-DP outputs on 11 March!
*A Catalyst 10.3 beta leak is available, go search for it! (8.71.3, CAL 556)
*GPU Computing Gems call
*Matmul by hazeman:
it is an assembly->C port similar to the 1 TFLOPS matmul CAL example. The downside is that it uses their own C->IR compiler, but is it easy to port to OCL? And what about perf?
*bernaclejunior is doing a good job on sort and sparse matvec on OCL and DC.
He claims nearly 400 Mkeys/s, versus under 200 Mkeys/s for state-of-the-art Nvidia sorting on a GTX285.
Lee Howes reportedly also has fast code working!
Some intermediate code has been posted on the XNA and AMD forums, but it is still not the best.
*High-performance matvec mul OCL code from Bealto (tested on AMD and Nvidia).
*I have tested the cubin-optimized matmul code and I get 480 GFLOP/s, not bad compared to 380 GFLOP/s.
I have also seen that the Tesla computing driver does not support overclocking in EVGA Precision.
GPU-Z and EVGA Precision also fail to read core speed, memory speed, GPU usage and memory info; anyway,
temperature and fan speed read OK.
The profiling run is very long, so (tested on VC2010 RC1)
change this in autoprofile:

profile_sgemm_square("../method1/decuda_ldsb32_cudasm.cubin", "method1_variant_sgemmNN", &method1_DrvWrapper, cat(OUTPUT_DIR,"method1/variant_threads320.txt") );
profile_general_sgemm_square("../method6/decuda_ldsb32_cudasm.cubin", "method6_variant_sgemmNN", &method6_DrvWrapper, cat(OUTPUT_DIR,"method6/variant_threads320.txt") );
profile_general_sgemm_square("../method7/decuda_ldsb32_cudasm.cubin", "method7_variant_sgemmNN",
&method7_DrvWrapper, cat(OUTPUT_DIR,"method7/variant_threads320.txt") );
profile_sgemm_square("../method8/decuda_ldsb32_cudasm.cubin", "method8_variant_sgemmNN",
&method8_DrvWrapper, cat(OUTPUT_DIR,"method8/variant_threads256.txt") );

These variants are the fastest, and method 1 is the best (480 GFLOP/s). Also use

for( n1 = 32 ; n1 <= 4096 ; n1+=96)

in place of

for( n1 = 5 ; n1 <= 4096 ; n1++)

The result is a roughly 100x faster test run in

->profile_general_sgemm_square(profile_general_sgemm_suqare.cpp,profile_sgemm_suqare.cpp)
->profile_CUBLAS_overN
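
The two loop headers quoted above sweep very different numbers of matrix sizes, which is where the speed-up comes from. A minimal sketch for counting the sizes each loop visits (`sweepCount` is a hypothetical helper, not part of the SGEMM package):

```cpp
// Counts the iterations of: for (n = first; n <= last; n += step)
int sweepCount(int first, int last, int step) {
    if (step <= 0 || first > last) return 0;
    return (last - first) / step + 1;
}
```

The step-1 loop from 5 to 4096 visits 4092 sizes, while the step-96 loop from 32 visits 43, a ratio of about 95. That is in line with the roughly 100x figure, although it is only a rough check, since the time per size grows with the matrix dimension.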

* I have also tested the sparse voxel demo and fixed it for TCC, but building the 1 GB samples crashes on the
ball example with out-of-memory in the x32 release exe, and x64 crashes anyway; I have to fix that.
I also found the Sibenik and Fairy scenes, but I don't know how to build the sibenik-d example, which is displacement
mapped using bump-map textures(?).
I have to test it.

Antialiasing in Deferred shading GL code

new GL multivendor SM5.0 info found in 10.3:
*GL_EXT_tessellation_shader

gl_TessCoord gl_TessLevelOuter gl_TessLevelInner

*GL_EXT_shader_subroutine
*GL_EXT_gpu_shader5

memoryBarrier bitCount findLSB findMSB bitfieldReverse bitfieldInsert bitfieldExtract floatBitsToInt floatBitsToUint intBitsToFloat uintBitsToFloat

*GL_EXT_gpu_shader_fp64
new in 10.3:
*GL_EXT_shader_atomic_counters
GL_MAX_ATOMIC_COUNTERS_EXT

glResetAtomicCounter
check fail: index must be a constant in atomic counter functions
gl_MaxAtomicCountersEXT
atomicCounterIncrementEXT atomicCounterDecrementEXT atomicCounterEXT
imageAtomicAdd imageAtomicSub imageAtomicMin imageAtomicMax imageAtomicIncWrap imageAtomicDecWrap imageAtomicAnd imageAtomicOr imageAtomicXor imageAtomicExchange imageAtomicCompSwap

GL_EXT_texture_compression_bptc (replaces amd extensions)
GL_AMD_conservative_depth

OpenCL:
1. Pinned memory can be enabled on Nvidia via:

I use

host_mem = clCreateBuffer(context,
CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
size,NULL,&ocl_err);
*ptr = (void*)clEnqueueMapBuffer(cmd_queue,host_mem,
CL_TRUE,CL_MAP_READ|CL_MAP_WRITE,
0,size,0,NULL,&evt,&ocl_err);

to create page locked memory using the NVIDIA driver, where it works fine. However on my AMD card this makes no difference to malloced memory.

AMD guys confirm it is still not working.
2. The CVS with the ICD code is for Khronos members; the spec has been updated.
3. The new CL headers at Khronos list the function addresses used by the ICD; there are also cl_ext.h and cl_gl_ext.h:
http://www.khronos.org/registry/cl/ headers
4. OpenCL 2.01 on Ubuntu 9.10:

You indeed have to boot with "nopat" or use Catalyst 10.2, when it becomes available. CAL version >= 1.4.553 to get this working without "nopat" option.

XvBA and other linux video decoding updates:
*vaapi guy working on Crystal HD support?
For Crystal HD demos:
Crystal HD SDK from GIT, as of 2010/02/15.

*at least supported in basic samples
*xvba now working for vlc 1.1git and gnash via updates for xvba-vaapi and gnash

Status of XvBA:
Works with MPLAYER (ass subtitles included),VLC and GNASH!
issues:
1.First broken decode in 5xxx..
2.Deinterlacing is broken in XvBA. It's the second most critical bug that has to be fixed by the
end of April.
(Only bob deinterlacing at this time. More elaborated deinterlacers are not, and won't be, exposed to the public builds of xvba-video.)

Changelog:

Version 0.6.5 - 08.Feb.2010
* Add brightness/contrast/hue/saturation display attributes
* Fix vaPutSurface() window resize. e.g. when switching to full-screen mode
* Allow vaPutSurface() to render to multiple drawables from a single surface

Notes:
- My ProcAmp adjustments are probably not fully correct. e.g. hue doesn't preserve luminance yet. Besides, this uses an extra FBO.
- The last change works around a bug in the driver and now makes it possible to use VA-API acceleration with Gnash with the AGG renderer. However, this exposes another performance problem (flickering in windowed mode) of the driver. You can work around that with XVBA_VIDEO_PUTSURFACE_FAST set to "yes" or "1". The semantics are not fully equivalent and can cause problems, hence it's disabled by default, though it's designed to work with Gnash and MPlayer.
There is already native VA-API support for G45. At this time, it only does MPEG-2 VLD, i.e. full video decode. Intel is working on H.264 support and this should be available by Q2. I don't think there is any H.264 video decoding at Gallium3D level yet, so VDPAU / VA-API support would be useless at this time.

Version 0.6.6 - 11.Feb.2010
* Fix XvBA objects destruction for fglrx >= 8.70.3
* Fix vaPutImage() to a surface used for decoding
* Fix vaGetImage()/vaPutSurface() with surface dimensions not a multiple of 16
* Fix rendering of VA subpictures that were previously deassociated

The third change is actually two different workarounds for a single and major flaw in XvBA. I have not fully regression tested but this looks OK for MPlayer, Gnash and VLC. This should fix Kano problems.
The fourth change is a fix for MPlayer/VA-API with ASS support, and that I will probably upload tomorrow. I have to check against the latest Intel drivers first. NVIDIA is already fine.

With this mplayer-vaapi snapshot and xvba 0.6.6 ASS works!

Version 0.6.7 - 18.Feb.2010
* Use fail-safe values for H.264 videos encoded over HP@L4.1
* Fix hue rotation to preserve luminance
* Fix internal contrast range to [ 0.0f .. 10.0f ]
* Fix rendering of multiple subpictures per surface
* Fix vaCopySurfaceGLX() for surfaces with dimensions not a multiple of 16

- The first change ensures that we don't crash or do weird things if we throw unsupported H.264 contents to the decoder. Well, it tries to get things on a safer side, without really fixing it.

- The ProcAmp changes are probably still not correct but this looks better for contrast and hue rotation.

- The fourth change fixes rendering of multiple subpictures per surface. In particular, you can now have OSD + EOSD + ProcAmp
adjustment bars (3 subpictures) in MPlayer without crashing the application.

- The last change is a workaround for a serious XvBA flaw, now implemented in vaCopySurfaceGLX(). e.g. for mplayer -vo vaapi:gl -va
vaapi. As a side effect, this would also workaround another limitation in the future iteration (0.6.8) whereby only GL_BGRA textures
are supported at this time.

Mplayer vaapi

Version 2010.02.12
* Fix YV12 rendering for SW codecs
* Add EOSD support (ASS subtitles)
* Add compatibility with original VA-API 0.29
* Add support for -geometry +xxx+yyy (Adam Strzelecki)

For EOSD & AMD, you need xvba-video >= 0.6.6.


Thursday, 18 February 2010

Learned from voxel rendering demo code: CUDA 3.0 how to change cache size (for Fermi) function found!

Posted on 12:21 by Unknown

It's in the voxel code:
\efficient-sparse-voxel-octrees\src\framework\base\dllimport.inl
cuFuncSetCacheConfig
cuFuncSetCacheConfig, (CUfunction hfunc, CUfunc_cache config), (hfunc, config))
There are also other functions I didn't know:
cuGraphicsSubResourceGetMappedArray
cuGetExportTable

Also, they don't use GLEW and do the initialization themselves.
Other tricks:
CPU trick:

// Force the main thread to run on a single core.
SetThreadAffinityMask(GetCurrentThread(), 1);

GPU trick:

flags |= CU_CTX_SCHED_SPIN; // use sync() if you want to yield
#if (CUDA_VERSION >= 2030)
flags |= CU_CTX_LMEM_RESIZE_TO_MAX; // reduce launch overhead with large localmem
#endif
what about CU_CTX_LMEM_RESIZE_TO_MAX?

The voxel raycasting demo also has good code: it supports stereo OpenGL rendering (for Quadros) and GUI controls,
and has good multisampling code.

You can also see which functions were added in each release since 2.2:

#if (CUDA_VERSION >= 2020)
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuDriverGetVersion, (int *driverVersion), (driverVersion))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuMemHostAlloc, (void **pp, size_t bytesize, unsigned int Flags), (pp, bytesize, Flags))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuMemHostGetDevicePointer, (CUdeviceptr *pdptr, void *p, unsigned int Flags), (pdptr, p, Flags))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuFuncGetAttribute, (int *pi, CUfunction_attribute attrib, CUfunction hfunc), (pi, attrib, hfunc))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuTexRefSetAddress2D, (CUtexref hTexRef, const CUDA_ARRAY_DESCRIPTOR *desc, CUdeviceptr dptr, unsigned int Pitch), (hTexRef, desc, dptr, Pitch))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuWGLGetDevice, (CUdevice *pDevice, HGPUNV hGpu), (pDevice, hGpu))
#endif

#if (CUDA_VERSION >= 2030)
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuMemHostGetFlags, (unsigned int *pFlags, void *p), (pFlags, p))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGLSetBufferObjectMapFlags, (GLuint buffer, unsigned int Flags), (buffer, Flags))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGLMapBufferObjectAsync, (CUdeviceptr *dptr, unsigned int *size, GLuint buffer, CUstream hStream), (dptr, size, buffer, hStream))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGLUnmapBufferObjectAsync, (GLuint buffer, CUstream hStream), (buffer, hStream))
#endif

#if (CUDA_VERSION >= 3000)
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuMemcpyDtoDAsync, (CUdeviceptr dstDevice, CUdeviceptr srcDevice, unsigned int ByteCount, CUstream hStream), (dstDevice, srcDevice, ByteCount, hStream))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuFuncSetCacheConfig, (CUfunction hfunc, CUfunc_cache config), (hfunc, config))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsUnregisterResource, (CUgraphicsResource resource), (resource))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsSubResourceGetMappedArray, (CUarray *pArray, CUgraphicsResource resource, unsigned int arrayIndex, unsigned int mipLevel), (pArray, resource, arrayIndex, mipLevel))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsResourceGetMappedPointer, (CUdeviceptr *pDevPtr, unsigned int *pSize, CUgraphicsResource resource), (pDevPtr, pSize, resource))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsResourceSetMapFlags, (CUgraphicsResource resource, unsigned int flags), (resource, flags))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsMapResources, (CUgraphicsResource *resources, CUstream hStream), (resources, hStream))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsUnmapResources, (unsigned int count, CUgraphicsResource *resources, CUstream hStream), (count, resources, hStream))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGetExportTable, (const void **ppExportTable, const CUuuid *pExportTableId), (ppExportTable, pExportTableId))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsGLRegisterBuffer, (CUgraphicsResource *pCudaResource, GLuint buffer, unsigned int Flags), (pCudaResource, buffer, Flags))
FW_DLL_IMPORT_RETV( CUresult, CUDAAPI, cuGraphicsGLRegisterImage, (CUgraphicsResource *pCudaResource, GLuint image, GLenum target, unsigned int Flags), (pCudaResource, image, target, Flags))
#endif
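
The guards above compare against CUDA_VERSION, which encodes the toolkit version as major * 1000 + minor * 10, so 2020 is CUDA 2.2, 2030 is 2.3 and 3000 is 3.0. A small sketch for decoding such a value (`CudaVersion` and `decodeCudaVersion` are hypothetical names for illustration):

```cpp
// Hypothetical holder for a decoded toolkit version.
struct CudaVersion {
    int majorVer;
    int minorVer;
};

// Decodes a CUDA_VERSION-style integer (major * 1000 + minor * 10).
CudaVersion decodeCudaVersion(int v) {
    return CudaVersion{ v / 1000, (v % 1000) / 10 };
}
```

The same decoding applies to the value returned through cuDriverGetVersion, which appears in the first guarded group.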

Currently it fails with the CUDA Compute Cluster (TCC) driver.
In CudaModule::staticInit(void),
change this:
checkError("cuGLCtxCreate", cuGLCtxCreate(&s_context, flags, s_device));
to:
if (tcc)
{
    // No GL interop under TCC: create a plain context instead.
    checkError("cuCtxCreate", cuCtxCreate(&s_context, flags, s_device));
    //res = cuGLInit();
}
else
{
    checkError("cuGLCtxCreate", cuGLCtxCreate(&s_context, flags, s_device));
}
cuGLInit is perhaps needed, but it is deprecated anyway
(would it go at cuInit(0), or after cuCtxCreate?).

Also, if TCC were smarter this would just work and fall back
to host interop, as CUDA already does; so I think
all CUDA GL functions simply return an error under TCC.

Anyway, thanks for the good code. Also change
Buffer::Hint_CudaGL in CudaRenderer::CudaRenderer(void) to Buffer::Hint_None,
so:
: m_frameBuffer (NULL, 0, Buffer::Hint_None),//Buffer::Hint_CudaGL),


A month of news!

Posted on 10:49 by Unknown

So here go all the random news items I found interesting this past month:
* AMD CAL libs coming to Mac? In PGI 10.2, pgaccelinfo includes -ati and -amd options to report ATI accelerator info. This is in the Mac release too, where it says libamdcalcl.dylib not found, so it seems
it is not working?
This would close the gap: standard OpenCL on three OSes, and CUDA and CAL on three OSes as well.
Related news to remember: PGI was interested in using the Nouveau stack as the base for a GPU computing stack on OpenSolaris and FreeBSD, after Nvidia spoke about Solaris and demonstrated(?) it at GTC08, but that is dead now.
On Windows, to get pgaccelinfo working, copy aticalcl.dll to libamdcalcl.dll in its directory, and likewise for calrt, and it works. So it really seems PGI has AMD CAL for Mac, since it links against the dylib, no?
I hope they don't spend too much energy on it, since OpenCL is a better target for the PGI accelerator model.
Perhaps it is worth mailing streamdeveloper at amd dot com to ask.
* Par4All allows automatic parallelization for CUDA, etc.
* After the AMD assembly matmul kernel achieved 1 TFLOP on 48xx hardware (58xx should reach 2 TFLOPS),
we now have an assembly-optimized matmul for Nvidia with 10-20% better performance.
Search for "Hand-Tuned SGEMM on GT200 GPU, 10% ~ 20% improvement of SGEMM".
It allows 512 GFLOPS on a GTX 285 -> 1 TFLOP matmul for Fermi? (like Larrabee? see the ACM video)
It comes with code and a report.
It also has a trick: using asm("") with PTX inside a .cu kernel works via nvcc thanks to Open64 features.
* Nvidia has released updated demo videos on YouTube: the "fluid demo" for the Fermi launch, Parallel Nsight (Nexus), and one about the Sled demo with an engineer talking about it.
* I still don't know if an opencl.dll from Khronos works for Nvidia and AMD cards simultaneously..
Someone says the 2.01 opencl.dll works for the two simultaneously..
I don't know, but it seems AMD works with the Nvidia opencl.dll if I have the Tesla Computing driver.
Related: the Khronos ICD has been released, though the things to remember are that you can now write a compatible OpenCL ICD from the doc, and that through the ICD some functions which cannot be resolved to a concrete platform, such as unloading the compiler, are "no operation"..
Another thing is that the 2.01 DLL seems to have d3d10 and d3d9 interop functions(?), or these are obtained via
the ICD, which supports functions not exported through it; I must check..
Also, Nvidia has d3d11 interop; what about AMD?..
Also, the spec has some CVS links from Khronos for getting some code (ICD loader code?), so someone could mail Khronos (Jon Leech, for example) for the Khronos CVS ICD password..
*CUDPP now has the tridiagonal solvers from the 2010 paper..
Still waiting for the hash functions from the SA 2009 paper to be added..
Also, a survey has been released asking where to devote more energy: double support, graph functions, etc..
*Bad article by Demerjian about Fermi:
http://www.semiaccurate.com/2010/02/17/nvidias-fermigtx480-broken-and-unfixable/
but Nvidia seems confident and set the clocks for Fermi this week, and it also seems the mid-range and other cards taped out some time ago..
*Cusp is progressing towards dense math(?): it seems to have dense matmul and an LU solve.
* clGetGLContextInfoKHR is still not usable although it is present in the 2.01 header (was it there before, in 2.0?);
there is also some string for it in the Khronos ICD DLL, but not really in the lib and DLLs..
*Linux news:
Catalyst 10.2 has direct2d-based acceleration (search Phoronix).
Also, Nouveau now has Gallium3D support in Fedora 13 (working OpenGL ES 2.0 and OpenVG state trackers?)
Heaven benchmark for Linux coming in March for GDC? A new version for sure (with Fermi support too, it seems), as Catalyst 10.2 now shows all the big SM 5.0 features going through EXT extensions: double support (ext_fp64), shader model 5.0 (ext_shader5), tessellation (ext_tessellation_shader).
Still no standard extensions for HDR; the new texture compression is shipping but undocumented, and similarly for the random access target..
*OpenGL and OpenVG demos:
Some nice code and tutorials found on the web:
->OpenGL geometry shader one-pass texture cubemap render (3 ways)
->OpenGL geometry culling: from 2.1 billion to 2 million; works on ATI and Nvidia; it is 3.2 code..
->Complex OpenVG demo from the SA 2009 Khronos presentation (animation)
->OpenGL uniforms vs texture objects.
->Hardware Tessellation on Radeon in OpenGL (geeks3d):
says there are two tessellators in the 5xxx extensions
->Mali SDK UI 2.3, Tegra Khronos SDK..
->Code from the Stanford iPhone GL ES course
http://www.khronos.org/news/multimedia/optimizing-opengl-for-iphone-stanford-university
-> OpenGL 3.2 samples:
http://nopper.tv/opengl_3_2.html
g-truc ->OpenGL 3 Samples Pack 1.2.1 released
*Also, it seems the WebGL spec is released at GDC09, going by some talks from Khronos.. Firefox 3.7 will have it too, and the roadmaps plan it for mid year; it is now at alpha 2.
*I have found on ACM video that Rattner's SC09 talk shows a Larrabee demo with matmul and sparse math..
More videos are from the AMD OpenCL PhD boy..
*It would be nice if OptiX got upgraded with:
->Breadth-first and packet ray compression via sort, from the EG2010 paper.. improves the kernels by Timo and Aila used in OptiX?
improves raytracing shadow kernels 2x-4x
->Sparse Voxel Raycasting from the I3D 2010 paper
->OpenRL compatibility.. the differences look small..
Regarding I3D 2010, for me the only thing that remains to be seen is Stochastic Transparency by Enderton..
Also, OpenRL is going to Khronos, similar to what Apple did with OpenCL.. must check the similarities to OpenRT (the previous standard)
* It seems the AMD drivers for Windows 7 have a bug in GDI mode:
The same article has some info on GDI being accelerated on XP and 7 but not on Vista..
Also, on 7 it is in Aero only..
GDI bug on the 5xxx series:
http://www.tomshardware.com/reviews/2d-windows-gdi,2547-15.html
AMD has supplied a hotfix and it seems 10.2 WHQL doesn't contain it, so perhaps 10.3 or 10.4?
Good theory about GDI on Windows.. disabled in Vista..
Download 2dbench from tomshardware for checking perf..
* OpenNL 3.0 released, with CUDA numerical libraries (CNC, similar to CUSP?)
* Sparse voxel octrees I3D 2010 paper available, plus the extended NV tech report '10 #1 with more photos and GTX 285 perf.. Video and code are also available in the Google Code CUDA voxel raycasting project..
See the realtimerendering blog post..
* Tegra 2 full SDK:
it now has Android 2.1 images and the full Khronos SDK (Tegra Khronos SDK)..
Also, it seems video compression via OpenMAX is already there on Linux and Android?..
* Current Catalyst is 10.2 (8.70.2) WHQL, and 8.70.3 is available (only the OpenGL version changes, no CAL or D3D changes).
The 10.3 beta given to the press is ATI 8.71.3..
Now about it:
3D hooks info is needed, and it would be good if it enabled OpenGL quad-buffer stereo on Radeon..
Better still an SDK with a sample of D3D driver hooks, similar to how 3D Vision is used in Avatar..
*There is a GPU-Z with ATI OpenCL detection enabled; I don't know if it checks correctly or just always reports OK..
*There is now GDC 2010 info from Nvidia at developer.nvidia.com, and from Intel for GDC 2010.
From Intel expect:
->GPA 3.0
You'll see in-depth, real-time demos of GPA 3.0, including the much anticipated advanced thread/task timeline that helps optimize task-based threading. New features such as automated summarization of your game engine’s performance on multi-core CPUs, the DirectX API, and the GPU will have you breathing a sigh of relief. Platform performance analysis has finally arrived.
->Intel C++ Compiler version 12 info
This session includes a review of the new automatic vectorization features in the upcoming Intel C++ Compiler version 12.
->Tickertape
Shows a highly-threaded particle system with orientable quads — like paper in a parade. Particles are affected not only by gravity, but also by air resistance and wind.
*The book "Programming ..." by Kirk has been released; it is a CUDA book..
Materials are here:
http://www.elsevierdirect.com/companion.jsp?ISBN=9780123814722
There is also a 3-chapter sample..
*At Khronos I have found an OpenCL NVIDIA build dated 2010-02-03.
Released soon?
Also an ARM Cortex-A9 one:
Samsung Electronics 2010-02-03 OpenCL_1_0
Embedded Linux System with SAMSUNG OpenCL Library with OpenCL running on a ARM Cortex-A9
MPCore CPU.
* Realistic Demo Crymod: Widet2_Benchmark_alpha.7z
*From Caustics:
"due to be released in March"

OpenRL™ SDK Public BETA Registration

Caustic Graphics is about to achieve our next major milestone in bringing cinema quality graphics to every display. We are introducing our OpenRL SDK V1.0 restricted BETA release this week, which is the first implementation of our Open Ray Tracing Language (OpenRL) specification. The OpenRL SDK also includes our new OpenRL shading language (RLSL), which is based on GLSL and provides run-time compiled programmable shaders for ray tracing.

Similar to OpenGL for rasterization, the OpenRL specification is a framework for writing ray tracing applications that execute across heterogeneous compute platforms. Today there is no open standard, cross-platform API for ray tracing. Consequently developers must program their ray tracing applications "to the metal" or accept “vendor lock-in” by using a proprietary closed standard that is limited to a specific subset of hardware.

Later this year, we will be proposing the OpenRL specification as an open standard to the non-profit technology consortium, the Khronos Group. Moreover, we will actively solicit and support the introduction of third-party implementations of OpenRL. In the meantime, we are pleased to introduce the first implementation of the OpenRL specification, which we are calling the OpenRL SDK.

Some quick facts and features slated for the OpenRL SDK:
OS support for Windows, Mac OS X, and Linux;
Uses all OpenCL-based GPUs (e.g., AMD, nVidia, S3) and x86 CPUs (AMD, Intel) simultaneously;
Adding more compute delivers an immediate and nearly linear performance boost;
Plugging in one or more CausticOne or CausticTwo cards delivers the ultimate in ray tracing acceleration.
Target markets include but are not limited to, Film, Video, Games, Transportation, Education, Consumer Products, Architecture, Engineering, and Construction.

We would like to invite you to participate in our OpenRL BETA public program, slated for release this quarter. The OpenRL SDK Public BETA program will include free access to our developer forum where you can post your questions and answers to the OpenRL SDK, RLSL, CausticOne and CausticTwo.

Fill out the form below. Upon release we will send you an email with instructions to download the OpenRL SDK.

P.S. - For those of you who signed up for the CausticRT Emulator, well don't fret. The OpenRL SDK name supersedes CausticRT and CausticGL, whose names will be retired upon release of the production version of the OpenRL SDK.

So it is OpenCL based and being submitted to Khronos..
The S3 support intrigues me, as no driver supports it?

*gDEBugger 5.5 with new AMD performance counter support (Catalyst 9.12 and up).
Also, gDEBugger CL goes to beta soon..
*ATI OpenCL 2.01 released;
at least it fixes pcchen 8 - knights demo ..
Still no fix for the bugs with the Apple FFT code, though they are reportedly fixed internally by AMD..
Still don't know if OpenMM on OpenCL is fixed, nor about the early pyrit builds that now have countermeasures..

*10.6.3: check OpenGL 3.2, Nvidia doubles, CL ATI image support and ATI CAL

RAW:
Catalyst 10.2 and 10.3 news (8.71.3): 3D quad buffer for D3D (can quad-buffer 3D be enabled for OGL via OCL/DX/OGL interop?)
58xx xbvau not work but patch similar to 4xxx card bug earlier will fix it
fglrx 10.4 ubuntu driver fixed by then..
pgi 10.2 pgaccelinfo has cal info and libamdcal.dylib not found (amd has cal for mac?)
gdc eyefinity sdk?

Catalyst 10.2 has 181 GL extensions!

3 new, 1 EXT, 2 ARB:
GL_ARB_blend_func_extended - more enhancements to blending? whats left in DX10/11 that OGL doesn't have?

GL_ARB_fragment_coord_conventions - DX9 compatibility (wasn't this in OpenGL 3.2?!? still missing transform_feedback2); it was not in the 9.12 hotfix

GL_EXT_texture_buffer_object_rgb32 - this one is interesting as GL_ARB_texture_buffer_object already lists all the RGBA32 F, I, and UI.
Note: I saw it in the Fermi 195 drivers

Also I note that 2 amd extensions have been documented:
http://www.opengl.org/registry/specs/AMD/seamless_cubemap_per_texture.txt - when did this get added?
http://www.opengl.org/registry/specs/AMD/shader_stencil_export.txt - from 10.1

Wonder how far away we are from GL 3.3. Still haven't seen DX11 stuff yet, but they must be working on it!

Can't see any sign of the rumored (or under NDA) per-game application profile support yet in CCC. Supposed to be in 10.2...

Tesla computing driver released, 196.28, 64-bit Windows 2008 R2: OpenCL support? Nexus with ATI? compute exclusive, timeout
*Still no compiler with doubles in the Feb 2010 DirectX SDK
*Fermi 4x slowdown on doubles


About Tesla computing driver!

Posted on 06:39 by Unknown

Hi boys,
I'm becoming increasingly lazy about publishing stories.. sorry for that..
A good megacompilation is coming this week..
Anyway, today's is old news: installing the Tesla Computing driver (196.28); see the info (slide 35)

Tesla Compute Cluster (TCC)Driver
Enables Windows HPC on Tesla
Enables Tesla without a NVIDIA graphics card with Windows 7, Server 2008 R2, Windows Vista, Server 2008
Only Tesla 8-series, 10-series and 20-series supported
Only works with CUDA
Does not support OpenGL and DirectX
Available in beta now, release in Jan 2010
Enables the following features under Windows with CUDA:
RDP (Remote Desktop)
Launch CUDA applications via Windows Services
No Windows Timeout issues
No penalty on launch overhead
KVM-over-IP enabled (CPU Server on-board graphics chipset enabled)

Theoretically it is for Tesla and 64-bit Windows 2008 R2 only, but I have successfully installed it on Windows 7 with a GTX 275..
So installing the compute driver is possible on GeForce:
just locate NVWD.inf and add
under [NVIDIA_SetA_Devices.NTamd64.6.0]
%NVIDIA_DEV.05E6.01% = Section001, PCI\VEN_10DE&DEV_05E6
and under [NVIDIA_SetA_Devices.NTamd64.6.1]
%NVIDIA_DEV.05E6.01% = Section002, PCI\VEN_10DE&DEV_05E6
This is for the GTX 275; for other cards, look up your card's device ID in the inf..
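
For other cards the two lines follow the same pattern, so they can be generated from the PCI device ID. A minimal sketch, assuming the line format shown above stays the same for every card; makeInfDeviceLine is a hypothetical helper name, and I assume the %NVIDIA_DEV...% string key may also need a matching entry in the inf's [Strings] section:

```cpp
#include <cassert>
#include <string>

// Build one device line for the driver inf, in the same form as the
// GTX 275 lines above. deviceId is the 4-hex-digit PCI device ID
// (e.g. "05E6"); section is the install section (e.g. "Section001").
std::string makeInfDeviceLine(const std::string& deviceId,
                              const std::string& section) {
    assert(deviceId.size() == 4);  // PCI device IDs are 4 hex digits
    return "%NVIDIA_DEV." + deviceId + ".01% = " + section +
           ", PCI\\VEN_10DE&DEV_" + deviceId;
}
```

Calling makeInfDeviceLine("05E6", "Section001") gives back exactly the first GTX 275 line; use Section002 for the NTamd64.6.1 block.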

It only supports CUDA at the moment (CUDA C(++)).. of course I have checked, and PGI Fortran detects it, so CUDA Fortran works too..
OpenCL doesn't work currently, but I think they will add support for it in the 200 series drivers..
For DirectCompute I don't have much hope.. as device detection is for graphics devices and the driver doesn't expose DirectX..

I have to test Badaboom to see if CUVID works, and also whether CUDA video encoding via kernels (or CUVENC..) is working.
I have to test if OptiX works (I hope so, but not the graphics demos, so save the render to a file and check).
Also, what about PhysX.. Theoretically it should work, as no interop with graphics is currently enabled; at least this enables ATI rendering + PhysX to work..
Have to test..
At least it has no graphics API dependencies.

Also, what about other CUDA-heavy programs such as vReveal..

Lastly, it would be good if it supported Nexus, as I have ATI+Nvidia, but the normal Nvidia driver only enables CUDA if you have extended the desktop to *at least ONE Nvidia GPU*..
i.e. if you have 2 Nvidias you can use Nexus, but ATI+Nvidia doesn't work, as extending the desktop crashes the GPU debugger and not extending it doesn't enable the Nvidia driver (Windows 7 limitation?)..

Official response:
The Nexus Beta currently only officially supports the 195.62 driver from nvidia.com. Regarding support of the TCC driver, it is not currently supported, but Nexus debugging support using TCC is something we are considering for a future build.

Posted in link above:

Hi,
some questions:
I have tested on a GTX 275 (adding the device ID to the driver inf) with an AMD 5850 as display: good work..
(hmm, having said that, I hope you don't block this possibility, similar to how you are going to block double precision potential by slowing it down 4x on GeForce Fermi cards..)
By "Only CUDA is supported in this release" you mean OpenCL is supported right now?
I have tested the OpenCL oclDeviceQuery and it fails to find a platform ID, and it is clear the driver includes neither opencl.dll nor nvcompiler.dll..
Using those DLLs from 196.34 doesn't work, although nvcuda.dll seems to have the Khronos ICD entry points..
Are you going to support OpenCL soon on the Tesla driver?
I have not tested, but are CUDA programs using textures meant to work? I assume yes, although it is a graphics feature..
Also, what about DirectCompute? i.e. for DirectCompute apps not using graphics, are you going to support it?
And finally Nexus: I have access to the Nexus beta, which supports debugging on one computer by using two Nvidia cards in one PC.. is this supported using an ATI card as the display device, now that the one Nvidia card doesn't need an extended desktop?
I hope you add Nexus support for the Tesla Computing driver in case it isn't supported right now..


Friday, 5 February 2010

A long report of the silence before the storm: AKA a month before Fermi..

Posted on 07:29 by Unknown

Sorry, raw dump of my ideas:

Although we are a month away from a full storm, if we listen carefully we can hear some thunder from that storm known as Fermi and the new software updates:

First, the base reading: graphics arch (Nvidia GF100) and compute arch (Fermi arch)..

Also see the Deep Dive presentation, which has more perf charts than the PDF, at noticias3d.com or ..
Also, although it is not well known, there were two more Deep Dive sessions that were not much talked about, from the developer relations program, showing info about the Sled demo and Nexus graphics debugging (the first demo I have of debugging HLSL on video, as a CUDA video has already been posted).
Search in the cz page..

Tesla computing driver
GFX cards:
4x slower doubles?

As you will know, the graphics arch reveal showed revamped geometry power via parallel rasterizers (4, so 4x perf) and 16x geometry power by replicating the geometry unit 16 times..
Also, the geometry buffers and stream-out buffers now use the L1/L2 caches (and atomics?), so they are much faster
and more general (removing fixed-function hardware)..
This can be seen as, at least, a removal of fixed functions and a generalization of the rasterizer to work in parallel..
This impacts a geometry-heavy game such as Crysis, which is 60% faster; not bad, with shader power also expected to be near a 2x increase..
And I think of GF100 as 4 GPUs in one chip, each one a GPC.. as it has all it needs..

Right now the GTX 480 and 470 have H.264 MVC support (Blu-ray 3D; by the way, the HDMI 1.4 3D spec is open) (will it be exposed in DXVA or what? also in CUVID, VDPAU and/or CUVENC?..)
As you know, on Mac GPU video encoding is supported by Elemental and video decoding by a shit API (QTKit) which does not expose decoded frames as OpenGL textures or OpenCL image objects..
Elemental ships 2.2 with its own GPU decoding, so we have to see whether it is a CUVID using Snow Leopard APIs or using shaders..

Also, I have seen HDMI 1.4 outputs on Fermi, and this would be marvelous for interoperating the output
of 3D Vision with Sony 3D monitors (but what glasses do I use?)

Lastly, 3D Vision now has tri-SLI and quad-SLI support, and all the new 24-inch monitors support it (3 or 4 right now). I have seen a 27-inch monitor from ASUS for early June, and panels with 3D Vision and touch support are being sampled, I think.. but remember
YouTube 3D Vision support, windows support and browser integration are promised soon..

There are reports that claim

SA 2009 courses things learned:
SC 2009 courses things learned:
I3D 2010 things learned:

would be perfect for a fraps grabbing 3d Vision
One thing I'm sad it will not be is this will be of use for not halting the OS and also in
I hope Nvidia are working on right at least for near future this year..
I can't understand why not would be the case..

1. Although this is not strictly Fermi related, the much-needed updates of OpenCL on Mac OS X and DirectCompute on Windows are coming in a month, I expect..

Direct3D SDK updates are much needed after some 5 months (the last update came 1.5 months before the Windows 7 launch), which is something like prehistory in this rapidly changing world :-)
I hope for a GDC 2010 release (so 6 months later), at least with important fixes for all the known issues: double support, the CS library (FFT, scan), and other fixes reported on the XNA forums..

Also, it would be good if some of the samples shown at the Fermi Deep Dive session at CES were released, as they seem to be DirectX samples, such as the hair demo or the tessellated water demo.. AMD did the same with 5xxx code (search "contributed by AMD" in the Direct3D SDK)..

Also, good Ocean demos were shown by Nvidia (an OpenCL port of DirectCompute code) and by AMD in the SA 2009 OpenCL session.. it would be good to have these..

I also hope Nvidia ships more DirectCompute demos in GPU Computing SDK 3.0 final or beta2, which I hope will be released by Fermi time..

I also hope cuprintf, released two months ago, is integrated in the CUDA Toolkit or SDK and hopefully
ported to OpenCL for GPU printf debugging support (as said, AMD supports it on Linux on the CPU and it is coming to MSVC).. Anyway, I expect the OCL support to be somewhat restricted due to no template support, etc..
I would port it to OCL, but anyway it is confidential stuff right now..

See more on debugging later..

I also want to talk a lot more about CUDA SDK 3.0: ELF, cuda-memcheck, CUDA driver/runtime interop, etc.. but I will wait until the final PTX 2.0 and 1.5 (OCL) and the docs are updated..

As a checkpoint, it would be good to know how ECC and the L1/shared cache are configured and enabled..
I remember seeing something about ECC in the Control Panel of some Quadro 195 driver release..
but I don't know how the L1/shared mem cache is going to be configured (a parameter to nvcc? a CUDA API function? etc..)

10.6.3 is coming this month and has OpenGL 3.x support (well, 3.0 it seems), although netkas claims it is not complete, as OpenGL Extensions Viewer doesn't report the required GLSL 1.50 support. I think this is because no info on GL 3.x context creation has been published, so it's not creating an advanced context, but the extensions are there.. Also, comparing to 10.6.2, I see two more 3.2 extensions are supported, not bad.. I only hope they are two interesting ones and not DirectX helper extensions.. give me that plus uniform_buffer_object and TBO from 3.1 and I would be more than happy..
So I hope these are at least supported as extensions in the Nvidia driver or the AMD 5xxx driver..
It seems netkas is reporting the software renderer's extensions..
Oh boy, if Apple cared less about a stable platform and gave us GPU extensions as fast as they come on Windows and Linux, it would be perfect. I don't care about OpenGL 3.x being implemented in software; that seems as mad a situation as if Microsoft cared about the DirectX reference rasterizer for running actual games (ahem, it has WARP..)
If not, at least expect 3.1 complete by summer (=10.6.4 or 10.6.5) and perhaps 3.2 by the end of this year.. so it seems 3.2 will be complete this year..
I hope by that time to also have the optional 3.2 extensions:
GL_ARB_draw_buffers_blend
GL_ARB_sample_shading
GL_ARB_texture_cube_map_array
GL_ARB_texture_gather
GL_ARB_texture_query_lod
or at least
GL_ARB_sample_shading
GL_ARB_texture_cube_map_array
GL_ARB_texture_gather
which for me are the good ones.
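
Checking which of those extensions a driver really exposes is easy to get wrong with a plain substring search, since one extension name can be a prefix of another. A minimal sketch, assuming the extension list has already been fetched as one space-separated string (as the classic glGetString(GL_EXTENSIONS) returns; on a 3.x core context the names come one by one from glGetStringi instead); hasExtension and missingExtensions are hypothetical helper names:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// True only if `name` appears as a whole token in the space-separated list.
bool hasExtension(const std::string& extensionList, const std::string& name) {
    assert(!name.empty());
    std::istringstream tokens(extensionList);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}

// Return the wanted extensions that the list does not contain.
std::vector<std::string> missingExtensions(
        const std::string& extensionList,
        const std::vector<std::string>& wanted) {
    std::vector<std::string> missing;
    for (const std::string& name : wanted) {
        if (!hasExtension(extensionList, name)) missing.push_back(name);
    }
    return missing;
}
```

Feeding it the three "at least" names above against the string the driver reports tells you right away whether the renderer you got is the hardware one or a software fallback with a different list.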

The news is that 10.7.0 will be shown at WWDC, and if you remember, in 2008 the seed had GT200 support, so perhaps complete 3.2 (at least) and Fermi support will come with the 10.7.0 WWDC seed..

Also, although a bit premature, it would be good if the initial 5xxx support, and the Fermi support hopefully coming this year, also added the new shader 5.0 extensions (more later).
For me it would be perfect, similar to Leopard getting in 10.5.2 a lot of the new G80 extensions supported by Nvidia (geometry shaders, transform feedback, etc..) ..

OpenCL for Mac OS: FFT library perf fixes; also expect some improvements such as double support for Nvidia on GT2xx cards and ATI image support; at least this is where I would put my effort if I were Apple.. Still, the bad thing is that Apple has no 5xxx support, as the AMD 4xxx cards don't have true local mem, but this could change fast if the rumors are true of a Mac Pro shipping this or next month with 24 hardware threads (two 6-core 32nm Westmeres) and hopefully a 5xxx card as an option, so perhaps good..

Before leaving Mac OS, I also expect CUDA updates for 3.x.
Talking about CUDA on Mac OS:
you have cuda-memcheck
cuda-gdb coming soon.. will it add OpenCL at that time too?
CUDA 64-bit support (for 3.x)
efficient CUDA OpenGL interop (not expected, but could be)
Also, it would be good if hackintosh users could use Fermi with CUDA 3.0 on Mac OS..
i.e. if cuda.kext exposes access to that..

Also remember Fermi support will not be complete in the 3.0 release, at least unless it is released as beta2 in March and 3.0 is delayed to June/summer..
so expect a lot more in 3.1 and perhaps some minor things in 3.2.
If you didn't follow the GT200 intro: 2.0 had double support and shared mem atomics, but until 2.2 we didn't have host pinned mem, a GT200 feature..
Among the things said not to be present at first are support for recursion and, I think, also virtual function calls and function pointers, but I could be proven wrong..

Of course these hardware features are supported either on their own or since the beta: 8x faster doubles, 10x faster context switching, and atomics and caches on their own, and concurrent kernels and dual DMA in the beta.. these last two using

Talking about OpenCL:
I expect the Nvidia 200.x drivers to add support for the DirectX extensions (see GDC 2010).
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_nv_d3d11_sharing
are published in the Khronos OpenCL registry;
also in 196.21 I see some d3d10 functions..
But that seems crazy, as AMD has its own DirectX extensions..
A khr_dx extension would be good..
Also 3D image writes for Fermi, and perhaps the half extension for all cards..

Talking about OpenGL:
By the way, seeing Nvidia's GDC 2010 plans, it seems WebGL is launching (final spec) at GDC, and also expect some updates to OpenGL: well, I expect a bunch of EXT extensions and NV/AMD extensions supporting the new D3D 11 hardware..

Well, right now we have two undocumented extensions shipping in the 196 driver: nvx_meminfo and wgl_dx_interop..
Also, ATI has added GL_AMD_shader_stencil_export and
GL_AMD_seamless_cubemap_per_texture support in 10.1, but these are documented at Khronos (
it also added GLX_INTEL_swap_event).
This last one is interesting for async glutSwapBuffers, with events for querying when the swap completes rather than waiting for vsync or similar..

Also, I hope that in a similar way VDPAU will get efficient GLX interop, since right now it has some overheads, and perhaps these last extensions can help..

First see GPU Computing tools:

Regarding hardware debugging-> lots of news.

See:

With all these references you know:
For Windows you have Parallel Nsight (codename Nexus), which supports GFX and Compute debugging, profiling and API tracing, all integrated in Visual Studio 2008(?).. (at least for now it supports CUDA C and HLSL with DirectX 9/10, it seems)..
The problem is there is no Windows XP support, so this platform..
Of course Direct3D 11 is upcoming, and we hope OpenCL and GLSL too, but that could be some time later..
Also, the release (beta?) is targeted for Q1 2010..

Nvidia names Pro version with Direct
On other OSes you have the Visual Profilers for CUDA/OCL on Linux/Mac OS..

With that you know that cuda-gdb already has Fermi hardware support and is soon getting support for Mac OS and also for OpenCL.. Use it with DDD or Emacs and you have for other

Recapitulating earlier posts:
Solaris and FreeBSD support for CUDA: PGI is working on it using the Nouveau stack..
GPU Computing book and programming gems

Raytracing:
Well, you have OptiX 2.0 beta1, which now supports GeForce, and Fermi optimizations are promised soon..
CES videos show the frame rate going from 0.23 to 0.67 for a complex demo..
Now don't pred
It's also curious how they now claim that the cache helps a lot: Nvidia claims a 3x improvement over GT200. Well, the architectural perf increase has to be discounted by the core count (240 vs 512) and by clock speed differences, if any: at equal clocks 3 x 240/512 is about 1.4x per core, and less once a higher clock is factored in, so it seems to me no more than a 30% increase in perf per core per clock due to caches, which in turn agrees with
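
The per-core estimate is just a ratio, so here is the arithmetic as a small function. A minimal sketch: perCoreSpeedup is a hypothetical helper, the 3x figure and the 240/512 core counts are the ones quoted above, and the clock ratio is an assumption since final Fermi clocks are not known yet.

```cpp
#include <cassert>

// Speedup per core per clock, given the whole-chip speedup, the old and
// new core counts, and the new/old clock ratio (1.0 = same clocks).
double perCoreSpeedup(double totalSpeedup, int oldCores, int newCores,
                      double clockRatio = 1.0) {
    assert(oldCores > 0 && newCores > 0 && clockRatio > 0.0);
    return totalSpeedup * oldCores / (newCores * clockRatio);
}
```

perCoreSpeedup(3.0, 240, 512) comes out at 1.40625, i.e. about 40% per core at equal clocks; a clock ratio of around 1.08 would bring that down to roughly 1.30. The CES frame rates above (0.23 to 0.67) are themselves close to the claimed 3x.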

CUDA multicore:
Well, that hurts me, as this is one of the true strengths of OpenCL right now, and Nvidia seems to have abandoned it, as the initial work (MCUDA) was not very good: download it and you will see a lot of restrictions (no texture support).
And how is AMD doing:
well, see the OpenGL extensions

well with this you can at least check the diff between what I claim and

Catalyst 10.2 RC2 shows that AMD is going the route of exposing extensions as EXT ones, so Fermi and AMD will interoperate. I hope the Heaven OpenGL demo with tessellation for Linux (Windows support too?) is Fermi capable because of that, but it also seems there will be no support until March/April 2010 (10.3 or 10.4), as 10.2 has not exposed it..
GL_AMD_gpu_shader5
GL_AMD_conservative_depth
GL_EXT_texture_buffer_object_rgb32
GL_EXT_gpu_shader_fp64
GL_EXT_tessellation_shader
GL_EXT_shader_subroutine
GL_EXT_gpu_shader5
These are found in it, and as you see
GL_EXT_gpu_shader5 and GL_AMD_gpu_shader5 seem similar, so
there are no interesting AMD-only extensions except
stencil shader write, GL_AMD_conservative_depth
and the AMD random access target..
CAL is at build 55x now;
in March, at build 6xx, comes the final OpenCL SDK.
In 10.2 you can find these strings:
Hull shader(s) were not successfully compiled before glLinkProgram() was called. Link failed.
Domain shader(s) were not successfully compiled before glLinkProgram() was called. Link failed.
gl_FragStencilRefAMD
subroutineEXT uniform

I have found some of these in the Nvidia driver too, so it seems cross-vendor D3D 11 OGL extensions are
coming soon (Nvidia on launch day and ATI at GDC or in April or May, I hope)

Hopefully the Ubuntu 10.04 AMD driver (fglrx 10.4 beta), shipping in mid March, also adds OpenCL support in the driver, so no more SDK with OpenCL.so. It would be perfect if they could also ship image support, production OGL interop and byte_addressable_store.. assuming local and global atomics are production quality, I don't know.. I also hope that, as VGA arbitration is supported, I can have AMD and Nvidia GPUs working simultaneously with OpenCL detecting two platforms.. A dream come true :-)

Also from GDC 10:
Nexus: NVIDIA's New Game Development Environment: NVIDIA Parallel Nsight
http://developer.nvidia.com/nsight
It seems APEX tools are coming (details announced at GDC 09),
and PerfHUD ES is coming for Tegra profiling..

Lastly, not related, but talking about the iPad and Mac OS in general..
first, a Mac Pro is said to be coming with 5xxx, and also a 10.7 seed in June and touch iMacs coming..
