Difficulties in coding, achieving high perf and measuring MultiGPU code! ~ GPU computing: Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

A lot of CUDA software is not MultiGPU aware:
Badaboom uses multiple GPUs for multiple videos, but not for a single one.
I doubt other CUDA video encoders do either.

Here is why.
The first problem is coding it:
if CPU code needs a thread API, thread pools, etc.,
GPU code needs careful work to divide the job and send a piece to every GPU.
You also have to take care of async kernel execution, async mem copies and multiple streams, and you go crazy.
See GPUWorker on the CUDA forums
for an easy way,
and also the CUDA OpenMP example.
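
As a rough idea of the "dividing work" step, here is a minimal host-side sketch in C++. The names `Slice` and `split_work` are made up for this post (they are not part of CUDA or GPUWorker), and it assumes the work is a flat array that can be cut into contiguous pieces:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous piece of the input assigned to one GPU (hypothetical type).
struct Slice {
    std::size_t offset;  // index of the first element in the piece
    std::size_t count;   // number of elements in the piece
};

// Split `total` elements across `gpu_count` devices as evenly as possible.
// The first (total % gpu_count) devices get one extra element.
std::vector<Slice> split_work(std::size_t total, std::size_t gpu_count) {
    assert(gpu_count > 0);
    std::vector<Slice> slices;
    slices.reserve(gpu_count);
    const std::size_t base = total / gpu_count;
    const std::size_t extra = total % gpu_count;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < gpu_count; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        slices.push_back({offset, count});
        offset += count;
    }
    return slices;
}
```

With something like this, one host thread per GPU could call `cudaSetDevice(i)` and then copy and process only `slices[i]`. The async copies and streams part is where it gets hard, and this sketch does not cover it.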

The second problem is achieving high perf: if the work is not perfectly divisible, you need to exchange data from one GPU to another.
This currently requires host intervention, and the time can only be minimized if pinned mem is shared by both GPUs, so the pinned shared mem of CUDA 2.2 is needed.
For clusters, where a GPU to GPU transfer may need to go through NICs, wait for 2010, when transfers will use DMA from the GPU to pinned host mem by the NIC (?).

The third problem is measuring multiGPU perf. Of course we can add multiGPU support to some CUDA codes, but the intrinsic problem in this and in a lot of GPGPU apps lies in how you measure perf. A lot of scores are measured with inputs and outputs kept in GPU mem. If they are kept in CPU mem we get no linear scaling with GPU shader count, because the GPU-CPU transfers are counted and they take a time that stays constant for a given data size (they will improve only with newer PCI Express versions).
I think the Larrabee perf and the CUDA matmul figures vendors show us are with data on the GPUs. With multiple GPUs you have to transfer data at least from one GPU to another, and currently there is no fast way of doing that in the GPU computing APIs: it requires going through the host, so you would get no apples to apples comparison. You have to compare to benches with inputs and outputs in CPU mem, which anyway is not a "true" benchmark because, as I said before, it does not scale with shader count.
Think of it as CPU benchmarks that accounted for the time of reading and writing input data to hard disks.
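
To make the scaling argument concrete, here is a very simple model. The symbols are my own assumptions for illustration: T_xfer is the fixed GPU-CPU transfer time, T_comp is the compute time on one GPU, n is the number of GPUs, and compute is assumed to divide perfectly.

```latex
\[
T(n) = T_{\mathrm{xfer}} + \frac{T_{\mathrm{comp}}}{n},
\qquad
S(n) = \frac{T(1)}{T(n)}
     = \frac{T_{\mathrm{xfer}} + T_{\mathrm{comp}}}{T_{\mathrm{xfer}} + T_{\mathrm{comp}}/n}
\]
\[
\lim_{n \to \infty} S(n) = 1 + \frac{T_{\mathrm{comp}}}{T_{\mathrm{xfer}}}
\]
```

With made-up numbers T_xfer = 2 ms and T_comp = 8 ms, four GPUs give T(4) = 2 + 8/4 = 4 ms, so the speedup is 10/4 = 2.5x instead of 4x, and it can never pass 5x however many GPUs you add. If the data stays in GPU mem (T_xfer = 0) the same model gives S(n) = n, which is the linear scaling the vendor figures show.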
Note there have been great strides this year in using multiple GPUs, to the point of being able to transfer data between GPUs in a graphics API: OpenGL, with AMD and Nvidia proprietary extensions.
For Nvidia see http://www.opengl.org/registry/specs/NV/copy_image.txt
and search for wglCopyImageSubDataNV.
For AMD see http://www.opengl.org/registry/specs/AMD/wgl_gpu_association.txt, which says:

To facilitate high performance data communication between multiple
contexts, a new function is necessary to blit data from one context
to another.

VOID wglBlitContextFramebufferAMD(HGLRC dstCtx, GLint srcX0, GLint srcY0,
GLint srcX1, GLint srcY1, GLint dstX0,
GLint dstY0, GLint dstX1, GLint dstY1,
GLbitfield mask, GLenum filter);

We can try to exchange data between multiple GPUs from the computing APIs by using OpenGL interop together with these OpenGL extensions,
i.e. CUDA OpenGL interop and OpenCL OpenGL interop. Note that CAL OpenGL interop is surely coming for AMD GPUs but is currently lacking.

I have to ask the vendors (Nvidia and AMD) what they are doing so that developers can transfer data between GPUs without CPU host intervention, using DMA engines, in both CUDA and OpenCL.
Note that at SC09 Nvidia announced that by spring next year there will be a solution for a similar problem in the cluster environment, i.e. transferring from GPU to NICs without host intervention, I think.


Sunday, 13 December 2009

