Double precision support in GPU computing APIs and GPUs, and emulating it ~ GPU computing: Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Today I'm going to explore double precision support in all the GPU computing APIs and hardware:

CUDA
====

CUDA was the first to support it on Nvidia cards, since version 2.0 (June 2008), for the GTX 200 cards and the Tesla cards based on them.
We have to note some things:

1. Double precision support corresponds to SM 1.3 capabilities (see the cudaDeviceQuery sample for how to find out programmatically whether the GPU supports it). Note that this excludes all the GT 220, 240 and 250 chips based on the GT 200 architecture, which are shipping this fall in the Windows 7 refresh of notebooks: they all expose only SM 1.2. Also note that, due to lame naming, some notebook GPUs (such as the GTX 260M) whose name would suggest a GTX 200 core are in fact based on G9x chips and as such do not support double precision.
In fact no Nvidia mobile chip currently ships with double precision support (excluding the possibility of some crazy desktop replacement notebook shipping with desktop parts).
DPFP is also lacking in ION (based on a G9x core) and presumably in ION 2 GPUs (based on a GT 220-250 core). Moving to the lower end, it is also lacking in Tegra (GeForce 6 based, and in fact without CUDA support) and presumably in Tegra 2 chips (G9x based, which should support CUDA and OpenCL, and expected to be announced at Mobile World Congress 2010 in February next year).
Summary of support:
G80, G9x -> No
GTX 200 -> Yes (GTX 260, GTX 275, GTX 280, GTX 285, GTX 295)
GT 200 (GT 220, 240, 250) -> No
GTX 200 Mobile -> No
Ion, Ion 2 -> No
Tegra 1/2 -> No

2. Currently Nvidia cards have low performance in double precision computations (say 1/8 of single precision performance). That amounts to 80-90 Gflops, which in turn is nearly double that of a high end quad core Nehalem/Penryn/Core 2 Quad at 2.66-3GHz.
That's going to change with Fermi (Nvidia's next GPU), which will run double precision at half its single precision rate. That in turn amounts to an 8x increase over GTX 200 cards (around 750 Gflops, assuming a core clock of 1.5GHz).

3. You have to pass a compiler option to nvcc to enable double precision support (-arch=sm_13). It's not enough to just declare your variables as double, since without the option they get demoted to float.
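
As a rough sketch of the check in point 1, the test boils down to comparing the compute capability against 1.3. The helper name below is hypothetical; in a real program the two numbers would come from the major and minor fields that cudaGetDeviceProperties fills in, which is what the cudaDeviceQuery sample prints. It is plain C so it can be tried without a GPU:

```c
/* Hypothetical helper: returns 1 when a compute capability of
   major.minor is at least 1.3, the first level with native doubles. */
int cuda_has_native_double(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}
```

With this, an SM 1.2 part such as the GT 220-250 chips is rejected, while an SM 1.3 GTX 200 card is accepted.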

CAL and Brook+
==============

CAL has supported double precision, if I remember correctly, since the end of 2007, i.e. since the first CAL release, and at the time it was a feature shipping only in 3870 cards.
Support in a non assembly level language came later, I think around the time of the 4xxx series.

Anyway, double precision support is only available on the high end series: 38xx, 48xx and 58xx.
Also note that double precision performance has been better than on Nvidia, i.e. approx. 1/4 or 1/5 of single precision depending on how you count. That is currently 544 Gflops on ATI 5870 cards.

Also note that, at least at the AMD IL level (for CAL), double precision is generally supported by prefixing a lot of single precision instructions with d (so you write dmad, dmul, dadd instead of mad, mul, add), and that the vector instructions, instead of operating on 4 elements at a time (the .x, .y, .z, .w components), operate on 2 elements. In that case .xy and .zw each store one double precision value using two 32-bit components.

DirectCompute
=============

Double precision support is one of the optional features of the Direct3D 11.0 API. It is only supported in compute shaders (hmm, I think) and only in shader model 5.0. Note that this prevents Nvidia from shipping Direct3D 11 drivers that expose double precision on Direct3D 10 level hardware through compute shaders 4.x, even though some of that hardware (read GTX 200) supports it.
Of course it's expected that Nvidia Fermi, which is a GPU designed for Direct3D 11 among other things, supports it.

Note that I will soon release a tool for checking support for this and all the other optional bits of a Direct3D 11 driver.
Currently AMD supports it in the 58xx series, but due to a buggy Microsoft shader compiler it's not currently feasible to work with.
Also, for the same reason as before, don't expect DirectCompute 4.0 enabled drivers to expose double support on 48xx cards
(see the post on Beyond3D).
It's expected that this gets fixed in the next release (say December 2009) or the one after (say March 2010, for GDC 2010).

Note that double precision support in DirectCompute is expected to work by simply declaring variables as double.

OpenCL
======

Currently, in OpenCL 1.0, double precision support is also an optional feature, exposed as an extension (cl_khr_fp64).

To enable it you need to request the extension in your kernel code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
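
On the host side you would normally check first whether the device reports the extension. A minimal sketch in C, assuming the space-separated string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS has already been fetched; has_extension is a hypothetical helper, not part of OpenCL:

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a whole,
   space-separated token inside `ext_list`, and 0 otherwise. */
int has_extension(const char *ext_list, const char *name)
{
    size_t n = strlen(name);
    const char *p = ext_list;

    if (n == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[n] == '\0') || (p[n] == ' ');
        if (starts_token && ends_token)
            return 1;
        p += n; /* partial match only, keep searching */
    }
    return 0;
}
```

Matching whole tokens rather than using strstr alone avoids a false positive when some longer extension name merely contains cl_khr_fp64.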

Currently only one CPU implementation and one GPU implementation support it:
CPU: Intel CPUs with SSSE3 (Core 2 and higher, Phenom..) with Apple's implementation in Snow Leopard
GPU: Nvidia 195.39 driver

Target availability on other platforms is fuzzier, and all that AMD engineers can say is that it will be available during 2010.
All this is related to the broad math library which is also exposed as part of this extension, and which carries a lot of functions with very strict precision requirements. Don't expect all of these functions to be supported directly by any hardware implementation, so all of it has to be coded, tested and validated, which takes time. Fortunately, AMD engineers have said to expect support to come gradually, and perhaps we can expect to be able to add, subtract and multiply double precision values by the end of the year. But it seems it will be hacky, since you can't expect the extension to be reported as supported.

Note also that in previous betas of the AMD SDK, at least for the CPU, it was possible to declare variables as double, but they were demoted to float. Now it fails,
saying you have to use:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
and if you use it, it then says it's not supported.

I expect Fermi to be the first to come with support for this extension (at least full support), as Nvidia is claiming such strong double precision support in Fermi, and also because of Nvidia's lead in exposing OpenCL extensions at the moment. With luck we could have it towards very late 2009.

UPDATED: Nvidia supports it in 195.39

OpenGL GLSL
===========

Yeah, even today, in GLSL, one of the venerable ways of doing GPGPU computations, it is possible to do computations in double precision.

First remember that in GLSL, and by the way also in OpenGL ES 2.0, there is a precision qualifier for variables. Also note that double is a reserved keyword of the language, at least in its latest incarnations, so the road to supporting it is well prepared. In fact, for the since cancelled Longs Peak, i.e. the full redesign of OpenGL, double precision support was supposedly coming (say between mid 2007 and mid 2008), but it was finally dropped from the plans.
Anyway, AMD supports double precision in GLSL shaders on its cards that support it (see the CAL section for more info), although for whatever reason it has not been fully publicized. In fact I found it when reading some Russian forums, which date the support back to March or April 2009 (so in 9.3 or 9.4).
OK, back then I added support for it to a Mandelbrot program given to me by my brother, which already handled single precision very efficiently.
The miraculous words were, back then, to use this precision qualifier:
__doublepAMDX
so instead of a float variable temp:
float temp;
use:
__doublepAMDX float temp;

similar for vecs:
vec2 pd;
to:
__doublepAMDX vec2 pd;

Note that AMDX denotes an AMD experimental extension. When I checked support for it on 58xx (with 9.10 or 9.11) the shader failed to compile;
now the correct way is: __doublepAMD. So it seems to have left the experimental stage.

Note that the support is broader than I have told you: there are also functions to pass double precision values to uniforms, at least.
In fact this is necessary in my Mandelbrot shader to pass the window coordinates of the zoomed area (or a similar value). If not, once you zoom past what single precision allows, image quality is still fine because the computation runs in double precision, but the precision of the zoom itself suffers.
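
To see why the coordinates need more than single precision, here is a small C sketch (the helper names are mine, not part of any API). It checks whether moving a coordinate by one pixel step still changes the value; once the step is smaller than the spacing between adjacent floats, neighbouring pixels collapse onto the same coordinate:

```c
/* Returns 1 if adding `step` to `center` still changes the value
   when everything is held in single precision. The volatile store
   forces the sum to be rounded to float before the comparison. */
int float_resolves(float center, float step)
{
    volatile float moved = center + step;
    return moved != center;
}

/* Same check with the coordinate held in double precision. */
int double_resolves(double center, double step)
{
    volatile double moved = center + step;
    return moved != center;
}
```

Around 0.3 the spacing between adjacent floats is about 3e-8, so float_resolves(0.3f, 1e-9f) returns 0 while double_resolves(0.3, 1e-9) returns 1.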

We also think that Nvidia is going to add double precision to GLSL shaders soon, as some header from the Nvidia Linux drivers comes with:
ext_gpu_shader_fp64

Another topic is when that support will get standardized (as an EXT or ARB extension, or promoted into the core specification). Hopefully with a Direct3D 11 like OpenGL, say OpenGL 4.0, in 1 to 2 years' time.

UPDATED: Nvidia confirmed it for Fermi (GTC videos); seen in 195.39

Emulating double support
========================

The last chance you have when a feature is not supported is to try to emulate it.
Fortunately this is possible by using two single precision values per number (GLSL functions):

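// A double-single value lives in a vec2: .y holds the high part, .x the low part.
vec2 dblsgl_add (vec2 x, vec2 y)
{
    vec2 z;
    float t1, t2, e;

    t1 = x.y + y.y;                 // sum of the high parts
    e = t1 - x.y;
    // rounding error of t1, plus the two low parts
    t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
    z.y = e = t1 + t2;              // renormalize into high and low parts
    z.x = t2 - (e - t1);
    return z;
}

vec2 dblsgl_mul (vec2 x, vec2 y)
{
    vec2 z;
    float up, vp, u1, u2, v1, v2, mh, ml;

    // split each high part into two halves (4097 = 2^12 + 1)
    up = x.y * 4097.0;
    u1 = (x.y - up) + up;
    u2 = x.y - u1;
    vp = y.y * 4097.0;
    v1 = (y.y - vp) + vp;
    v2 = y.y - v1;
    //mh = __fmul_rn(x.y,y.y);
    mh = x.y*y.y;
    // rounding error of mh, recovered from the partial products
    ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
    //ml = (__fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;
    ml = (x.y*y.x + x.x*y.y) + ml;  // cross terms with the low parts

    z.y = up = mh + ml;
    z.x = (mh - up) + ml;
    return z;
}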

These two functions add and multiply two values, each stored in a vec2 (so a vec4 can carry two of them, one in .xy and the other in .zw).
The good part is that a double precision value can be initialized from a single precision value by copying it to the second component (.y or .w) and setting the other one (.x or .z) to zero.

Examples:

1. Initializing:
posd.y=position.x;
posd.x=0.0;

2. Adding and multiplying 2 DP values:

dblsgl_add(posd.xy,vec2(0.0,offsetX));
dblsgl_mul(zoomv,posd.zw);

3. Subtracting one from the other is easy using -y:
dblsgl_add(x,-y);
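
If you want to try the two functions away from a GPU, here is a sketch of a C port (my own translation, with a hypothetical dblsgl struct in place of vec2, keeping the same layout: hi corresponds to .y and lo to .x). It assumes float arithmetic is really carried out in single precision, so compile it without -ffast-math or similar options, since those let the compiler simplify the error terms away:

```c
/* Double-single value: hi holds the leading bits, lo the rounding error.
   Same layout as the vec2 in the GLSL functions: hi is .y, lo is .x. */
typedef struct { float lo, hi; } dblsgl;

/* Widen a plain float: the value goes to hi, lo starts at zero. */
dblsgl dblsgl_from_float(float f)
{
    dblsgl z = { 0.0f, f };
    return z;
}

/* Read a double-single back as a double, for checking results. */
double dblsgl_to_double(dblsgl v)
{
    return (double)v.hi + (double)v.lo;
}

dblsgl dblsgl_add(dblsgl x, dblsgl y)
{
    dblsgl z;
    float t1, t2, e;

    t1 = x.hi + y.hi;               /* sum of the high parts */
    e = t1 - x.hi;
    /* rounding error of t1, plus the two low parts */
    t2 = ((y.hi - e) + (x.hi - (t1 - e))) + x.lo + y.lo;
    z.hi = e = t1 + t2;             /* renormalize */
    z.lo = t2 - (e - t1);
    return z;
}

dblsgl dblsgl_mul(dblsgl x, dblsgl y)
{
    dblsgl z;
    float up, vp, u1, u2, v1, v2, mh, ml;

    /* split each high part into two halves (4097 = 2^12 + 1) */
    up = x.hi * 4097.0f;
    u1 = (x.hi - up) + up;
    u2 = x.hi - u1;
    vp = y.hi * 4097.0f;
    v1 = (y.hi - vp) + vp;
    v2 = y.hi - v1;
    mh = x.hi * y.hi;
    /* rounding error of mh, recovered from the partial products */
    ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
    ml = (x.hi * y.lo + x.lo * y.hi) + ml;  /* cross terms */

    z.hi = up = mh + ml;
    z.lo = (mh - up) + ml;
    return z;
}
```

For example, squaring 1 + 2^-20 in plain float loses the 2^-40 term of the result, while dblsgl_mul keeps it in lo, so dblsgl_to_double returns the full product.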

Link cuda forums:

Using similar ideas, albeit more complicated, you can also get quad precision support from 1 vector of 4 single precision values.
Search:

The performance in Radeon 5850:

Emulated double: 15fps
Double: 35fps
Float: 127fps

In GTX 275: (TBD)

So at least in Mandelbrot you can expect a 2.33x slowdown from emulating it compared with native double.
Note also that native double vs float comes out as a 3.6x slowdown.
Finally, note that on an Nvidia GTX 275, where the float to double slowdown goes up to 8x, emulating double through single precision is perhaps close in performance to native double precision.

Also, one thing to note regarding the Nvidia vs AMD GLSL compilers is that the Nvidia one makes arithmetic simplifications which must not be done for the emulation to work. Currently, to overcome that, we use the Cg runtime, which supports disabling math optimizations in shaders by passing flags to the shader compiler, and which also seems to digest GLSL kernels well: compile them with cgCreateProgramFromFile, selecting the GLSL profiles CG_PROFILE_GLSLV and CG_PROFILE_GLSLF, using CG_OBJECT, and
passing as parameters {"-oglsl","-bestprecision"}. Here -bestprecision is the flag that disables math optimizations, and with it the emulated double precision works like a charm.

Note that this way doesn't work correctly on AMD cards, so we must use a different way for each vendor.

This last observation shows that the Nvidia GLSL compiler is cleverer at optimizing than the AMD one, although that's not a good thing in this specific example.

Using doubles with textures
===========================

At least with CUDA it is possible to pass double arrays through textures by declaring them as int2 textures without filtering.
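
The trick works because a texture fetch without filtering returns the stored bits untouched, so the two 32-bit halves of each double can be fetched as an int2 and glued back together; in a CUDA kernel that is typically done with tex1Dfetch followed by the __hiloint2double(hi, lo) intrinsic. The C sketch below only illustrates the bit layout on the host. The int2_words struct and both helpers are hypothetical stand-ins, and they mirror the usual little-endian layout, where .x holds the low word:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for CUDA's int2: two 32-bit words. */
typedef struct { int32_t x, y; } int2_words;

/* Split a double into its low (.x) and high (.y) 32-bit words. */
int2_words double_to_int2(double d)
{
    uint64_t bits;
    int2_words w;

    memcpy(&bits, &d, sizeof bits);
    w.x = (int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
    w.y = (int32_t)(uint32_t)(bits >> 32);
    return w;
}

/* Rebuild the double from the two words, which is the job
   __hiloint2double(w.y, w.x) does on the device. */
double int2_to_double(int2_words w)
{
    uint64_t bits = ((uint64_t)(uint32_t)w.y << 32) | (uint32_t)w.x;
    double d;

    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Filtering has to stay off because interpolating between the raw words of neighbouring texels would produce bit patterns that no longer correspond to the stored doubles.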

Sunday, 25 October 2009
