Support 3d image write on CUDA and with OpenCL wrapper ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

One of the exciting things coming this fall is that new and updated GPU computing APIs and hardware will provide native support for executing 3D simulations.

By that I mean simulations running on 3D grids, such as 3D wave equations, Kirchhoff reverse time migration, 3D fluid simulations, etc.

Until now this has been as tricky as running GPGPU programs with GLSL, i.e. it wasn't naturally supported by hardware. Let me explain in more detail why I think so:

First, notice that 2D simulations have always mapped naturally onto hardware, as 2D textures, renderbuffers and framebuffers are 2D arrays of data natively supported by hardware. 2D textures also offer some great features that are nice for 2D simulations:

* First, the texture samplers provide clamping and filtering modes in hardware. These can avoid special-case code for treating boundaries in the simulation, and also let you use the additional ALU power exposed by the texture filtering hardware (i.e. computing limited linear combinations of neighboring values).

* Second, they have special cache resources devoted to improving accesses with 2D spatial coherence, and as such they can provide more bandwidth than expected in this situation (say, fetching a value and its 4 neighboring values).

* Third, they support addressing by (x,y) coordinates, so no ALU power is spent transforming 2D coordinates into a linear memory position: p[x+y*width].
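
The address arithmetic that 2D texture addressing saves can be sketched in a few lines of Python. This is only an illustration of the p[x+y*width] mapping on the CPU; the helper names to_linear and to_2d are mine, not part of any API.

```python
def to_linear(x, y, width):
    """Map 2D coordinates to a row-major linear offset, as in p[x + y*width]."""
    return x + y * width


def to_2d(index, width):
    """Inverse mapping: recover (x, y) from a row-major linear offset."""
    return index % width, index // width


width, height = 8, 4
grid = list(range(width * height))  # row-major storage, each value equals its offset

x, y = 5, 2
offset = to_linear(x, y, width)
print(offset, grid[offset], to_2d(offset, width))
```

With linear memory every access pays for the multiply and add (and for the modulo and division when going back), while a texture fetch takes (x,y) directly.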

Textures were in principle conceived as read-only resources (for the GPU) and were only written to from CPU code, but with techniques such as render to texture and FBOs (in the OpenGL world) shaders were allowed to write to them. With MRT (multiple render targets) it also became possible to bind multiple textures for a rendering pass to write to.

Now, as fragment shaders are naturally executed as elements of a 2D grid, we can naturally (i.e. with high performance) express a 2D wave simulation using, for example, a 2nd order FTD in time with three textures (2D arrays) of the size wished for the simulation (max 8Kx8K for Direct3D 10 and 10.1 hardware and 16Kx16K for D3D 11 hardware).
Then at every simulation step we set up an FBO and attach one texture as the framebuffer for rendering (write to texture), while the other two are passed to the shader as read-only textures, and afterwards we rotate the textures.
The shader only has to run the FTD code.
Note that in this case we don't need to use MRT at all.
Better yet, we don't even need three textures for that example, as we could simply use one texture with three channels (components), i.e. an RGB texture.
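
As a rough CPU-side sketch of that three-texture scheme, here is the update in plain Python. I am assuming that FTD means the usual second-order finite-difference update u(t+1) = 2u(t) - u(t-1) + c2 * laplacian(u(t)); the names wave_step, simulate and c2 are made up for the example, and boundary cells are simply held fixed.

```python
def make_grid(w, h):
    """Allocate a w x h grid of zeros (stands in for one texture)."""
    return [[0.0] * w for _ in range(h)]


def wave_step(u_prev, u_curr, u_next, c2):
    """Write time level t+1 into u_next, reading levels t (u_curr) and t-1 (u_prev).

    Only u_next is written, like the texture attached to the FBO; the other
    two are read-only inputs. Boundary cells are left untouched.
    """
    h, w = len(u_curr), len(u_curr[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (u_curr[y][x - 1] + u_curr[y][x + 1]
                   + u_curr[y - 1][x] + u_curr[y + 1][x]
                   - 4.0 * u_curr[y][x])
            u_next[y][x] = 2.0 * u_curr[y][x] - u_prev[y][x] + c2 * lap


def simulate(w, h, steps, c2=0.25):
    """Run the simulation, rotating the three buffers after every step."""
    u_prev, u_curr, u_next = make_grid(w, h), make_grid(w, h), make_grid(w, h)
    # Initial condition: a unit bump at the center, at rest (same at t-1 and t).
    u_prev[h // 2][w // 2] = 1.0
    u_curr[h // 2][w // 2] = 1.0
    for _ in range(steps):
        wave_step(u_prev, u_curr, u_next, c2)
        # Rotate roles: t becomes t-1, t+1 becomes t, old t-1 is reused as target.
        u_prev, u_curr, u_next = u_curr, u_next, u_prev
    return u_curr


result = simulate(5, 5, 1)
print(result[2])
```

On the GPU the two nested loops disappear: each fragment computes one (x, y) cell, and the rotation is just a change of which texture is attached to the FBO.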

Please also notice how big the texture limits are getting. In fact a 16K^2 texture with RGBA byte colors takes 1 GByte, and for a simulation an R (1 channel) float texture is the same size. So it would only just fit in a current top-end card such as the ATI 5870 with 1 GByte of RAM. Note that an 8K^2 wave simulation would just fit (assuming textures with 3 float components don't take 16 bytes per pixel).
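
To double check those sizes, here is the arithmetic as a small Python sketch. The helper name texture_bytes is mine, and 1 GByte is taken as 2^30 bytes.

```python
GIB = 1024 ** 3


def texture_bytes(side, channels, bytes_per_channel):
    """Size in bytes of a square texture with the given texel layout."""
    return side * side * channels * bytes_per_channel


# 16K x 16K, RGBA with one byte per channel
print(texture_bytes(16384, 4, 1) / GIB)
# 16K x 16K, one 32-bit float channel
print(texture_bytes(16384, 1, 4) / GIB)
# 8K x 8K, three float channels stored as 12 bytes per texel
print(texture_bytes(8192, 3, 4) / GIB)
# 8K x 8K, three float channels padded to 16 bytes per texel
print(texture_bytes(8192, 4, 4) / GIB)
```

The first two come out at exactly 1 GByte, the 12-byte layout at 0.75 GByte, and the padded 16-byte layout at 1 GByte again, which is why the padding matters on a 1 GByte card.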
As a reminder to myself, I often think that if I were to write a realistic flight simulator I would use as elevation data (DEM) the highest resolution currently available to the public (SRTM), which has 90 m between samples. A 16K^2 texture would then cover an area of about 1500 km x 1500 km, larger than my country (Spain, which measures approx. 1000 km x 1000 km), and would weigh 512 MBytes assuming 1 component of 16-bit floats. That would give an error of up to 4 m, as Everest is approx. 8000 m high and a 16-bit float has 11 bits of precision, but in my country a resolution of up to 2 m, as its mountains are lower than 4000 m.
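
Those DEM numbers can be checked with a short Python sketch. It uses the struct module's 'e' format (IEEE half precision) to measure the gap between neighboring 16-bit float values; the helper name half_spacing and the sample heights are just my choices for the example.

```python
import struct


def half_spacing(value):
    """Gap between value (rounded to half precision) and the next larger half float."""
    bits = struct.unpack("<H", struct.pack("<e", value))[0]
    lo = struct.unpack("<e", struct.pack("<H", bits))[0]
    hi = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return hi - lo


side = 16384
coverage_km = side * 90 / 1000        # 90 m between SRTM samples
size_mib = side * side * 2 / 1024 ** 2  # one 16-bit component per sample

print(coverage_km, size_mib)
print(half_spacing(3000.0))  # heights between 2048 m and 4096 m
print(half_spacing(8000.0))  # heights between 4096 m and 8192 m
print(half_spacing(8848.0))  # heights above 8192 m
```

The spacing is 2 m below 4096 m and 4 m up to 8192 m. Above 8192 m it grows to 8 m, so with round-to-nearest the worst-case error at the actual height of Everest (8848 m) is still about 4 m.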

Also note that 1D textures exist too (3D texture stuff comes later) and have even higher limits: I think up to 2^24 pixels, which can be up to 256 MBytes (using 4 float components).

Notice also that this kind of simulation maps well because it can be written in a form where every simulation point is calculated from other points.
What I mean is that the kernel does a gather and not a scatter.

In fact all this stuff (2D simulations expressed as gather operations) works well in this arcane GPGPU world.
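
To make the gather/scatter distinction concrete, here is a tiny 1D Python sketch (a made-up 3-point sum, not the wave kernel itself). In the gather form each output cell reads its neighbors and writes only itself; in the scatter form each input cell adds its contribution into the cells it affects.

```python
def gather(src):
    """Each output cell pulls from its neighbors and writes only its own slot."""
    n = len(src)
    dst = [0.0] * n
    for i in range(n):
        left = src[i - 1] if i > 0 else 0.0
        right = src[i + 1] if i < n - 1 else 0.0
        dst[i] = left + src[i] + right
    return dst


def scatter(src):
    """Each input cell pushes its value into every output slot it affects."""
    n = len(src)
    dst = [0.0] * n
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                # With one thread per i, several threads hit the same dst[j],
                # so on a GPU this accumulation would have to be atomic.
                dst[j] += src[i]
    return dst


data = [1.0, 2.0, 3.0, 4.0]
print(gather(data))
print(scatter(data))
```

Both produce the same result, but only the gather form fits a fragment shader, which can write just to its own output position.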

Then came CUDA, which among other things exposed the ability to do scatter, i.e. writing to arbitrary positions from kernels. Memory was also exposed as byte addressable, at least in this model. This (general access to memory, i.e. scatter and byte addressability) is one of the three key innovations of CUDA, along with the ability to group threads and synchronize between them (the second one) and to use fast shared memory between them (the third one).

All in all we can say this didn't bring dramatic changes (in concept) to a 2D wave simulation solver (as threads in a CUDA grid are also arranged as a 2D grid, see more later), other than perhaps grouping threads to cooperatively read blocks of memory and then reading from those, which would surely lessen memory bandwidth usage.
In fact, although it would not be very efficient, it's now possible with CUDA and global memory atomics (in CUDA 1.1, December 2007) to express the wave kernel as a scatter one, where a position atomically updates all the other positions it affects (which amounts to atomically incrementing a memory location). Better yet, with compute capability 1.2 devices (in CUDA 2.0, June 2008), we can update atomically in shared memory.

Up to now in CUDA we have been using "normal" global linear memory as the 2D array and forgetting about textures. We lessen memory bandwidth usage by using shared memory, but we must spend ALU power transforming (x,y) addressing into a linear memory position and vice versa, and we also lose the texture cache optimized for 2D usage. We must say at this point that this is the classic CUDA tradeoff of "computing mode" vs "graphics mode". In fact we also lose the power of the ROPs and the rasterizer in CUDA mode. Anyway, textures have also been present since CUDA 1.1, but only as read-only resources, as in the OpenGL days before all the render to texture techniques (via pbuffers (2002-2003) and FBOs (2005 and up)). This lasted until CUDA 2.2 (May 2009), where the texture from pitch linear memory functionality was added (see the DDJ article CUDA for the masses).

With this you can allocate, from linear memory, memory that you can write from kernels using special (x,y) access (taking care of the pitch) and then bind directly to CUDA textures (without copies, I think) to use in kernels as read-only, avoiding (x,y) address calculations and using the texture caches. Also, presumably (I sometimes feel like a speaker with more intuition than experience, which in turn can be bad), you don't need to use shared memory to improve memory bandwidth, and can use it as a cache for other things in the kernel.
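
"Taking care of the pitch" means that each row is padded to an aligned size, so the byte offset of element (x,y) is y*pitch + x*element_size rather than (x + y*width)*element_size. A minimal Python sketch of that arithmetic follows. The 256-byte alignment is only an assumption for illustration; in real code the pitch is whatever cudaMallocPitch returns, and the helper names are mine.

```python
def pitch_for(width, elem_size, alignment=256):
    """Row size in bytes rounded up to the alignment (alignment value is assumed)."""
    row_bytes = width * elem_size
    return (row_bytes + alignment - 1) // alignment * alignment


def pitched_offset(x, y, pitch, elem_size):
    """Byte offset of element (x, y) inside a pitched 2D allocation."""
    return y * pitch + x * elem_size


width, elem_size = 100, 4               # 100 floats per row = 400 bytes
pitch = pitch_for(width, elem_size)     # padded row size in bytes
print(pitch, pitched_offset(3, 2, pitch, elem_size))
```

A kernel writing to such memory does this calculation itself, while a kernel reading it through a bound texture just passes (x,y).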

All in all this is everything we can do up to now (in the CUDA world), and that is very good.

In a 2D world we can say that perhaps the "biggest" limitation right now comes when we want to display the simulation results using a graphics API, i.e. interoperability with the graphics APIs. You would tell me that CUDA supports interop with OpenGL, and that since CUDA 2.1 (june 2009) OpenGL interop is pretty efficient, doing copies in GPU memory, whereas before then it went through system memory.
That's true, but the CUDA texture still has to be passed through a PBO and then copied from there to an OpenGL texture.
Note that I think this is going to be improved soon (since it was said by mistake that texture buffer object interop was supported in CUDA 2.2).
I'm also assuming that there are no issues using OpenGL interop with textures created from pitch linear memory (I strongly believe that, but I could also be wrong, and/or things could be worse when direct texture interop is present).
Also be aware that to use interop from a running kernel working on linear memory you would have to either create a CUDA texture object and use interop as explained, or send the data to the CPU and pass it to glTex(Sub)Image2D. Either way CPU transfers are involved.

Also remember that DirectX interop with textures seems better and doesn't seem to need device copies (i.e. the same memory is used in CUDA and Direct3D).

Also remember that another important issue can be the GPU switching between modes (in this example, switching between CUDA mode and OpenGL mode).
I recall reading in the Nvidia forums an employee saying not to switch, if possible, more than a few times every frame. That makes sense now that Fermi is said to improve this switch time by 10 times, while still being in the order of tens of usec (TODO).

Also remember that graphics interop is one of the key points of the very latest GPU computing APIs (DirectCompute and OpenCL), which enable using the same resources within graphics shaders and within compute shaders, avoiding copies.

Another issue can come up in multi-GPU scenarios in which one GPU is used for simulating and another one is used for displaying.

Special care has been taken very recently by both Nvidia and AMD for multi-GPU scenarios.
First, Nvidia added in CUDA 2.3 "improved efficiency using Tesla cards for computing and Quadro for displaying", which is pretty vague. Bah, and that's for rich people, you would say. OK, let's go lower level and read the new extensions.

In the subsequent ForceWare 190 series, two extensions have recently been added:

GL_NV_copy_image (july 2009)
GL_NV_texture_barrier (august 2009)

The first is THE ONE for supporting efficient texture transfers between devices:

"This extension enables efficient image data transfer between image
objects (i.e. textures and renderbuffers) without the need to bind
the objects or otherwise configure the rendering pipeline. The
WGL and GLX versions allow copying between images in different
contexts, even if those contexts are in different sharelists or
even on different physical devices."

Note that this is multi-OS (Linux and Windows) and that you can also specify rectangular subregions to copy via CopyImageSubDataNV.

I think this feature is what is used for efficient Tesla+Quadro interop, or at least in the new Quadro Digital Video Pipeline.

But also remember that this exposes OpenGL multi-GPU interop, not CUDA multi-GPU interop, so it would require interop between the OpenGL texture and the CUDA texture, which is coming as said above.

Now let's return to our OpenGL-only GPGPU world:

Let's talk about the other new extension GL_NV_texture_barrier:
"This extension relaxes the restrictions on rendering to a currently
bound texture and provides a mechanism to avoid read-after-write
hazards."

So this extension also provides the ability to read from textures while they are being written, although in a rather limited way. For example, with this extension it is fine to read the current value from the texture and then write to it based on this value and possibly on values from other textures.
So this is really good for our 2D wave equation solver: where we were using three textures we could really use two, updating the t-2 texture in place so that it becomes the t texture, while the t-1 texture becomes the t-2 texture.
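
Why two buffers are enough can be seen in a CPU-side Python sketch. I am assuming the usual second-order update u(t) = 2u(t-1) - u(t-2) + c2 * laplacian(u(t-1)): each cell needs neighbors only from the t-1 buffer and only its own value from the t-2 buffer, so the t-2 buffer can be overwritten in place. The names wave_step_in_place, simulate_two_buffers and c2 are hypothetical, and boundary cells are held fixed.

```python
def make_grid(w, h):
    """Allocate a w x h grid of zeros (stands in for one texture)."""
    return [[0.0] * w for _ in range(h)]


def wave_step_in_place(u_tm2, u_tm1, c2):
    """Overwrite u_tm2 (level t-2) with level t.

    Neighbors are read only from u_tm1; from u_tm2 each cell reads just the
    value at its own position before overwriting it, so there is no hazard
    between cells.
    """
    h, w = len(u_tm1), len(u_tm1[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (u_tm1[y][x - 1] + u_tm1[y][x + 1]
                   + u_tm1[y - 1][x] + u_tm1[y + 1][x]
                   - 4.0 * u_tm1[y][x])
            u_tm2[y][x] = 2.0 * u_tm1[y][x] - u_tm2[y][x] + c2 * lap


def simulate_two_buffers(w, h, steps, c2=0.25):
    """Run the simulation with two buffers, swapping their roles every step."""
    u_tm2, u_tm1 = make_grid(w, h), make_grid(w, h)
    # Initial condition: a unit bump at the center, at rest.
    u_tm2[h // 2][w // 2] = 1.0
    u_tm1[h // 2][w // 2] = 1.0
    for _ in range(steps):
        wave_step_in_place(u_tm2, u_tm1, c2)
        # The overwritten buffer now holds t (the next t-1); old t-1 becomes t-2.
        u_tm2, u_tm1 = u_tm1, u_tm2
    return u_tm1


print(simulate_two_buffers(5, 5, 1)[2])
```

This read-own-texel-then-write pattern is exactly the limited case the extension permits.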

It is also good for OIT with shaders supporting scatter (see below for shader scatter and an upcoming OIT article).

This example use given for the extension also seems good:

Another application is to render-to-texture algorithms that ping-pong
between two textures, using the result of one rendering pass as the input
to the next. Existing mechanisms require expensive FBO Binds, DrawBuffer
changes, or FBO attachment changes to safely swap the render target and
texture. With texture barriers, layered geometry shader rendering, and
texture arrays, an application can very cheaply ping-pong between two
layers of a single texture. i.e.

    X = 0;
    // Bind the array texture to a texture unit
    // Attach the array texture to an FBO using FramebufferTexture3D
    while (!done) {
        // Stuff X in a constant, vertex attrib, etc.
        Draw -
            Texturing from layer X;
            Writing gl_Layer = 1 - X in the geometry shader;

        TextureBarrierNV();
        X = 1 - X;
    }

However, be warned that this requires geometry shaders and hence adds
the overhead that all geometry must pass through an additional program
stage, so an application using large amounts of geometry could become
geometry-limited or more shader-limited.

In fact this extension explains how 2D simulations needing more than one texture can optimize performance by attaching the textures as a texture array and using a geometry shader to select which layer to write to (reducing texture ping-pong time), and using TextureBarrierNV() to ensure the texture state is up to date.

All of this is related to OpenGL, and in the first case to multi-GPU. But please recall that in CUDA, for multi-GPU scenarios, you can perhaps do two other things:

Assuming you don't need graphics visualization (interop), the best thing theoretically (at least for the 2D wave simulation) is to split the load and work with pinned mapped system memory (which is linear memory) shared between devices. The mapped pinned system memory feature is in turn a feature of CUDA 2.2 (available on compute capability 1.2 devices and up), and it allows very important things:

* The GPU operates directly on system memory (read and write), and transfers are done without CPU intervention via GPU DMA at very high speed (up to 80% of the PCI Express theoretical bandwidth).
* It avoids explicit memory transfers: data is transferred when you need it, using all the memory latency hiding techniques available in the GPU architecture (i.e. using execution resources while waiting for memory to be available).
* It is said to improve performance on Vista and Windows 7 with WDDM drivers, where Windows manages GPU memory operations (said by an Nvidia engineer).

Note 1: pinned memory refers to transferring data with DMA without CPU intervention, achieving very high speed transfers, and has been available since CUDA 1.0.

Note 2: CUDA 1.1 added executing kernels and doing memory transfers independently, without CPU usage.

If you think about it, mapped pinned system memory is more fine-grained than "simultaneous kernel execution and H2D or D2H transfers", as it can execute the kernel as memory requests are satisfied instead of waiting for the full transfer, and it gives this benefit within the simple model of:
1. transfer memory data to the device
2. execute the kernel using this data
3. transfer memory data to the host
whereas to use "simultaneous kernel execution and H2D or D2H transfers" you have to create two or more streams (or command queues in OpenCL parlance) following this model.

Note 3: Fermi and CUDA 3.0 will add simultaneous bidirectional transfers to the mix, which would in theory be usable both with pinned system memory and with streams.
Note 4: Fermi and CUDA 3.0 will also add simultaneous execution of kernels, which if you think about it is only usable with multiple streams, and allows arbitrary simultaneous execution of any two steps, one from each of two streams:
K-M (since CUDA 1.1)
M(H2D)-M(D2H) (CUDA 3.0)
K-K (since CUDA 3.0)
What I don't know for sure is whether two memory transfers in the same direction (D2H or H2D) will be executed in parallel. Anyway, this would not improve things, as the PCIe bus has a bandwidth in each direction that is already well used by a single stream, at least once a certain minimum transfer size is reached (see the bandwidthTest SDK sample in shmoo mode).

Anyway, one important thing also present in CUDA 2.2 is the "shared" part of shared pinned memory. Shared refers to host memory being pinned for more than one device (whether mapped or not).
With this feature the same host memory (where the 2D wave simulation data stays) can be used by the DMA engines of multiple cards. Before that, a host area was pinned for only one device and not the other, so if data needed to be transferred from one GPU to another, one leg (to host) would go at 5.5 GBytes/s and the leg to the other device at perhaps 3 GBytes/s (well, with Nehalem's brutal memory improvements, non-pinned memory does well, above 5 GBytes/s).
Note that this would not have been very bad for a fluid simulation either, as each GPU would allocate a region as pinned memory and only the boundary data would be transferred more slowly (say 1 row, for example; it depends on the order of the discretization along the y axis).
Adding shared system memory to the mix, each GPU now transfers at full speed, and no CPU intervention is needed for coordinating the steps:

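in parallel {
    GPU1 {
        while (true) {
            does one step and, when finished, increments byte1 by 1 modulo 3 in shared system mem
            waits until byte2 has caught up with byte1 (i.e. keeps waiting while byte2 == (byte1 + 2) modulo 3)
        }
    }
    GPU2 {
        while (true) {
            does one step and, when finished, increments byte2 by 1 modulo 3 in shared system mem
            waits until byte1 has caught up with byte2 (i.e. keeps waiting while byte1 == (byte2 + 2) modulo 3)
        }
    }
}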
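
This handshake can be simulated on the CPU with two Python threads standing in for the two GPUs. It is only a sketch of the coordination idea, not GPU code. Note that waiting for strict equality of the two bytes could deadlock if one side advances before the other has checked, so here each side waits only while its peer is exactly one step behind, which is what counting modulo 3 lets you distinguish. The names run_lockstep, done and violations are mine, and done exists purely to verify the behavior.

```python
import threading
import time


def run_lockstep(steps):
    counters = [0, 0]   # byte1 and byte2: finished steps, counted modulo 3
    done = [0, 0]       # true number of finished steps, for verification only
    violations = []     # (worker, step) pairs that started too early

    def worker(me):
        other = 1 - me
        for step in range(steps):
            # Before starting step N the peer must already have finished step N-1.
            if done[other] < step:
                violations.append((me, step))
            # "Does one step" (the simulation work itself is omitted here).
            done[me] += 1
            counters[me] = (counters[me] + 1) % 3
            # Wait while the peer is still one step behind us.
            while counters[other] == (counters[me] - 1) % 3:
                time.sleep(0)  # yield to the other thread

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, violations


done, violations = run_lockstep(20)
print(done, violations)
```

Because neither side can get more than one step ahead, the difference between the two counters is always -1, 0 or +1, and three states are enough to tell them apart.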

Now let's return to the OpenGL-only world. Remember we said before that in CUDA we can run scatter kernels thanks to the availability of scatter operations and atomic operations (but only in linear memory). These are not supported in OpenGL. Well, not exactly: one limitation is already going to be removed soon. In fact an OpenGL extension shows up in the Catalyst 9.10 drivers with the Radeon 5850:
AMDX_random_access_target
and I have info from AMD that the DX11 OGL extensions are planned to be released (documentation about the extensions) by early 2010.
And Nvidia is not sleeping either with its OGL extensions for Fermi, as shader model 5 extensions were also leaked (unintentionally, I think) in a header of some Nvidia Linux driver in late May 2009
(in fact there are more, see the post copied from my post on the OpenGL forums).

This will surely add renderbuffers and/or textures (possibly also 1D buffers) which are writable at random positions, in addition to MRT, which only allows writing to the fragment position implicitly given to the shader. This would also allow writing to 3D textures (more later). It will in fact also allow doing OIT in an OpenGL world, using OpenCL for some other things and interop to use data from one API in the other (find more in another post).

But for running our scatter-based kernel we need atomic operations. Well, based on Direct3D Shader Model 5.0, and having seen that AMD is working on exposing gl_AMD_gpu_shader5, it seems that atomics in fragment shaders are coming too. As said, this is at least already exposed in Direct3D 11 Shader Model 5.0, which allows graphics shaders (at least fragment shaders) to have buffers/textures that are writable at arbitrary positions (in fact there are R/W byte buffers which are byte addressable like linear memory).
In fact Direct3D Shader Model 5.0 allows more: fragment shaders can use atomic operations (and possibly other types of shaders can too).

So in Shader Model 5.0 a scatter kernel for the 2D wave equation seems possible. Note that graphics shaders still lack the concept of thread groups, and consequently access to some "local memory" or synchronization between them.

Finally, note that with DirectCompute 5.0 you have all the needed things from CUDA (at least for single GPU usage): groups, shared memory, atomics on both global and local memory, and textures, which in turn can be writable with random access if desired. So all that remains to be investigated is multi-GPU scenarios. Graphics interop is also there.
To close, one thing that keeps me intrigued is the AnandTech 5870 review, where they run a Compute Shader 4.0 demo (the OceanCS demo from the Nvidia SDK) that is able to run on multiple GPUs and improve performance. I have to see whether it has multi-GPU support built in or whether that is in the AMD drivers, and also look at how the code is written.

Similarly, OpenCL has the same features, although you will need extensions.
So you have groups (core), shared memory (core if the reported local memory is > 0), atomics on both global memory (a group of global atomics extensions) and local memory (local atomics extensions), textures (check IMAGE_SUPPORT), and byte addressable memory (byte addressable extension), which in turn can be writable with random access if desired.
Note that all these features are presently in the Nvidia OpenCL implementation on CUDA SM 1.1 devices and should come to AMD 58xx cards over time (currently none is present)
(note that atomics are not supported on 48xx, and neither is byte addressable memory, but IMAGE SUPPORT should be there over time; local memory is tricky as it is restricted to no arbitrary writing).
Graphics interop is not present in implementations, see my last post.

Well, before switching to 3D wave simulation, let's talk about ATI some more:

CAL has texture support
ATI CAL has similar graphics interop:
TODO

Having talked of the l

See the CUDA forums.
See 3D waves.
Also optimize 3D access for warps, i.e. a function that computes addresses with a minimal number of instructions.

Saturday, 24 October 2009
