*Improved Nvidia GPU and InfiniBand interop: the same pinned memory can be used by both the GPU and the InfiniBand device (Mellanox drivers and a CUDA release around Q2 2010). This avoids an extra copy in host memory, or having to give up pinned memory. Still lacking: a general way of using GPU DMA to send to other DMA devices.
*Larrabee perf shown on dense matmul (SGEMM) and sparse matvec multiply:
On the SGEMM single-precision, dense matrix multiply test, Rattner showed Larrabee running at a peak of 417 gigaflops with half of its cores activated (presumably the 80-core processor the company was showing off last year); and with all of the cores turned on, it was able to hit 805 gigaflops. As the keynote was winding down, Rattner told the techies to overclock it, and was able to push a single Larrabee chip up to just over 1 teraflops, which is the design goal for the initial Larrabee co-processors.
Here's the next problem. Sparse matrix math is what is commonly needed in simulations involving cloth and water. And on that test, a Larrabee chip that was not overclocked was able to do between 7.9 and 8.1 gigaflops, depending on the test and the size of the matrices.
But what he did say is that the Ct dialect of C++ that Intel has created will be going into beta soon to help with the parallelization of C++ code to run on multicore and multithreaded processors, and more importantly, to spread code across CPUs and GPU-based co-processors in workstations and services to maximize performance as transparently as possible. Ct will work in conjunction with the CUDA environment from Nvidia for its GPUs and for the OpenCL environment being pushed by Advanced Micro Devices and others.
Intel is also cracking the issue of sharing data between Core and Xeon CPUs and Larrabee GPU co-processors. Future Core and Xeon chips will be able to create a virtual shared memory pool that both the CPU and GPU can access so datasets are not crunched down, serialized, and moved over the PCI-Express bus from the CPU to the GPU and then back again after calculations are done. The shared virtual memory allows the CPU and GPU to work off the same data in sequence without any movement, which should radically improve performance and smooth out simulations.
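As a rough sanity check on the Larrabee figures quoted above, here is a small sketch of the usual flop-counting conventions (2·m·n·k for a dense matrix multiply, 2 flops per stored nonzero for a sparse matvec). The function names are mine, and the only inputs taken from the article are the 417, 805 and 7.9 to 8.1 gigaflops figures; nothing here is measured.

```python
def dense_matmul_flops(m, n, k):
    """Flops for C = A*B with A (m x k) and B (k x n): one multiply and one add per term."""
    return 2 * m * n * k


def sparse_matvec_flops(nnz):
    """Flops for y = A*x with nnz stored nonzeros: one multiply and one add per nonzero."""
    return 2 * nnz


def gflops(flops, seconds):
    """Convert an operation count and a wall-clock time into gigaflops."""
    return flops / seconds / 1e9


def scaling_efficiency(rate_before, rate_after, core_ratio):
    """Fraction of ideal speedup obtained when the core count grows by core_ratio."""
    return rate_after / (rate_before * core_ratio)


if __name__ == "__main__":
    half_cores, all_cores = 417.0, 805.0  # gigaflops, as quoted in the article
    print(f"scaling efficiency: {scaling_efficiency(half_cores, all_cores, 2):.3f}")
    print(f"dense vs sparse: {all_cores / 8.1:.0f}x to {all_cores / 7.9:.0f}x")
```

Going from half to all cores gives about 97% of ideal scaling, while the sparse test runs roughly 100 times slower than SGEMM, which is the usual sign of a memory-bound kernel.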
*CUDA 3.0 beta and drivers are now public.
*OpenMP work towards 3.1-4.0
*MAGMA 0.2 released without source (expected in December); still no OpenCL support. See the two PDFs linked after the feature list:
* LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double);
* LQ and QL factorizations in real arithmetic (single and double);
* Linear solvers based on LU, QR, and Cholesky in real arithmetic (single and double);
* Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in real arithmetic;
* Reduction to upper Hessenberg form in real arithmetic (single and double);
* MAGMA BLAS in real arithmetic (single and double), including gemm, gemv, symv, and trsm.
http://icl.cs.utk.edu/projectsfiles/magma/pubs/MAGMA-BLAS-SC09.pdf
http://icl.cs.utk.edu/projectsfiles/magma/docs/magma_roadmap.pdf
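The mixed-precision iterative refinement solvers in the MAGMA list rest on a well-known idea: factor and solve in fast single precision, then recover double-precision accuracy by correcting with residuals computed in double. Below is a minimal pure-Python sketch of that idea, not MAGMA's code or API. All function names are hypothetical, single precision is emulated by rounding through `struct`, and the matrix is assumed to be nonsingular and reasonably well conditioned.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU with partial pivoting; every stored result is rounded to single precision."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))  # perm[i] = original index of the row now in position i
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Solve L*U*x = P*b using the single-precision factors."""
    n = len(lu)
    x = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # forward substitution, L has a unit diagonal
        for j in range(i):
            x[i] = f32(x[i] - f32(lu[i][j] * x[j]))
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] = f32(x[i] - f32(lu[i][j] * x[j]))
        x[i] = f32(x[i] / lu[i][i])
    return x


def refine_solve(a, b, max_iters=20, tol=1e-12):
    """Factor once in single precision, then refine with double-precision residuals."""
    n = len(a)
    lu, perm = lu_factor_f32(a)
    x = lu_solve_f32(lu, perm, b)
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iters):
        # Residual r = b - A*x in full double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        d = lu_solve_f32(lu, perm, r)  # cheap correction reusing the factors
        x = [x[i] + d[i] for i in range(n)]
    return x


if __name__ == "__main__":
    a = [[4.1, 1.2, 2.3], [1.2, 5.4, 1.5], [2.3, 1.5, 6.6]]
    b = [8.6, -5.1, 19.1]
    print(refine_solve(a, b))
```

The expensive O(n^3) factorization happens only once, in the fast precision. Each refinement step costs only O(n^2), which is why this approach pays off on GPUs where single precision is much faster than double.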
*CULA 1.1: eigensolvers in the pro version.
Here is a subset of the improvements that have made it into this release:
* Exciting new functions including general Eigensolver (Premium Feature)
* Bridge interface for migrating currently existing LAPACK/MKL code
* Better documentation including a full API reference
* New examples constructed from user feedback
* More performance!
* Mac OS X support (Preview)
Now supports Mac, though only Leopard and single precision. What about Snow Leopard and double precision?
*OpenMM 1.0 beta released: preliminary OpenCL support, but still no binaries with it!
This release adds support for Particle Mesh Ewald, arbitrary forms for non-bonded interactions, and preliminary support for OpenCL.
*Apple OpenCL FFT lib: seems very high perf, but Mac only. Perf issues until 10.6.3?
Currently supports 1D, 2D, 3D batched complex-to-complex transforms (inverse and forward), both in-place and out-of-place.
Uses planar and interleaved data formats, but currently only supports transforms on a GPU device. The Accelerate framework can be used on the CPU.
The current version supports sizes that fit in device global memory, although a "Twist Kernel" is included in the FFT plan if the user wants to virtualize (implement sizes larger than what can fit in GPU global memory).
*gpu-z 0.37: shows DirectCompute (including the supported version) and OpenCL check boxes. OpenCL on ATI is not detected.
*gbench 1.0 released: based on the Jacket product for Matlab, similar to Matlab's built-in bench function, and it also works without Matlab.
Checks FFT, Dense blas, bench..
Benchmarks include six different tasks, common to the technical computing community:
1. LU: LU decomposition of 1024 x 1024 matrix
2. FFT: Fast Fourier Transform of a 2^20 x 1 vector
3. BLAS: Matrix multiplication of two 1024x1024 matrices
4. 3D Conv: Convolution of 64x64x64 array with 3x3x3 kernel
5. FOR/GFOR: Matrix-vector multiplication of 1024x1024x32 array
6. Equations: Solution of a system of 1024 equations
*3D Vision news:
->Avatar demo with 3D Vision built in is impressive, though it goes from 60 to 20 fps.
You have to use the D3D10 path, though; the D3D9 path seems fixed in 195.62.
->3D Vision on Linux is supported on Quadro cards with 195.22 (Quadro only; requires the mini-DIN connector, connected before X starts, no hotplug).
->3D Vision drivers 195.55 and higher ship with browser plugins (IE, Firefox) for 3D photos, with windowed support upcoming; see TweakTown.
*Nvidia publicly released the 195.62 WHQL candidate and 195.22 for Linux.
*AMD released Catalyst 9.11 WHQL; its CAL supports OpenCL.
*Direct3D 11 benchmark for Stalker
*PGI 2010: CUDA Fortran and the accelerator model for Windows and Mac, and stable for Linux
*Khronos OpenCL BOF presentations posted: especially interesting is the LANL PDF showing the perf of the VMD molecular code (electrostatic potential) on both Intel SSE multicore and OpenCL (CPU, AMD, NVIDIA and also Cell).
What you learn:
it shows perf issues on Cell caused by the lack of __constant, and how to overcome them
it shows perf tables for all these architectures
it points out key issues in OpenCL right now
Fermi as a GPU:
http://techreport.com/articles.x/17815
Posters about GPU computing
of GTC
of SC09
Porting an efficient bit library to CUDA (with preliminary perf)
http://bmagic.sourceforge.net/bmcudasse2.html
Implementing integer multiprecision in OpenCL
on cuda: "Implementation of Multiple-precision Modular Multiplication on GPU" Kaiyong Zhao
see poster:
http://www.nvidia.com/content/GTC/posters/87__Kaiyong_Implementation_of_Multiple-precision.png
There is also work by Daniel Bernstein on elliptic curves, and also work on RSA, both at the Eurocrypt 2009 conference.
Also of interest: MPIR on GPU.
Source code of DCGN – Message Passing on GPUs released (old news):
http://jeff.bleugris.com/journal/2009/06/02/looking-for-dcgn/
http://jeff.bleugris.com/journal/projects/
Note that I don't know if the code has been updated. If not, that is somewhat bad, since CUDA 2.2 introduced pinned host memory that the GPU can access directly. That avoids having the CPU poll the GPU with a cudaMemcpy GPU->CPU just to inspect whether the GPU has new things to do: now the GPU writes to CPU memory and the polling is done on CPU memory.
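To make the polling pattern concrete, here is a CPU-only analogy in Python, with two threads standing in for the GPU and the host. It is not DCGN's code and not CUDA. In a real program the mailbox would live in mapped pinned host memory (for example allocated with cudaHostAlloc and the cudaHostAllocMapped flag), and the ordering of the GPU's writes would need extra care. The mailbox layout and all names below are hypothetical.

```python
import threading
import time


def run_mailbox_demo(num_requests=5):
    """Simulate a device posting requests into a host-visible mailbox that the host polls."""
    # Stands in for mapped pinned host memory: two counters plus one payload slot.
    mailbox = {"posted": 0, "acked": 0, "payload": None}
    handled = []

    def device():
        # Plays the role of the GPU kernel: write the payload, then publish it.
        for i in range(1, num_requests + 1):
            while mailbox["acked"] != mailbox["posted"]:
                time.sleep(0)  # previous request not consumed yet
            mailbox["payload"] = f"request-{i}"
            mailbox["posted"] = i  # bump the counter only after the payload is in place

    def host():
        # Plays the role of the CPU thread: poll its own memory, no device-to-host copy.
        while len(handled) < num_requests:
            if mailbox["posted"] != mailbox["acked"]:
                handled.append(mailbox["payload"])
                mailbox["acked"] = mailbox["posted"]
            else:
                time.sleep(0)  # nothing new, yield and poll again

    threads = [threading.Thread(target=device), threading.Thread(target=host)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


if __name__ == "__main__":
    print(run_mailbox_demo(3))
```

The point of the design is that the host only ever reads its own memory in the polling loop. The older approach needed a device-to-host copy on every poll just to look at a flag.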
fem codes on CUDA:
http://sites.google.com/site/monkology/gpuprogramming-project3-final
papers/posters:
Fluids on GPU by Michael Griebel, a poster at GTC09
Indexing the internet with GPUs (CUDA Zone)
Posted on Apple OpenGL forums:
Here is a simple example that uses GLUT: it reads a PNG image (arg1), creates a source and dest texture/image, then uses a kernel to clip out the red.
example of simple cg-gl interop
This is interesting: a year ago I was looking into building WRF on Windows. There had been some efforts some years before, but overall it was a hacky port, and based on an old code base.
This was for testing the WRF perf of CUDA ports of the physics microkernels (WSM5).
There is a web page:
Now PGI has made my dreams come true and provides a very clean patch file for the latest WRF (3.1.1) for compiling with the latest PGI compilers: I think 9.0-4 or higher, but 10.0 should now also support it. Anyway, it's good news for Windows users, and I want to obtain a VS2008 port from this. It may need some work, but a lot less than ever. See the PGI October newsletter:
"Porting the Weather Research and Forecasting Application to Microsoft Windows Using PGI Workstation"
It is also good to know that this has been done for the same purpose I had in mind: testing WRF on the GPU, now with the Accelerator model.
There is another article in the same newsletter.
ATI 9.12 beta (8.68): XP only
Includes ATI CAL 1.4.492, vs. CAL 1.4.467 in OpenCL beta4
Windows guest drivers for KVM
http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers
Virtual texturing demos
http://linedef.com/personal/demos/?p=virtual-texturing
Hierarchical voxel rendering demo
http://linedef.com/personal/demos/?p=hierarchical-voxel-rendering