I hope you're interested in measuring OpenCL perf!
I expect to post some updates on my blog that should interest you.
Here are the things I plan to do in the coming weeks.
In case you're interested in measuring OpenCL perf on key scientific kernels, I will try to test:
*Peak flops: what perf I'm able to get using a kernel of mads (multiply-adds) on integers, floats, doubles and integers with 24 bits of precision
*Dense linear algebra: BLAS code (matmul, for example). Linpack uses BLAS calls
*FFT code (fast Fourier transforms): useful in image processing, compression and scientific simulations
*Sparse linear algebra: sparse matrix-vector product
*Multiprecision integers: cryptography
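To make the peak flops item concrete, here is a minimal sketch of how a mad-only kernel is usually scored: each multiply-add counts as 2 operations. The function name and the numbers in the usage line are hypothetical, not measurements, and for the integer variants the result is operations per second rather than true flops.

```python
def mad_gflops(work_items, mads_per_item, seconds):
    """Throughput in Gflop/s for a kernel made only of multiply-adds.

    Assumption: one mad (a * b + c) is counted as 2 operations,
    which is the usual convention for peak numbers.
    """
    total_ops = 2.0 * work_items * mads_per_item
    return total_ops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: 1M work-items, 4096 mads each, 20 ms kernel time.
    print("%.1f Gflop/s" % mad_gflops(1 << 20, 4096, 0.020))
```

The kernel time would come from the OpenCL event profiling timestamps or a host timer around the launch; the formula is the same either way.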
I'm posting soon (this week or next, I hope) benchmarking code for FFT (basically a Windows port of the Apple FFT library: http://developer.apple.com/mac/library/samplecode/OpenCL_FFT/index.html) and for achievable Gflops peaks (in single, integer, 24-bit integer and double operations).
Also, perhaps this month, some integer multiprecision perf and sparse matrix perf (conjugate gradients).
It may also interest you that I have already posted an older (but still interesting) benchmark of very efficient code for testing matmul perf (in CAL, CUDA and multicore SSE):
http://oscarbg.blogspot.com/2009/11/matmul-bench-for-cuda-cal-and-multicore.html
This tool is of particular interest now: the matmul perf of Larrabee was unveiled at SC09 by Rattner a week ago to be near 800 Gflops. Using my tool on a 5850 I get nearly 750 Gflops.
An Nvidia GTX 275 gets only 400-450 Gflops (Fermi will double that, I hope), so it seems all GPUs are currently similar in matmul perf.
Currently matmul perf on OpenCL seems to be low (at least 2x slower), but that may be because the OpenCL code currently available in the SDKs is not optimized for either Nvidia or AMD.
I will try to get an efficient port, at least for Nvidia, before the end of the year. Hopefully it will also be efficient on CPUs and AMD GPUs.
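For reference, Gflops figures for matmul are normally derived from the 2*N^3 operation count of an N x N by N x N product divided by the run time. The sketch below shows that arithmetic with a naive pure-Python matmul standing in for the real kernel; it is not the code of my benchmark tool, and the helper names are made up for illustration.

```python
import time


def naive_matmul(a, b):
    """Reference product of two dense matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def matmul_gflops(n, seconds):
    """Gflop/s of an n x n matmul, counting 2 * n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9


if __name__ == "__main__":
    n = 64  # tiny size; a GPU benchmark would use thousands
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    print("n=%d: %.4f Gflop/s" % (n, matmul_gflops(n, elapsed)))
```

Swapping the timed call for a CAL, CUDA or OpenCL kernel launch keeps the same formula, so numbers from different back ends stay comparable.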
Sparse matrix multiply perf on Larrabee was also unveiled: 8 Gflops.
Note, however, that the specific sparse matrix format was not disclosed (that can have an impact on perf). See the currently most efficient implementation on Nvidia GPUs (CUDA based): http://www.nvidia.com/object/nvidia_research_pub_013.html
The code is here:
http://cusp-library.googlecode.com/files/sc2009_spmv.zip
I will try to get binaries for Windows testing..
Currently you could test efficient CUDA code for this kernel using "Concurrent Number Cruncher" from Inria. Well, the provided binaries won't work because they were compiled with CUDA 1.0 (although CUDA is meant to be binary compatible, and seems to be in recent releases, binaries compiled for CUDA 1.0 apparently aren't), so you have to recompile with
the new CUDA libs. I have these new binaries and can supply you with them.
I'm getting between 3 and 5 Gflops, but note these kernels are bandwidth bound, not computation bound, so the result really depends on memory bandwidth and perhaps on other aspects of memory perf.
So Fermi, with cached global memory, will possibly get more than a 2x increase: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
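To see why this kernel is bandwidth bound, here is a small sketch of a sparse matrix-vector product plus a rough upper bound on its Gflops. It assumes the CSR format (my choice for illustration; other formats behave differently) and a simplified traffic model of 8 bytes read per nonzero (4-byte value plus 4-byte column index), ignoring reads of x and writes of y. The 100 GB/s bandwidth in the usage line is a hypothetical figure, not a measured one.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A * x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def spmv_bound_gflops(bandwidth_gb_s, bytes_per_nonzero=8.0):
    """Upper bound on Gflop/s: 2 flops per nonzero, limited by bytes moved."""
    return 2.0 * bandwidth_gb_s / bytes_per_nonzero


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0]]
    print(spmv_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(spmv_bound_gflops(100.0))  # hypothetical 100 GB/s device
```

With only 2 flops per 8 or more bytes fetched, the arithmetic units mostly wait on memory, which is why a cache in front of global memory can help more than extra compute.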
A limited OpenCL BLAS and conjugate gradient implementation seems to be available from:
http://sourceforge.net/projects/openclblas/
I still have to test it, though I remember seeing in a blog that it is Nvidia optimized.
Regarding OpenCL integer multiprecision, there is an excellent tutorial with code snippets here:
http://www.bealto.com/mp-gpu.html
Note he also benchmarks arithmetic kernels and memory perf with the latest 195 drivers (Nvidia only).
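As a rough idea of what a multiprecision integer kernel does, here is a minimal sketch of limb-by-limb addition with carry propagation, using 32-bit limbs stored least significant first. This is my own illustration, not code from the tutorial above, and the function names are hypothetical. On a GPU each work-item would typically run this loop on its own pair of numbers.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(n, count):
    """Split a non-negative int into `count` 32-bit limbs, low limb first."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Rebuild the integer from its limbs (low limb first)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mp_add(a, b):
    """Add two equal-length limb arrays; return (sum_limbs, carry_out)."""
    result, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        result.append(s & LIMB_MASK)  # keep the low 32 bits
        carry = s >> LIMB_BITS        # 0 or 1 goes to the next limb
    return result, carry


if __name__ == "__main__":
    a = to_limbs(2 ** 64 - 1, 2)
    b = to_limbs(1, 2)
    print(mp_add(a, b))  # the sum overflows two limbs, so carry_out is 1
```

Python integers hide the overflow that real 32-bit hardware would produce, so the explicit mask and shift stand in for the add-with-carry logic a kernel has to implement.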
Later I will try to compile OpenMM 1.0 beta with OpenCL support and bench it (the molecular code kernel of Folding@Home).
Note the binaries provided are CUDA capable only: https://simtk.org/home/openmm
Note I'm sharing the latest things I'm aware of.
Wednesday, 25 November 2009