Linear algebra status update:
Matmul:
CUDA: CUBLAS (no source code), Volkov (source code), and yesterday's post (assembly, fastest to date at 480 GFLOPS)
CAL: the Beyond3D post on a 1 TFLOP CAL matmul
OCL: the hazeman post above uses a port of the CAL code to proprietary code that is similar to OpenCL.
DC: BernacleJunior's testing with doubles did not work (XNA forums)
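For comparing the matmul numbers above: throughput for a dense n x n matmul is conventionally computed by counting 2*n^3 floating-point operations. A minimal sketch, assuming square matrices and that convention; the function name is hypothetical and I don't know exactly how each of the posts above measured its figure:

```python
def matmul_gflops(n, seconds):
    """GFLOPS for an n x n by n x n matrix multiply.

    Assumes the usual 2*n^3 flop count (one multiply and one add per
    inner-product term); the cited posts may count differently.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2.0 * n ** 3
    return flops / seconds / 1e9
```

For example, matmul_gflops(1000, 1.0) gives 2.0, since a 1000 x 1000 multiply is 2e9 flops.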
Matvec:
CUDA: CUBLAS (closed source); some papers use custom code (MAGMA, a mid-2008 paper) reported as 20-50% faster
OCL: the Bealto post above (highly efficient on AMD and ATI); should be easy to port to DC
Sparse matvec:
CUDA: CNC, CUSP, etc.
OCL, DC: BernacleJunior's posts on the AMD and XNA forums (work in progress).
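As a reference point for what these sparse matvec codes compute, here is a minimal serial sketch. It assumes the CSR (compressed sparse row) storage format, which is only one common choice; the libraries and posts above may use other formats, and the function and argument names are hypothetical:

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] is the slice of col_idx/values holding row i.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Walk only the stored (nonzero) entries of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```

A GPU version typically hands each row (or a group of rows) to a thread or group of threads; the serial loop above is just useful for checking results.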
FFT:
CUDA: CUFFT; two papers at SC08 report higher-performance 3D FFTs, plus a 2D paper -> d3dCx
DC: has a library
OCL:
Apple's code seems to be 2x-3x slower than CUFFT (on Nvidia, Linux); it is also slow on 10.6.2, so wait and see with 10.6.3.
On AMD it doesn't work for sizes > 512^2 in 2.0 or 2.01; it seems to be fixed internally.
The AMD 2.01 sample is hard-coded to 1024. Performance?
Sort:
CUDA: CUDPP, CUDA sample (code)
OCL, DC: BernacleJunior's posts on the AMD and XNA forums. He claims nearly 400 Mkeys/s, versus less than 200 Mkeys/s for state-of-the-art Nvidia sorting on a GTX 285.
Also, Lee Howes reportedly has fast code working!
CUDPP also has triangular solvers, with graph algorithms and hashes coming soon.
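For readers unfamiliar with how key sorting is usually done on GPUs: fast GPU sorts of this period are generally radix sorts, so here is a minimal serial sketch of a least-significant-digit radix sort on unsigned 32-bit keys. This is an assumption for illustration; I don't know which algorithm the posts above actually use, and the function name and the 8-bit digit size are hypothetical choices:

```python
def radix_sort_u32(keys, bits_per_pass=8):
    """LSD radix sort for unsigned 32-bit integer keys (serial sketch)."""
    mask = (1 << bits_per_pass) - 1
    out = list(keys)
    for shift in range(0, 32, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * (mask + 1)
        for k in out:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into output offsets.
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        # Stable scatter of keys into their digit buckets.
        tmp = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            tmp[counts[d]] = k
            counts[d] += 1
        out = tmp
    return out
```

The histogram, prefix sum, and scatter steps are the parts a GPU implementation would parallelize; a Mkeys/s figure is then just the number of keys divided by the sort time in microseconds.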
Friday, 19 February 2010
Parallel algorithms available on CUDA, OCL, DC, CAL: status update
Posted on 08:37 by Unknown