Testing LDS perf in OpenCL! ~ GPU computing Stay up to date in OpenCL, DirectCompute, CUDA, CAL and OpenGL information

Found on AMD forums:

I ran this kernel:

Code:
__kernel void ldsReadBandwidth(__global float4 *output)
{
    // The LDS contents are never initialized; only the read traffic matters here.
    __local float4 lds[128];
    float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    val = val + lds[0];
    val = val + lds[1];
    val = val + lds[2];
    val = val + lds[3];
    //......
    val = val + lds[127];
    // Store the sum so the reads feed a visible result.
    output[get_global_id(0)] = val;
}
Then I subtracted the execution time of the following kernel from that of the one above, to cancel out the cost of the adds and of the write to global memory:

Code:
__kernel void ldsReadBandwidth(__global float4 *output)
{
    // Baseline: the same adds and the same global write, but no LDS reads.
    float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    val = val + val;
    val = val + val;
    val = val + val;
    val = val + val;
    //......
    val = val + val;
    output[get_global_id(0)] = val;
}
Result: I am getting 540 GB/s on a Radeon 5770.
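The post does not show how the two timings were turned into a bandwidth figure. A minimal sketch of one plausible way, assuming every work-item performs 128 float4 reads of 16 bytes each (as in the first kernel) and that both kernels were launched with the same number of work-items, might look like this. The function names and the timings passed in are hypothetical, not taken from the original post.

```c
#include <stddef.h>

/* Bytes read from LDS in one launch, assuming each work-item issues
   reads_per_item float4 loads of 16 bytes each. */
double lds_bytes_read(size_t work_items, size_t reads_per_item)
{
    return (double)work_items * (double)reads_per_item * 16.0;
}

/* Effective LDS read bandwidth in GB/s (1 GB = 1e9 bytes).
   t_lds_s:      run time of the kernel that reads LDS, in seconds
   t_baseline_s: run time of the adds-only kernel, in seconds
   Returns -1.0 when the time difference is not positive, since the
   subtraction then says nothing useful about the reads. */
double lds_bandwidth_gbs(double bytes, double t_lds_s, double t_baseline_s)
{
    double dt = t_lds_s - t_baseline_s;
    if (dt <= 0.0) {
        return -1.0;
    }
    return bytes / dt / 1e9;
}
```

With this sketch, a launch of 1024 work-items accounts for 1024 * 128 * 16 = 2,097,152 bytes of LDS reads; the bandwidth is that byte count divided by the difference between the two measured run times.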

That's a bit disappointing, as it works out to only 1 float per Vec5 unit per clock. In ATI's counter-Fermi presentation they stated that the 5870 could access LDS at 960 floats per clock, i.e. 3 per Vec5 unit per clock.
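The per-clock figures can be checked with a little arithmetic. The sketch below assumes hardware numbers that are not in the post itself: an 850 MHz engine clock and 160 Vec5 units (800 stream processors) for the Radeon 5770, and 320 Vec5 units (1600 stream processors) for the 5870. The function names are hypothetical.

```c
/* Floats delivered per clock cycle at a given bandwidth.
   bandwidth_gbs: bandwidth in GB/s (1 GB = 1e9 bytes)
   clock_ghz:     engine clock in GHz (assumed value, not from the post) */
double floats_per_clock(double bandwidth_gbs, double clock_ghz)
{
    /* GB/s divided by 4 bytes per float gives Gfloat/s;
       dividing by GHz gives floats per cycle. */
    return bandwidth_gbs / 4.0 / clock_ghz;
}

/* Share of that rate for each Vec5 unit.
   vec5_units: number of Vec5 units on the chip (assumed value) */
double floats_per_vec5_per_clock(double floats_per_clk, int vec5_units)
{
    return floats_per_clk / (double)vec5_units;
}
```

Under those assumptions, floats_per_clock(540.0, 0.85) is about 158.8 floats per clock, and spread over 160 Vec5 units that is about 0.99, which matches the "1 float per Vec5 unit per clock" estimate. For the quoted 5870 figure, floats_per_vec5_per_clock(960.0, 320) gives exactly 3.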

A question: does the performance stay the same if you add writes to the LDS?

Thursday, 26 November 2009