Found on the AMD forums:
I ran this kernel:
Code:
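__kernel void ldsReadBandwidth(__global float4 *output)
{
    // Local (LDS) buffer; only the reads are timed, the contents are never initialised.
    __local float4 lds[128];
    float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    // Unrolled float4 reads from LDS, each accumulated into val.
    val = val + lds[0];
    val = val + lds[1];
    val = val + lds[2];
    val = val + lds[3];
    //......
    val = val + lds[127];
    // Store the sum so the reads feed a visible result.
    output[get_global_id(0)] = val;
}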
and subtracted the following kernel's execution time from the one above, to eliminate the cost of the adds and of the writes to global memory:
Code:
__kernel void ldsReadBandwidth(__global float4 *output)
{
    // Baseline: the adds and the global write, without any LDS reads.
    float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    val = val + val;
    val = val + val;
    val = val + val;
    val = val + val;
    //......
    val = val + val;
    output[get_global_id(0)] = val;
}
Result: I am getting 540 GB/s on a Radeon 5770.
That's a bit disappointing, as it's only 1 float per Vec5 unit per clock. In ATI's counter-Fermi presentation they stated that the 5870 could access LDS at 960 floats per clock, i.e. 3 per Vec5 unit per clock.
A question: does the performance stay the same if you add writes to the LDS?
Thursday, 26 November 2009