This is a good question!
Unfortunately, on current hardware the grid is only 2D, which makes it tricky to calculate 3D indices. To make things worse, integer divide and modulo (which you would normally use to calculate your own 3D indices) are very expensive on current GPUs.
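To make the divide-and-modulo idea concrete, here is a minimal host-side sketch in plain C++ (which nvcc also accepts) of folding a y and z block index into a single grid dimension and unfolding it again. The names `BlockYZ`, `packBlockIndex`, and `unpackBlockIndex` are made up for illustration; they are not part of CUDA.

```cpp
#include <cassert>

// Hypothetical helper type, for illustration only (not CUDA API).
struct BlockYZ {
    unsigned int y;  // block index along y
    unsigned int z;  // block index along z
};

// Fold (y, z) into the single value that would end up in blockIdx.y.
inline unsigned int packBlockIndex(BlockYZ b, unsigned int blocksInY) {
    assert(blocksInY > 0 && b.y < blocksInY);
    return b.z * blocksInY + b.y;
}

// Recover (y, z) with the integer divide and modulo that are slow on the GPU.
inline BlockYZ unpackBlockIndex(unsigned int packed, unsigned int blocksInY) {
    assert(blocksInY > 0);
    BlockYZ b;
    b.z = packed / blocksInY;
    b.y = packed % blocksInY;
    return b;
}
```

With 13 blocks in y, for example, block (y=3, z=2) is stored as 2*13 + 3 = 29, and unpacking 29 gives back (3, 2).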
The best solution I've seen is this code (credit to Jonathan Cohen, hopefully he doesn't mind me posting it here).
The setup code is:
CODE
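__host__ void launchThreads(int nx, int ny, int nz) {
    // Threads per block in each dimension (16*4*4 = 256 threads per block).
    int threadsInX = 16;
    int threadsInY = 4;
    int threadsInZ = 4;
    // Round up so that partial blocks at the edges are still covered.
    int blocksInX = (nx+threadsInX-1)/threadsInX;
    int blocksInY = (ny+threadsInY-1)/threadsInY;
    int blocksInZ = (nz+threadsInZ-1)/threadsInZ;
    // The grid is only 2D, so fold the z blocks into the y dimension.
    dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);
    dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);
    // Pass blocksInY and its reciprocal so the kernel can unfold blockIdx.y.
    callKernel<<<Dg, Db>>>(..., blocksInY, 1.0f/(float)blocksInY);
}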
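As a quick sanity check of the setup arithmetic, here is a small host-only sketch that computes the same folded grid size and tests it against a grid limit. `ceilDiv`, `PackedGrid`, `packedGridFor`, and `fitsInGrid` are hypothetical names, and the 65535 limit is my assumption for hardware of this generation; query `maxGridSize` via `cudaGetDeviceProperties` for the real value on your device.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements.
inline int ceilDiv(int n, int threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical container for the folded 2D grid.
struct PackedGrid {
    int x;          // grid size in x
    int y;          // grid size in y (blocksInY * blocksInZ)
    int blocksInY;  // needed by the kernel to unfold blockIdx.y
    int blocksInZ;
};

// Same arithmetic as launchThreads, with the same 16x4x4 block shape.
inline PackedGrid packedGridFor(int nx, int ny, int nz) {
    const int threadsInX = 16, threadsInY = 4, threadsInZ = 4;
    PackedGrid g;
    g.x = ceilDiv(nx, threadsInX);
    g.blocksInY = ceilDiv(ny, threadsInY);
    g.blocksInZ = ceilDiv(nz, threadsInZ);
    g.y = g.blocksInY * g.blocksInZ;
    return g;
}

// Assumed per-dimension grid limit; check the device properties instead.
const int kAssumedMaxGridDim = 65535;

inline bool fitsInGrid(const PackedGrid& g) {
    return g.x <= kAssumedMaxGridDim && g.y <= kAssumedMaxGridDim;
}
```

For a 100x50x30 array this gives a 7 x 104 grid (13 blocks in y times 8 in z). The folded y dimension grows quickly: under the assumed limit a 512^3 array still fits (32 x 16384), but 2048^3 does not (512*512 = 262144 blocks in y).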
And the kernel code looks like this:
CODE
__global__ void callKernel(..., unsigned int blocksInY, float invBlocksInY)
{
    // Unfold blockIdx.y into separate y and z block indices without an integer divide.
    unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
    unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz,blocksInY);
    unsigned int i = __umul24(blockIdx.x,blockDim.x) + threadIdx.x;
    unsigned int j = __umul24(blockIdxy,blockDim.y) + threadIdx.y;
    unsigned int k = __umul24(blockIdxz,blockDim.z) + threadIdx.z;
    // use i,j,k ... (check them against the array extents first, since the block counts are rounded up)
}
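One caveat worth testing before relying on the reciprocal trick: in IEEE single precision, `y * (1.0f / b)` can round to just below a whole number when `y` is an exact multiple of `b`, so the floor comes out one too small. For example, `41 * (1.0f / 41)` rounds to 0.99999994, which floors to 0 instead of 1. Below is a host-side sketch that emulates the kernel's split with `std::floor` standing in for `__float2uint_rd`, compares it with integer divide and modulo, and adds a compare-and-adjust step. `SplitIndex`, `splitWithReciprocal`, `splitWithFixup`, and `countMismatches` are hypothetical names, and the fix-up is my own suggestion, not part of the posted code. It assumes the estimate is off by at most one, which should hold for grid sizes up to 65535.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type, for illustration only.
struct SplitIndex {
    unsigned int y;
    unsigned int z;
};

// Host emulation of the kernel's split: floor(packedY * (1/blocksInY)).
inline SplitIndex splitWithReciprocal(unsigned int packedY, unsigned int blocksInY) {
    const float invBlocksInY = 1.0f / (float)blocksInY;
    const float product = (float)packedY * invBlocksInY;  // rounded to float
    SplitIndex s;
    s.z = (unsigned int)std::floor(product);  // stands in for __float2uint_rd
    s.y = packedY - s.z * blocksInY;
    return s;
}

// Same estimate, then nudge z by one if the remainder is out of range.
inline SplitIndex splitWithFixup(unsigned int packedY, unsigned int blocksInY) {
    unsigned int z = splitWithReciprocal(packedY, blocksInY).z;
    if (z * blocksInY > packedY) {
        --z;  // estimate was one too high
    } else if (packedY - z * blocksInY >= blocksInY) {
        ++z;  // estimate was one too low
    }
    SplitIndex s;
    s.z = z;
    s.y = packedY - z * blocksInY;
    return s;
}

// Count packed indices whose split disagrees with integer divide and modulo.
inline unsigned int countMismatches(unsigned int blocksInY, unsigned int blocksInZ,
                                    bool withFixup) {
    unsigned int mismatches = 0;
    for (unsigned int packedY = 0; packedY < blocksInY * blocksInZ; ++packedY) {
        const SplitIndex s = withFixup ? splitWithFixup(packedY, blocksInY)
                                       : splitWithReciprocal(packedY, blocksInY);
        if (s.z != packedY / blocksInY || s.y != packedY % blocksInY) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

If `countMismatches(blocksInY, blocksInZ, false)` is nonzero for your grid, the same adjustment can be done in the kernel with a compare and an add, which avoids the integer divide.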
We should really have a sample of 3D array processing in the SDK. Anybody want to contribute one?