======
For Ps3 and Cell
*OpenCL 1.0 without extensions (32bit only).
*Includes OpenCL guide.
*Includes samples (julia,
*Includes a program to build kernel binaries.
Devices:
=====
- Cell PPU (CL_DEVICE_TYPE_CPU):Full profile (Power/VMX CPU)
- CPU device global and local memory both map to system memory
- FP denorms and fmad ieee2008 support
- # of compute units on a SPU accelerator device is <=16.
- Local memory size <= 256KB.
- FP denorms,FP rtz,64-bit long/ulong,and fmad ieee2008 support
- Better access each thread float4 for efficient DMA and __local and SIMD instructions (multiple also work float16)
Any external usage of SPUs will result in a CL_DEVICE_NOT_AVAILABLE failure at context creation.
Both devices support:
- Device and native kernel execution
- Compilers are available
- Out-of-order command execution
- Using built binaries without compiling
Native kernel execution takes place on the CPU device for both supported OpenCL device types
i.e. in PPU for Cell.
Unsupported:
*OpenCL image objects
*OpenCL sampler objects
*Halfs and doubles (no ext)
* No atomics
* No byte addresable
*No OpenGL interop of course
Workgroups:
*3D
*Overall 256 work-items per workgroup
The global memory size and maximum memory allocation size of both device types is dependent on the overall size and availability of system memory.
In addition to any vector code within an application's kernels, the OpenCL compiler may auto-vectorize code within a work-item, and across work-items within a work-group.
OpenCL on cell
==========
Workloads that have simple control logic or high bandwidth regular access patterns map well to SPUs. These workloads map well to NDRanges because their access patterns are well defined. In addition to NDRanges, the OpenCL application programmer can create an OpenCL task that implements a double-buffering scheme by managing local store directly with async_work_group_copy() and __local variables. The SPU's ability to efficiently execute the OpenCL task and data parallel programming models allows it to run a wide range of workloads.
On CBEA systems, OpenCL applications should maximize the amount of work done on the accelerator device. Each SPU is a compute unit in the OpenCL accelerator device. The OpenCL runtime will schedule work-groups across and execute work items on the SPUs. The section Command Queues describes how best to structure an application for maximum concurrency.
SPUs
====
Each SPU has 256K of local storage that will be divided among the OpenCL kernel runtime, OpenCL program's text, __local variables and __private variables. An OpenCL program may contain one or more OpenCL kernels that share local storage. Kernels that require large amounts of local storage for __local and __private variables may have to reduce their work-group size because of the lack of local storage. These kernels should, instead, be separated into their own program so other kernels' resources do not limit their work-group size. However, kernels that do not require large amounts of local storage should be grouped together into the same program. OpenCL applications that group kernels together into a program may avoid unnecessary context-switching because all kernels in a program are loaded together.
Proper management of data flow into and out of the SPU is crucial to maximize performance. This includes managing local storage effectively by staging data whenever possible. The OpenCL runtime utilizes a software data cache that caches accesses to __global memory in local storage. When possible, it is preferable for a kernel to aggregate all loads for a work-group into a single async_work_group_copy() to a __local variable. This will improve performance by grouping all of the work-items load latencies into one common load or store. The load will also be larger, making more efficient use of the DMA engine. If all accesses to __global memory are issued with async_work_group_copy() instead of direct access through the __global pointer, the software data cache will not be included, saving ~80KB of local storage.
An OpenCL task may also implement a double-buffering scheme. Two or more __local variables can be used as buffers to stage data. The kernel can then initiate an async_work_group_copy() into one buffer, then compute the results on the second buffer. async_work_group_copy() will use the SPU's DMA engine to copy data while the SPU's vector engine is free to operate on the second buffer. Double-buffering maximizes performance by keeping the compute engine busy by eliminating the need to wait on data transfers.
The OpenCL application should use __global memory buffers whose type's size is a multiple of a quad-word (16 bytes). For example, a kernel that operates sequentially on an array of floats should instead aggregate four floats together and operate on a float4 vector. This will allow the OpenCL compiler to map vector operations to the SPU's native vector types, and optimize its use of the SPU's DMA engine by eliminating alignment checks. Code that uses large vectors (that are a multiple of a quad-word, e.g. float16) is easier to read than hand unrolling loops. The large vectors will be automatically unrolled by the compiler to operate efficiently on the SPU's vector engines.
If an application does not require strict IEEE mathematical compliance, the OpenCL kernels can be built with the -cl-fast-relaxed-math compile option. This will allow the compiler to include performance optimizing, code transformations like:
• Floating-point conditionals may be transformed such that strict compare ordering in the presense of NaNs may not be preserved.
• Floating point divides may be transformed into a reciprocal-multiply.
• Software support of infinites and NaNs is omitted for half_divide and half_recip built-ins.
CPU
----
On Power processors, the same hardware cache is used for __private, __local, __constant and __global variables. It is counterproductive to create __local variables to stage __global memory. Instead, the application should take special care to layout __global memory so that the data for work-items in a work-group is cache friendly
In CPU async_work_group_copy() host copy avoid and prefetch() not implemented
CURRENT RESTRICTIONS
==================
The -cl-single-precision-constant and -cl-denorms-are-zero compilation options are ignored.
- To use work-group sizes greater than 1, the reqd_work_group_size kernel attribute qualifier must be specified. Increasing the work-group size generally increases performance. work-group sizes greater than 1+ (-cl-opt-disable)->error
- No half datatype
OpenCL Build Program utility
===================
This utility will build a kernel binary from a user specified source, using the
OpenCL APIs. A binary version of this utility is shipped in the OpenCL runtime
rpm, and installed in as /usr/bin/opencl_build_program.
The utility will read in the source file (either from a filename or from
stdin), get the list of Device IDs that match the request (default is
CL_DEVICE_TYPE_DEFAULT), create an OpenCL Context for that device, create an
OpenCL Program from the source specified, call to build the program for the
devices, get the built binary data and save it to a file.
Usage: ./opencl_build_program [DEVICE] [OPTIONS...] [FILE]
Build OpenCL kernel binary for specified device type from the specified FILE.
With no FILE, or when FILE is -, source will be read from stdard input.
Device Types: (only specify one)
-a, --accel build for CL_DEVICE_TYPE_ACCELERATOR
-c, --cpu build for CL_DEVICE_TYPE_CPU
-A, --all build for CL_DEVICE_TYPE_ALL
-d, --default build for CL_DEVICE_TYPE_DEFAULT (default)
Options:
-f, --flags "options" OpenCL Buld option flags (default: none)
-o, --output
-q, --quiet no output (default: not quiet)
-h, --help display usage information and exit
$ cat kernel.cl
__kernel void function(__global int *a)
{
int gid = get_global_id(0);
a[gid] = 1;
}
$ ./opencl_build_program --cpu kernel.cl
Binary built from source file "kernel.cl" for device "CPU Cell Broadband Engine, altivec supported" saved as file "kernel_CPU_Cell_Broadband_Engine,_altivec_supported.ocl"
ACCELERATOR CellBE processor
1
spu-xlcl kernel.c -o kernel.ocl -qcpluscmt -ma -qea32 -qnocrt -Wl,-entry=0 -Wl,-N -u _plugin_init
#
ACCELERATOR PowerXCell8i processor
1
spu-xlcl kernel.c -o kernel.ocl -qcpluscmt -ma -qea32 -qnocrt -Wl,-entry=0 -Wl,-N -u _plugin_init -qarch=edp -qtune=edp
#
static inline const char *GetErrorString(int errcode) {
switch (errcode) {
case CL_SUCCESS:
return "CL_SUCCESS";
case CL_DEVICE_NOT_FOUND:
return "CL_DEVICE_NOT_FOUND";
case CL_DEVICE_NOT_AVAILABLE:
return "CL_DEVICE_NOT_AVAILABLE";
case CL_COMPILER_NOT_AVAILABLE:
return "CL_COMPILER_NOT_AVAILABLE";
case CL_MEM_OBJECT_ALLOCATION_FAILURE:
return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
case CL_OUT_OF_RESOURCES:
return "CL_OUT_OF_RESOURCES";
case CL_OUT_OF_HOST_MEMORY:
return "CL_OUT_OF_HOST_MEMORY";
case CL_PROFILING_INFO_NOT_AVAILABLE:
return "CL_PROFILING_INFO_NOT_AVAILABLE";
case CL_MEM_COPY_OVERLAP:
return "CL_MEM_COPY_OVERLAP";
case CL_IMAGE_FORMAT_MISMATCH:
return "CL_IMAGE_FORMAT_MISMATCH";
case CL_IMAGE_FORMAT_NOT_SUPPORTED:
return "CL_IMAGE_FORMAT_NOT_SUPPORTED";
case CL_BUILD_PROGRAM_FAILURE:
return "CL_BUILD_PROGRAM_FAILURE";
case CL_INVALID_VALUE:
return "CL_INVALID_VALUE";
case CL_INVALID_DEVICE_TYPE:
return "CL_INVALID_DEVICE_TYPE";
case CL_INVALID_PLATFORM:
return "CL_INVALID_PLATFORM";
case CL_INVALID_DEVICE:
return "CL_INVALID_DEVICE";
case CL_INVALID_CONTEXT:
return "CL_INVALID_CONTEXT";
case CL_INVALID_QUEUE_PROPERTIES:
return "CL_INVALID_QUEUE_PROPERTIES";
case CL_INVALID_COMMAND_QUEUE:
return "CL_INVALID_COMMAND_QUEUE";
case CL_INVALID_HOST_PTR:
return "CL_INVALID_HOST_PTR";
case CL_INVALID_MEM_OBJECT:
return "CL_INVALID_MEM_OBJECT";
case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:
return "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR";
case CL_INVALID_IMAGE_SIZE:
return "CL_INVALID_IMAGE_SIZE";
case CL_INVALID_SAMPLER:
return "CL_INVALID_SAMPLER";
case CL_INVALID_BINARY:
return "CL_INVALID_BINARY";
case CL_INVALID_BUILD_OPTIONS:
return "CL_INVALID_BUILD_OPTIONS";
case CL_INVALID_PROGRAM:
return "CL_INVALID_PROGRAM";
case CL_INVALID_PROGRAM_EXECUTABLE:
return "CL_INVALID_PROGRAM_EXECUTABLE";
case CL_INVALID_KERNEL_NAME:
return "CL_INVALID_KERNEL_NAME";
case CL_INVALID_KERNEL_DEFINITION:
return "CL_INVALID_KERNEL_DEFINITION";
case CL_INVALID_KERNEL:
return "CL_INVALID_KERNEL";
case CL_INVALID_ARG_INDEX:
return "CL_INVALID_ARG_INDEX";
case CL_INVALID_ARG_VALUE:
return "CL_INVALID_ARG_VALUE";
case CL_INVALID_ARG_SIZE:
return "CL_INVALID_ARG_SIZE";
case CL_INVALID_KERNEL_ARGS:
return "CL_INVALID_KERNEL_ARGS";
case CL_INVALID_WORK_DIMENSION:
return "CL_INVALID_WORK_DIMENSION";
case CL_INVALID_WORK_GROUP_SIZE:
return "CL_INVALID_WORK_GROUP_SIZE";
case CL_INVALID_WORK_ITEM_SIZE:
return "CL_INVALID_WORK_ITEM_SIZE";
case CL_INVALID_GLOBAL_OFFSET:
return "CL_INVALID_GLOBAL_OFFSET";
case CL_INVALID_EVENT_WAIT_LIST:
return "CL_INVALID_EVENT_WAIT_LIST";
case CL_INVALID_EVENT:
return "CL_INVALID_EVENT";
case CL_INVALID_OPERATION:
return "CL_INVALID_OPERATION";
case CL_INVALID_GL_OBJECT:
return "CL_INVALID_GL_OBJECT";
case CL_INVALID_BUFFER_SIZE:
return "CL_INVALID_BUFFER_SIZE";
case CL_INVALID_MIP_LEVEL:
return "CL_INVALID_MIP_LEVEL";
case CL_INVALID_GLOBAL_WORK_SIZE:
return "CL_INVALID_GLOBAL_WORK_SIZE";
default:
return "Unknown";
};
}
0 comments:
Post a Comment