Kernels for performing the Scan operation.
Functions
| kernel void | inclusiveScan_f (global float4 *in, global float4 *out, local float *data, global float *sums, uint n, float scaling) |
| | Performs an inclusive scan operation on the columns of an array. |
| kernel void | addGroupSums_f (global float *sums, global float4 *out, uint n) |
| | Adds the group sums in the associated blocks. |
kernel void addGroupSums_f (global float *sums, global float4 *out, uint n)
Adds the group sums in the associated blocks.
It's the second part of the Blelloch scan algorithm.
The scan kernel handles 2 float4 elements per work-item, whereas addGroupSums handles 1 float4 element per work-item. A work-group of \( 2*lXdim_{scan} \) work-items therefore covers the same block of a row as one scan work-group.
The global workspace should be \( 2*(wgXdim-1)*lXdim_{scan} \) in the x dimension, and \( M \) in the y dimension.
The global workspace should also have an offset \( 2*lXdim_{scan} \) in the x dimension, which skips the first block of each row.
The local workspace should be \( 2*lXdim_{scan} \) in the x dimension, and 1 in the y dimension.
| [in] | sums | (scan) array of work-group sums. Its size is \( M \times wgXdim \). |
| [out] | out | (scan) output array of float elements (before processing, it contains the block scans performed in a previous step). |
| [in] | n | the number of elements in a row of the array divided by 4. |
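The workspace sizes listed above are plain arithmetic on \( wgXdim \), \( lXdim_{scan} \) and \( M \), so they can be worked out on the host before the kernel is enqueued. The following is a minimal C sketch of that arithmetic only. The type `AddGroupSumsRange` and the function `add_group_sums_range` are hypothetical names introduced for illustration and are not part of the documented kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of the documented API):
 * the 2-D workspace for addGroupSums_f, in work-items. */
typedef struct {
    size_t global[2];  /* global workspace: x, y */
    size_t offset[2];  /* global offset:    x, y */
    size_t local[2];   /* local workspace:  x, y */
} AddGroupSumsRange;

/* wgXdim     : number of scan work-groups per row
 * lXdim_scan : x dimension of the scan kernel's local workspace
 * M          : number of rows in the array */
static AddGroupSumsRange add_group_sums_range(size_t wgXdim,
                                              size_t lXdim_scan,
                                              size_t M)
{
    assert(wgXdim >= 1 && lXdim_scan >= 1 && M >= 1);

    AddGroupSumsRange r;
    r.global[0] = 2 * (wgXdim - 1) * lXdim_scan;  /* all blocks but the first */
    r.global[1] = M;                              /* one row per y index */
    r.offset[0] = 2 * lXdim_scan;                 /* start at the second block */
    r.offset[1] = 0;
    r.local[0]  = 2 * lXdim_scan;                 /* one block per work-group */
    r.local[1]  = 1;
    return r;
}
```

The three arrays are laid out in the order an enqueue call such as `clEnqueueNDRangeKernel` expects for its global offset, global size and local size arguments. With a single scan work-group per row (\( wgXdim = 1 \)) the x extent evaluates to 0, meaning no block is left to correct. Skipping the launch in that case is an assumption of this sketch, not something the documentation states.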
kernel void inclusiveScan_f (global float4 *in, global float4 *out, local float *data, global float *sums, uint n, float scaling)
Performs an inclusive scan operation on the columns of an array.
The parallel scan algorithm by Blelloch is implemented.
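As a point of reference, the result this kernel is documented to produce for one row can be sketched sequentially on the CPU: the elements are scaled, each block is scanned independently, and each block's total is written to the sums array. This is a minimal sketch and not the kernel's parallel implementation. The function name `block_inclusive_scan_row` and the parameter `block` (the number of float elements covered by one work-group) are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for ONE row (hypothetical helper, not the kernel):
 *   in      : N input floats
 *   out     : N output floats, inclusive scan restarted at every block
 *   sums    : one float per block, the total of that block
 *   block   : floats per block (assumed: the span of one work-group)
 *   scaling : factor applied to every element before it is accumulated */
static void block_inclusive_scan_row(const float *in, float *out, float *sums,
                                     size_t N, size_t block, float scaling)
{
    assert(block > 0);

    size_t g = 0;  /* index of the current block */
    for (size_t start = 0; start < N; start += block, ++g) {
        size_t end = (start + block < N) ? start + block : N;
        float acc = 0.0f;
        for (size_t i = start; i < end; ++i) {
            acc += scaling * in[i];  /* scale, then accumulate */
            out[i] = acc;            /* inclusive: includes element i */
        }
        sums[g] = acc;               /* block total */
    }
}
```

A parallel scan adds the elements in a different order than this loop, so results on real data may differ from it by floating-point rounding.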
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as float4).
The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 float (= 2 float4) elements in a row of the array.
The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \).
The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
0, in the sums array, since in the next phase the sums array is going to be handled as float4.
| [in] | in | input array of float elements. |
| [out] | out | (scan per work-group) output array of float elements. |
| [in] | data | local buffer. Its size should be 2 float elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof(float) \). |
| [out] | sums | array of block sums. Each work-group outputs the sum of its elements. Its size should be \( M \times wgXdim \). |
| [in] | n | the number of elements in a row of the array divided by 4. |
| [in] | scaling | factor by which to scale the array elements before processing. |
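The size requirements above can be collected into a small host-side calculation. The following is a minimal C sketch under stated assumptions: the type `ScanLaunch` and the function `inclusive_scan_launch` are hypothetical names, and rounding \( gXdim \) up to a whole number of work-groups is an assumption made so that the global size is divisible by the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of the documented API). */
typedef struct {
    unsigned n;         /* kernel argument n = N / 4 */
    size_t global[2];   /* global workspace: gXdim, gYdim */
    size_t local[2];    /* local workspace:  lXdim, 1 */
    size_t wgXdim;      /* work-groups per row */
    size_t data_bytes;  /* size of the local buffer `data` */
    size_t sums_elems;  /* number of floats in `sums` (M x wgXdim) */
} ScanLaunch;

/* N: elements per row, M: rows, lXdim: local size in x (power of 2). */
static ScanLaunch inclusive_scan_launch(size_t N, size_t M, size_t lXdim)
{
    assert(N > 0 && N % 4 == 0);                      /* rows are read as float4 */
    assert(M > 0);
    assert(lXdim != 0 && (lXdim & (lXdim - 1)) == 0); /* power of 2 */

    ScanLaunch s;
    s.n = (unsigned)(N / 4);

    size_t items = (N + 7) / 8;                /* smallest gXdim with gXdim >= N/8 */
    s.wgXdim = (items + lXdim - 1) / lXdim;    /* assumption: whole work-groups */

    s.global[0] = s.wgXdim * lXdim;
    s.global[1] = M;
    s.local[0]  = lXdim;
    s.local[1]  = 1;

    s.data_bytes = 2 * lXdim * sizeof(float);  /* 2 floats per work-item */
    s.sums_elems = M * s.wgXdim;
    return s;
}
```

For example, a 3-row array with 1024 elements per row and a local size of 64 gives 128 work-items and 2 work-groups per row, and a sums array of 6 floats.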