OpenCL Sorting
Eric Bainville - June 2011Parallel selection, local
Let's now focus on sorting in local memory. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences. The following kernel is adapted from the ParallelSelection_Blocks kernel, and uses the same algorithm to sort each subsequence of length WG.
__kernel void ParallelSelection_Local(__global const data_t * in,__global data_t * out,__local data_t * aux) { int i = get_local_id(0); // index in workgroup int wg = get_local_size(0); // workgroup size = block size // Move IN, OUT to block start int offset = get_group_id(0) * wg; in += offset; out += offset; // Load block in AUX[WG] data_t iData = in[i]; aux[i] = iData; barrier(CLK_LOCAL_MEM_FENCE); // Find output position of iData uint iKey = getKey(iData); int pos = 0; for (int j=0;j<wg;j++) { uint jKey = getKey(aux[j]); bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); // in[j] < in[i] ? pos += (smaller)?1:0; } // Store output out[pos] = iData; }
ParallelSelection_Local | |
---|---|
WG | Top speed |
1 | 133 |
2 | 209 |
4 | 287 |
8 | 336 |
16 | 367 |
32 | 384 |
64 | 392 |
128 | 199 |
256 | 100 |
Each thread performs O(1) global memory accesses, and O(WG) local memory accesses. When WG doubles, the work quantity doubles too, meaning the processing rate is expected to be halved. This is verified for WG>32 here.
OpenCL Sorting : Parallel selection | Top of Page | OpenCL Sorting : Parallel bitonic, local |