OpenCL Sorting

Eric Bainville - June 2011

Parallel selection, local

Let's now focus on sorting in local memory. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences. The following kernel is adapted from the ParallelSelection_Blocks kernel, and uses the same algorithm to sort each subsequence of length WG.

__kernel void ParallelSelection_Local(__global const data_t * in,__global data_t * out,__local data_t * aux)
{
  int i = get_local_id(0); // index in workgroup
  int wg = get_local_size(0); // workgroup size = block size

  // Move IN, OUT to block start
  int offset = get_group_id(0) * wg;
  in += offset; out += offset;

  // Load block in AUX[WG]
  data_t iData = in[i];
  aux[i] = iData;
  barrier(CLK_LOCAL_MEM_FENCE);

  // Find output position of iData
  uint iKey = getKey(iData);
  int pos = 0;
  for (int j=0;j<wg;j++)
  {
    uint jKey = getKey(aux[j]);
    bool smaller = (jKey < iKey) || ( jKey == iKey && j < i ); // in[j] < in[i] ?
    pos += (smaller)?1:0;
  }

  // Store output
  out[pos] = iData;
}

ParallelSelection_Local
WG	Top speed
1	133
2	209
4	287
8	336
16	367
32	384
64	392
128	199
256	100

Performance of the ParallelSelection_Local kernel (Key+Value sort), Mkey/s.

Each thread performs O(1) global memory accesses, and O(WG) local memory accesses. When WG doubles, the work quantity doubles too, meaning the processing rate is expected to be halved. This is verified for WG>32 here.