OpenCL Sorting

Eric Bainville - June 2011

Parallel merge, local

Here again, we sort in local memory first. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences.

In merge sort, the main step of the algorithm is to merge together two sorted sequences. Suppose a0,a1,...,ap-1 and b0,b1,...,bp-1 are two sorted sequences. The position of ai in the output is i+f(b,ai), where f(seq,x) is the number of values in sequence seq that are less than x. Since b is sorted, this number can be obtained by dichotomic search in log2(p) iterations. The same goes for the position of bi in the output. This is implemented in the following kernel, each thread executes its own full dichotomic search in the sibling sequence:

__kernel void ParallelMerge_Local(__global const data_t * in,__global data_t * out,__local data_t * aux)
{
  int i = get_local_id(0); // index in workgroup
  int wg = get_local_size(0); // workgroup size = block size, power of 2

  // Move IN, OUT to block start
  int offset = get_group_id(0) * wg;
  in += offset; out += offset;

  // Load block in AUX[WG]
  aux[i] = in[i];
  barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date

  // Now we will merge sub-sequences of length 1,2,...,WG/2
  for (int length=1;length<wg;length<<=1)
  {
    data_t iData = aux[i];
    uint iKey = getKey(iData);
    int ii = i & (length-1);  // index in our sequence in 0..length-1
    int sibling = (i - ii) ^ length; // beginning of the sibling sequence
    int pos = 0;
    for (int inc=length;inc>0;inc>>=1) // increment for dichotomic search
    {
      int j = sibling+pos+inc-1;
      uint jKey = getKey(aux[j]);
      bool smaller = (jKey < iKey) || ( jKey == iKey && j < i );
      pos += (smaller)?inc:0;
      pos = min(pos,length);
    }
    int bits = 2*length-1; // mask for destination
    int dest = ((ii + pos) & bits) | (i & ~bits); // destination index in merged sequence
    barrier(CLK_LOCAL_MEM_FENCE);
    aux[dest] = iData;
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // Write output
  out[i] = aux[i];
}
ParallelMerge_Local
WGTop speed
1178
2150
4163
8203
16280
32406
64616
128486
256383
Performance of the ParallelMerge_Local kernel (Key+Value sort), Mkey/s.

The code is a little tricky to write correctly :-). Each thread performs O(1) global memory accesses, and O(log2(WG)) local memory accesses. We can observe this O(log2(WG)) behaviour for the largest values.