OpenCL Sorting

Eric Bainville - June 2011

Parallel merge, local

Here again, we sort in local memory first. We execute N threads in workgroups of WG threads, and each workgroup sorts a segment of WG input records. The output contains N/WG ordered sequences.

In merge sort, the main step of the algorithm is to merge together two sorted sequences. Suppose a₀,a₁,...,a_p-1 and b₀,b₁,...,b_p-1 are two sorted sequences. The position of a_i in the output is i+f(b,a_i), where f(seq,x) is the number of values in sequence seq that are less than x. Since b is sorted, this number can be obtained by dichotomic search in log₂(p) iterations. The same goes for the position of b_i in the output. This is implemented in the following kernel, each thread executes its own full dichotomic search in the sibling sequence:

__kernel void ParallelMerge_Local(__global const data_t * in,__global data_t * out,__local data_t * aux)
{
  int i = get_local_id(0); // index in workgroup
  int wg = get_local_size(0); // workgroup size = block size, power of 2

  // Move IN, OUT to block start
  int offset = get_group_id(0) * wg;
  in += offset; out += offset;

  // Load block in AUX[WG]
  aux[i] = in[i];
  barrier(CLK_LOCAL_MEM_FENCE); // make sure AUX is entirely up to date

  // Now we will merge sub-sequences of length 1,2,...,WG/2
  for (int length=1;length<wg;length<<=1)
  {
    data_t iData = aux[i];
    uint iKey = getKey(iData);
    int ii = i & (length-1);  // index in our sequence in 0..length-1
    int sibling = (i - ii) ^ length; // beginning of the sibling sequence
    int pos = 0;
    for (int inc=length;inc>0;inc>>=1) // increment for dichotomic search
    {
      int j = sibling+pos+inc-1;
      uint jKey = getKey(aux[j]);
      bool smaller = (jKey < iKey) || ( jKey == iKey && j < i );
      pos += (smaller)?inc:0;
      pos = min(pos,length);
    }
    int bits = 2*length-1; // mask for destination
    int dest = ((ii + pos) & bits) | (i & ~bits); // destination index in merged sequence
    barrier(CLK_LOCAL_MEM_FENCE);
    aux[dest] = iData;
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // Write output
  out[i] = aux[i];
}

ParallelMerge_Local
WG	Top speed
1	178
2	150
4	163
8	203
16	280
32	406
64	616
128	486
256	383

Performance of the ParallelMerge_Local kernel (Key+Value sort), Mkey/s.

The code is a little tricky to write correctly :-). Each thread performs O(1) global memory accesses, and O(log²(WG)) local memory accesses. We can observe this O(log²(WG)) behaviour for the largest values.