OpenCL Sorting

Eric Bainville - June 2011

Parallel Selection Sort

Basic implementation

Let's begin with a very simple algorithm. I could not find a better name for it than "parallel selection sort"; contact me if you have a better suggestion for the naming. We run N threads. Thread i iterates over the entire input vector to find the output position pos of value in[i], and finally writes in[i] into out[pos]:

__kernel void ParallelSelection(__global const data_t * in,__global data_t * out)
{
  int i = get_global_id(0); // current thread
  int n = get_global_size(0); // input size
  data_t iData = in[i];
  uint iKey = keyValue(iData);
  // Compute position of in[i] in output
  int pos = 0;
  for (int j=0;j<n;j++)
  {
    uint jKey = keyValue(in[j]); // broadcasted
    bool smaller = (jKey < iKey) || (jKey == iKey && j < i);  // in[j] < in[i] ?
    pos += (smaller)?1:0;
  }
  out[pos] = iData;
}
ParallelSelection

log2(N)   Key Only   Key+Value
      8       1.00        0.99
      9       1.23        1.22
     10       1.36        1.36
     11       1.44        1.44
     12       1.50        1.50
     13       1.48        1.49
     14       1.17        1.31
     15       0.68        0.72
     16       0.37        0.38
     17       0.22        0.22
     18       0.12        0.12
     19       0.06        0.06

Performance of the ParallelSelection kernel, Mkey/s.

This algorithm is obviously highly inefficient. Since the total I/O is 2*N + N^2 records, the effective memory throughput increases with N, reaching 127 GB/s for N = 2^19. Note that there is no difference between 32-bit and 64-bit records here.

Using local memory

Instead of having all threads read values from global memory, we could try to preload blocks of values into local memory and share them between all threads of a workgroup. In the following code, each thread loads BLOCK_FACTOR input records: each workgroup loads the input data in blocks of BLOCK_FACTOR * workgroup_size records.

__kernel void ParallelSelection_Blocks(__global const data_t * in,__global data_t * out,__local uint * aux)
{
  int i = get_global_id(0); // current thread
  int n = get_global_size(0); // input size
  int wg = get_local_size(0); // workgroup size
  data_t iData = in[i]; // input record for current thread
  uint iKey = keyValue(iData); // input key for current thread
  int blockSize = BLOCK_FACTOR * wg; // block size

  // Compute position of iKey in output
  int pos = 0;
  // Loop on blocks of blockSize keys (blockSize must divide N)
  for (int j=0;j<n;j+=blockSize)
  {
    // Load blockSize keys using all threads (BLOCK_FACTOR values per thread)
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int index=get_local_id(0);index<blockSize;index+=wg)
      aux[index] = keyValue(in[j+index]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Loop on all values in AUX
    for (int index=0;index<blockSize;index++)
    {
      uint jKey = aux[index]; // broadcasted, local memory
      bool smaller = (jKey < iKey) || ( jKey == iKey && (j+index) < i ); // in[j+index] < in[i] ?
      pos += (smaller)?1:0;
    }
  }
  out[pos] = iData;
}

Note that both barrier instructions are required: the first ensures all threads have finished the previous processing loop before a new block is loaded, and the second ensures the block is entirely loaded before the next processing loop starts.

ParallelSelection_Blocks

                    BLOCK_FACTOR value
log2(N)      1     2     4     8    16    32
      8   1.80
      9   2.66  2.65
     10   3.41  3.42  3.34
     11   3.91  3.94  3.96  3.79
     12   4.24  4.28  4.27  4.16  4.15
     13   2.36  2.37  2.37  2.38  2.39  2.28
     14   1.61  1.61  1.60  1.59  1.62  1.55
     15   0.82  0.82  0.82  0.79  0.81  0.78
     16   0.45  0.45  0.45  0.45  0.45  0.43
     17   0.22  0.23  0.23  0.23  0.23  0.21
     18   0.12  0.12  0.12  0.12  0.12  0.11
     19   0.06  0.06  0.06  0.06  0.06  0.05

Performance of the ParallelSelection_Blocks kernel (Key+Value sort), Mkey/s.

Performance is a little better, peaking at 4.28 Mkey/s thanks to the higher speed of local memory, but we are still very far from a good sorting rate.