OpenCL Sorting
Eric Bainville - June 2011Parallel Selection Sort
Basic implementation
Let's begin with a very simple algorithm. I could not find a better name for it than "parallel selection sort"; contact me if you have a better suggestion for the naming. We run N threads. Thread i iterates on the entire input vector to find the output position pos of value ini. Finally, thread i and writes ini into outpos:
__kernel void ParallelSelection(__global const data_t * in,__global data_t * out) { int i = get_global_id(0); // current thread int n = get_global_size(0); // input size data_t iData = in[i]; uint iKey = keyValue(iData); // Compute position of in[i] in output int pos = 0; for (int j=0;j<n;j++) { uint jKey = keyValue(in[j]); // broadcasted bool smaller = (jKey < iKey) || (jKey == iKey && j < i); // in[j] < in[i] ? pos += (smaller)?1:0; } out[pos] = iData; }
ParallelSelection | ||
---|---|---|
log2(N) | Key Only | Key+Value |
8 | 1.00 | 0.99 |
9 | 1.23 | 1.22 |
10 | 1.36 | 1.36 |
11 | 1.44 | 1.44 |
12 | 1.50 | 1.50 |
13 | 1.48 | 1.49 |
14 | 1.17 | 1.31 |
15 | 0.68 | 0.72 |
16 | 0.37 | 0.38 |
17 | 0.22 | 0.22 |
18 | 0.12 | 0.12 |
19 | 0.06 | 0.06 |
This algorithm is obviously highly ineffective. Considering the total I/O is 2*N+N2 records, the total memory throughput increases, to reach 127 GB/s for N=219. Note that there is no difference in having 32-bit or 64-bit records here.
Using local memory
Instead of having all threads read values from global memory, we could try to preload blocks of values in local memory, and use them in all threads inside a workgroup. In the following code, we load BLOCK_FACTOR input records for each thread: each workgroup will load the input data by blocks of BLOCK_FACTOR * workgroup_size records.
__kernel void ParallelSelection_Blocks(__global const data_t * in,__global data_t * out,__local uint * aux) { int i = get_global_id(0); // current thread int n = get_global_size(0); // input size int wg = get_local_size(0); // workgroup size data_t iData = in[i]; // input record for current thread uint iKey = keyValue(iData); // input key for current thread int blockSize = BLOCK_FACTOR * wg; // block size // Compute position of iKey in output int pos = 0; // Loop on blocks of size BLOCKSIZE keys (BLOCKSIZE must divide N) for (int j=0;j<n;j+=blockSize) { // Load BLOCKSIZE keys using all threads (BLOCK_FACTOR values per thread) barrier(CLK_LOCAL_MEM_FENCE); for (int index=get_local_id(0);index<blockSize;index+=wg) aux[index] = keyValue(in[j+index]); barrier(CLK_LOCAL_MEM_FENCE); // Loop on all values in AUX for (int index=0;index<blockSize;index++) { uint jKey = aux[index]; // broadcasted, local memory bool smaller = (jKey < iKey) || ( jKey == iKey && (j+index) < i ); // in[j] < in[i] ? pos += (smaller)?1:0; } } out[pos] = iData; }
Note that both barrier instructions are required, the first one ensures all threads have finished the processing loop before loading a new block, and the second one ensures the block is entirely loaded before starting the next processing loop.
ParallelSelection_Blocks | ||||||
---|---|---|---|---|---|---|
BLOCK_FACTOR value | ||||||
log2(N) | 1 | 2 | 4 | 8 | 16 | 32 |
8 | 1.80 | |||||
9 | 2.66 | 2.65 | ||||
10 | 3.41 | 3.42 | 3.34 | |||
11 | 3.91 | 3.94 | 3.96 | 3.79 | ||
12 | 4.24 | 4.28 | 4.27 | 4.16 | 4.15 | |
13 | 2.36 | 2.37 | 2.37 | 2.38 | 2.39 | 2.28 |
14 | 1.61 | 1.61 | 1.60 | 1.59 | 1.62 | 1.55 |
15 | 0.82 | 0.82 | 0.82 | 0.79 | 0.81 | 0.78 |
16 | 0.45 | 0.45 | 0.45 | 0.45 | 0.45 | 0.43 |
17 | 0.22 | 0.23 | 0.23 | 0.23 | 0.23 | 0.21 |
18 | 0.12 | 0.12 | 0.12 | 0.12 | 0.12 | 0.11 |
19 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.05 |
Performance is a little better, and reaches 4.28 Mkey/s thanks to the higher speed of the local memory, but we still are very far from good.
OpenCL Sorting : Introduction | Top of Page | OpenCL Sorting : Parallel selection, local |