OpenCL Multiprecision

Eric Bainville - Jan 2010

Implementing the SSA in OpenCL, part II

One thread per butterfly word

The combine operation will be executed in parallel by N/2xP threads. We will run a 2D grid of threads. First dimension will be the butterfly index as before (size N/2), and second dimension will be word index in the output numbers (size P). The choice of the direction of the transform can easily be introduced in the kernel, as it only acts on the direction of the shifts. The modified kernel is the following:

// Compute N butterflies in parallel.
// N = get_global_size(0) is _half_ the row count of X and Y.
// P = get_global_size(1) is the size of each row.
// X[2*N*P] is the input buffer.
// Y[2*N*P] is the output buffer.
// K is the recursion level, from 0 to LOG2(N).
// DIR is the direction +1 = direct transform, -1 = inverse transform.
// Each butterly combines vectors I and I+(1<<STEP)
__kernel void combine_kernel(__global const int * x,__global int * y,int k,int dir)
{
  int i1 = get_global_id(0);
  int j1 = get_global_id(1);
  int p = get_global_size(1);

  // Insert a 0 bit in I1 at position STEP, to get the index of the first vector to combine
  int bit = (1<<k);
  int m = i1 & (bit-1); // M = lower STEP-1 bits of I
  i1 = ((i1-m)<<1)+m;
  int i2 = i1 | bit; // index of the second vector to combine
  // Here, I1 and I2 are the indices of the rows to combine,

  int shift = (p>>k)*m;
  int sign = (shift & p)?-1:1;
  shift &= p-1; // shift mod p, because p is a power of 2
  int j2 = j1 - dir * shift;
  if (j2<0) { j2 += p; sign = -sign; }
  if (j2>=p) { j2 -= p; sign = -sign; }
  // Here J1 and J2 are the input indices of the words to combine in these rows.
  // On output, both word indices are J1.

  int x1 = x[p*i1+j1];
  int x2 = x[p*i2+j2];
  y[p*i1+j1] = (sign>0)?(x1+x2):(x1-x2);
  y[p*i2+j1] = (sign<0)?(x1+x2):(x1-x2);
}

The other operations are implemented using the same 2D pattern of NxP threads, and work on numbers stored on P words modulo B^P+1:

brcopy does a k bit-reversal (only the k least-significant bits of the row indices are reversed) copy of N rows of P words,
normalize normalizes the redundant representation of N rows of P words,
shr shifts right by k bits N rows of P words.

mpz is GMP 5.0.0 running on Core 2 Q9600
GPU is NVidia GTX285, driver 195.30
fft is brcopy+shr+log2(N)*(combine+normalize)
times are in milliseconds, rounded to 2 significant digits

  N       P      mpz  brcopy  normalize  shr combine    fft
1024	 512      25    0.14    0.095   0.091   0.092   2.1
1024	1024      57    0.25    0.14    0.18    0.15    3.4
1024	2048     140    0.46    0.26    0.26    0.26    5.9
2048	1024     140    0.49    0.26    0.26    0.27    6.5
2048	2048     310    0.97    0.48    0.57    0.48   12
2048	4096     710    2.0     0.92    0.97    0.91   23
4096	2048     700    1.7     0.91    0.96    0.92   25
4096	4096    1600    3.2     1.8     1.9     1.8    48
4096    8192    3400    6.3     3.9     3.9     3.5    98

Since memory access is the limiting factor here, incorporating the brcopy into the first combine, and fusing combine and normalize would yeld a significant gain.

Multiplying the two large numbers represented by NxP words requires 2 direct FFT's, one reverse FFT, and the N products of numbers stored on P words. These products are actually much longer than the FFT.