# OpenCL Multiprecision

Eric Bainville - Jan 2010## Implementing the SSA in OpenCL, part II

### One thread per butterfly word

The combine operation will be executed in parallel by N/2xP threads. We will run a 2D grid of threads. First dimension will be the butterfly index as before (size N/2), and second dimension will be word index in the output numbers (size P). The choice of the direction of the transform can easily be introduced in the kernel, as it only acts on the direction of the shifts. The modified kernel is the following:

// Compute N butterflies in parallel. // N = get_global_size(0) is _half_ the row count of X and Y. // P = get_global_size(1) is the size of each row. // X[2*N*P] is the input buffer. // Y[2*N*P] is the output buffer. // K is the recursion level, from 0 to LOG2(N). // DIR is the direction +1 = direct transform, -1 = inverse transform. // Each butterly combines vectors I and I+(1<<STEP) __kernel void combine_kernel(__global const int * x,__global int * y,int k,int dir) { int i1 = get_global_id(0); int j1 = get_global_id(1); int p = get_global_size(1); // Insert a 0 bit in I1 at position STEP, to get the index of the first vector to combine int bit = (1<<k); int m = i1 & (bit-1); // M = lower STEP-1 bits of I i1 = ((i1-m)<<1)+m; int i2 = i1 | bit; // index of the second vector to combine // Here, I1 and I2 are the indices of the rows to combine, int shift = (p>>k)*m; int sign = (shift & p)?-1:1; shift &= p-1; // shift mod p, because p is a power of 2 int j2 = j1 - dir * shift; if (j2<0) { j2 += p; sign = -sign; } if (j2>=p) { j2 -= p; sign = -sign; } // Here J1 and J2 are the input indices of the words to combine in these rows. // On output, both word indices are J1. int x1 = x[p*i1+j1]; int x2 = x[p*i2+j2]; y[p*i1+j1] = (sign>0)?(x1+x2):(x1-x2); y[p*i2+j1] = (sign<0)?(x1+x2):(x1-x2); }

The other operations are implemented using the same 2D pattern of NxP threads,
and work on numbers stored on P words modulo B^{P}+1:

- brcopy does a k bit-reversal (only the k least-significant bits of the row indices are reversed) copy of N rows of P words,
- normalize normalizes the redundant representation of N rows of P words,
- shr shifts right by k bits N rows of P words.

mpz is GMP 5.0.0 running on Core 2 Q9600 GPU is NVidia GTX285, driver 195.30 fft is brcopy+shr+log2(N)*(combine+normalize) times are in milliseconds, rounded to 2 significant digitsN P mpz brcopy normalize shr combine fft1024 512 25 0.14 0.095 0.091 0.092 2.1 1024 1024 57 0.25 0.14 0.18 0.15 3.4 1024 2048 140 0.46 0.26 0.26 0.26 5.9 2048 1024 140 0.49 0.26 0.26 0.27 6.5 2048 2048 310 0.97 0.48 0.57 0.48 12 2048 4096 710 2.0 0.92 0.97 0.91 23 4096 2048 700 1.7 0.91 0.96 0.92 25 4096 4096 1600 3.2 1.8 1.9 1.8 48 4096 8192 3400 6.3 3.9 3.9 3.5 98

Since memory access is the limiting factor here, incorporating the brcopy into the first combine, and fusing combine and normalize would yeld a significant gain.

Multiplying the two large numbers represented by NxP words requires 2 direct FFT's,
one reverse FFT, and the N products of numbers stored on P words. These products
are actually **much longer** than the FFT.

OpenCL Multiprecision : OpenCL SSA I | Top of Page | GPU Mandelbrot Set |