# OpenCL Fast Fourier Transform

Eric Bainville - May 2010

We will now merge two consecutive iterations. Two iterations of the FFT radix-2 algorithm will combine 4 sub-sequences of length p into one sequence of length 4p, corresponding to a "radix-4" algorithm. The radix-4 "butterfly" takes 4 input cells and produces 4 outputs. It computes 4 radix-2 butterflies.

The radix-4 butterfly simplifies to a length 4 DFT applied on scaled inputs (a+0.u,k)p, (a+1.u,k)p * ω1, (a+2.u,k)p * ω2, and (a+3.u,k)p * ω3, where ω=e-2*π*i/(2p).

First, we have to define the following new functions:

```// Return a*EXP(-I*PI*1/2) = a*(-I)
float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }

// Return a^2
float2 sqr_1(float2 a)
{ return (float2)(a.x*a.x-a.y*a.y,2.0f*a.x*a.y); }

// Return the 2x DFT2 of the four complex numbers in A
// If A=(a,b,c,d) then return (a',b',c',d') where (a',c')=DFT2(a,c)
// and (b',d')=DFT2(b,d).
float8 dft2_4(float8 a) { return (float8)(a.lo+a.hi,a.lo-a.hi); }

// Return the DFT of 4 complex numbers in A
float8 dft4_4(float8 a)
{
// 2x DFT2
float8 x = dft2_4(a);
// Shuffle, twiddle, and 2x DFT2
return dft2_4((float8)(x.lo.lo,x.hi.lo,x.lo.hi,mul_p1q2(x.hi.hi)));
}
```

We implemented several versions of the radix-4 kernel:

```// T = N/4 = number of threads.
// P is the length of input sub-sequences, 1,4,16,...,N/4.
__kernel void fft_radix4(__global const float2 * x,__global float2 * y,int p)
{
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,1,2,3}*T
x += i;
// Output indices are J+{0,1,2,3}*P, where
// J is I with two 0 bits inserted at bit log2(P)
y += ((i-k)<<2) + k;

// Twiddling factors are exp(_I*PI*{0,1,2,3}*K/2P)
float alpha = -M_PI*(float)k/(float)(2*p);

[VARIANT CODE]
}
```

Variant A, working with float2 numbers:

```// Load and twiddle
float2 u0 = x;
float2 u1 = mul_1(exp_alpha_1(alpha),x[t]);
float2 u2 = mul_1(exp_alpha_1(2*alpha),x[2*t]);
float2 u3 = mul_1(exp_alpha_1(3*alpha),x[3*t]);

// 2x DFT2 and twiddle
float2 v0 = u0 + u2;
float2 v1 = u0 - u2;
float2 v2 = u1 + u3;
float2 v3 = mul_p1q2(u1 - u3); // twiddle

// 2x DFT2 and store
y = v0 + v2;
y[p] = v1 + v3;
y[2*p] = v0 - v2;
y[3*p] = v1 - v3;
```

Variant B, working with float8 numbers and the dft4_4 function:

```// Load, and DFT4
float8 u = dft4_4( (float8)(
x,
mul_1(exp_alpha_1(alpha),x[t]),
mul_1(exp_alpha_1(2*alpha),x[2*t]),
mul_1(exp_alpha_1(3*alpha),x[3*t]) ));
// Store
y = u.lo.lo;
y[p] = u.lo.hi;
y[2*p] = u.hi.lo;
y[3*p] = u.hi.hi;
```

And variant C: variant A where the 3 calls to exp_alpha and 3 mul are replaced with one single call to exp_alpha and 5 mul:

```// Load and twiddle, one exp_alpha computed instead of 3
float2 twiddle = exp_alpha_1(alpha);
float2 u0 = x;
float2 u1 = mul_1(twiddle,x[t]);
float2 u2 = x[2*t];
float2 u3 = mul_1(twiddle,x[3*t]);
twiddle = sqr_1(twiddle);
u2 = mul_1(twiddle,u2);
u3 = mul_1(twiddle,u3);

// 2x DFT2 and twiddle
float2 v0 = u0 + u2;
float2 v1 = u0 - u2;
float2 v2 = u1 + u3;
float2 v3 = mul_p1q2(u1 - u3); // twiddle

// 2x DFT2 and store
y = v0 + v2;
y[p] = v1 + v3;
y[2*p] = v0 - v2;
y[3*p] = v1 - v3;
```

log2(N)/2 iterations each running N/4 threads are required. I measured the best time (among all work-group sizes) for all 3 variants of the kernel and all 3 variants of the cos+sin computation:

ABCABC
sincos 38 ms37 ms31 ms33 ms34 ms38 ms
sin+cos 33 ms32 ms31 ms51 ms51 ms39 ms
native sin+cos33 ms32 ms31 ms38 ms39 ms38 ms
Running times for the radix-4 kernel, N=224. The kernel is called 12 times.

For the NVidia GTX285, the best performance (green cells) is obtained when computing only one cos+sin instead of three, independently of the cos+sin computation variant. On the HD5870, the best performance (red cell) is reached while working on float2 numbers, and using sincos to compute cos+sin. On the ATI GPU, the native cos+sin is slightly slower, and the precise cos+sin is really slow.

Pushing further this idea, we can implement a radix-8 kernel. Let's first define the required functions:

```// mul_p*q*(a) returns a*EXP(-I*PI*P/Q)
#define mul_p0q1(a) (a)

#define mul_p0q2 mul_p0q1
float2  mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }

__constant float SQRT_1_2 = 0.707106781188f; // cos(Pi/4)
#define mul_p0q4 mul_p0q2
float2  mul_p1q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(a.x+a.y,-a.x+a.y); }
#define mul_p2q4 mul_p1q2
float2  mul_p3q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(-a.x+a.y,-a.x-a.y); }
```

```// T = N/8 = number of threads.
// P is the length of input sub-sequences, 1,8,64,...,N/8.
__kernel void fft_radix8(__global const float2 * x,__global float2 * y,int p)
{
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,1,2,3,4,5,6,7}*T
x += i;
// Output indices are J+{0,1,2,3,4,5,6,7}*P, where
// J is I with three 0 bits inserted at bit log2(P)
y += ((i-k)<<3) + k;

// Twiddling factors are exp(_I*PI*{0,1,2,3,4,5,6,7}*K/4P)
float alpha = -M_PI*(float)k/(float)(4*p);

{VARIANT CODE]

// 4x in-place DFT2 and twiddle
float2 v0 = u0 + u4;
float2 v4 = mul_p0q4(u0 - u4);
float2 v1 = u1 + u5;
float2 v5 = mul_p1q4(u1 - u5);
float2 v2 = u2 + u6;
float2 v6 = mul_p2q4(u2 - u6);
float2 v3 = u3 + u7;
float2 v7 = mul_p3q4(u3 - u7);

// 4x in-place DFT2 and twiddle
u0 = v0 + v2;
u2 = mul_p0q2(v0 - v2);
u1 = v1 + v3;
u3 = mul_p1q2(v1 - v3);
u4 = v4 + v6;
u6 = mul_p0q2(v4 - v6);
u5 = v5 + v7;
u7 = mul_p1q2(v5 - v7);

// 4x DFT2 and store (reverse binary permutation)
y   = u0 + u1;
y[p]   = u4 + u5;
y[2*p] = u2 + u3;
y[3*p] = u6 + u7;
y[4*p] = u0 - u1;
y[5*p] = u4 - u5;
y[6*p] = u2 - u3;
y[7*p] = u6 - u7;
}
```

Variant A of the load and twiddle code computes cos+sin 7 times, and 7 complex number products:

```// Load and twiddle (variant A)
float2 u0 = x;
float2 u1 = mul_1(exp_alpha_1(alpha),x[t]);
float2 u2 = mul_1(exp_alpha_1(2*alpha),x[2*t]);
float2 u3 = mul_1(exp_alpha_1(3*alpha),x[3*t]);
float2 u4 = mul_1(exp_alpha_1(4*alpha),x[4*t]);
float2 u5 = mul_1(exp_alpha_1(5*alpha),x[5*t]);
float2 u6 = mul_1(exp_alpha_1(6*alpha),x[6*t]);
float2 u7 = mul_1(exp_alpha_1(7*alpha),x[7*t]);
```

Variant B of the load and twiddle block uses only 1 cos+sin call, and computes the remaining twiddle factors using 14 complex number products. As we have seen above, it should run faster on the GTX285.

```// Load and twiddle (variant B)
float2 twiddle = exp_alpha_1(alpha); // W
float2 u0 = x;
float2 u1 = mul_1(twiddle,x[t]);
float2 u2 = x[2*t];
float2 u3 = mul_1(twiddle,x[3*t]);
float2 u4 = x[4*t];
float2 u5 = mul_1(twiddle,x[5*t]);
float2 u6 = x[6*t];
float2 u7 = mul_1(twiddle,x[7*t]);
twiddle = sqr_1(twiddle); // W^2
u2 = mul_1(twiddle,u2);
u3 = mul_1(twiddle,u3);
u6 = mul_1(twiddle,u6);
u7 = mul_1(twiddle,u7);
twiddle = sqr_1(twiddle); // W^4
u4 = mul_1(twiddle,u4);
u5 = mul_1(twiddle,u5);
u6 = mul_1(twiddle,u6);
u7 = mul_1(twiddle,u7);
```

As in the radix-4 kernels, I measured the best execution time for the two variants of the code and the three variants of cos+sin computation:

A B A B
sincos 40 ms30 ms25 ms30 ms
sin+cos 31 ms30 ms40 ms29 ms
native sin+cos31 ms30 ms29 ms29 ms
Running times for the radix-8 kernel, N=224. The kernel is called 8 times.

The GTX285 performance remains comparable to kernel-4 performance. On the HD5870 side, we observe a nice improvement. Let's continue in the same spirit with a radix-16 kernel.

Extension to radix-16 is straightforward. First define the twiddling functions, and a combined DFT2+twiddle macro:

```__constant float COS_8 = 0.923879532511f; // cos(Pi/8)
__constant float SIN_8 = 0.382683432365f; // sin(Pi/8)
#define mul_p0q8 mul_p0q4
float2  mul_p1q8(float2 a) { return mul_1((float2)(COS_8,-SIN_8),a); }
#define mul_p2q8 mul_p1q4
float2  mul_p3q8(float2 a) { return mul_1((float2)(SIN_8,-COS_8),a); }
#define mul_p4q8 mul_p2q4
float2  mul_p5q8(float2 a) { return mul_1((float2)(-SIN_8,-COS_8),a); }
#define mul_p6q8 mul_p3q4
float2  mul_p7q8(float2 a) { return mul_1((float2)(-COS_8,-SIN_8),a); }

// Compute in-place DFT2 and twiddle
#define DFT2_TWIDDLE(a,b,t) { float2 tmp = t(a-b); a += b; b = tmp; }
```

Then the radix-16 kernel is the following. We used an array of registers, in the hope the compiler would handle it correctly. Fixed-size arrays and loops are OK; but in some other cases, the private array may be physically located in global memory, annihilating kernel performance.

```// T = N/16 = number of threads.
// P is the length of input sub-sequences, 1,16,256,...,N/16.
__kernel void fft_radix16(__global const float2 * x,__global float2 * y,int p)
{
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,..,15}*T
x += i;
// Output indices are J+{0,..,15}*P, where
// J is I with four 0 bits inserted at bit log2(P)
y += ((i-k)<<4) + k;

float2 u;
for (int m=0;m<16;m++) u[m] = x[m*t];

// Twiddle, twiddling factors are exp(_I*PI*{0,..,15}*K/4P)
float alpha = -M_PI*(float)k/(float)(8*p);
for (int m=1;m<16;m++) u[m] = mul_1(exp_alpha_1(m*alpha),u[m]);

// 8x in-place DFT2 and twiddle (1)
DFT2_TWIDDLE(u,u,mul_p0q8);
DFT2_TWIDDLE(u,u,mul_p1q8);
DFT2_TWIDDLE(u,u,mul_p2q8);
DFT2_TWIDDLE(u,u,mul_p3q8);
DFT2_TWIDDLE(u,u,mul_p4q8);
DFT2_TWIDDLE(u,u,mul_p5q8);
DFT2_TWIDDLE(u,u,mul_p6q8);
DFT2_TWIDDLE(u,u,mul_p7q8);

// 8x in-place DFT2 and twiddle (2)
DFT2_TWIDDLE(u,u,mul_p0q4);
DFT2_TWIDDLE(u,u,mul_p1q4);
DFT2_TWIDDLE(u,u,mul_p2q4);
DFT2_TWIDDLE(u,u,mul_p3q4);
DFT2_TWIDDLE(u,u,mul_p0q4);
DFT2_TWIDDLE(u,u,mul_p1q4);
DFT2_TWIDDLE(u,u,mul_p2q4);
DFT2_TWIDDLE(u,u,mul_p3q4);

// 8x in-place DFT2 and twiddle (3)
DFT2_TWIDDLE(u,u,mul_p0q2);
DFT2_TWIDDLE(u,u,mul_p1q2);
DFT2_TWIDDLE(u,u,mul_p0q2);
DFT2_TWIDDLE(u,u,mul_p1q2);
DFT2_TWIDDLE(u,u,mul_p0q2);
DFT2_TWIDDLE(u,u,mul_p1q2);
DFT2_TWIDDLE(u,u,mul_p0q2);
DFT2_TWIDDLE(u,u,mul_p1q2);

// 8x DFT2 and store (reverse binary permutation)
y    = u  + u;
y[p]    = u  + u;
y[2*p]  = u  + u;
y[3*p]  = u + u;
y[4*p]  = u  + u;
y[5*p]  = u + u;
y[6*p]  = u  + u;
y[7*p]  = u + u;
y[8*p]  = u  - u;
y[9*p]  = u  - u;
y[10*p] = u  - u;
y[11*p] = u - u;
y[12*p] = u  - u;
y[13*p] = u - u;
y[14*p] = u  - u;
y[15*p] = u - u;
}
```
sincos 46 ms29 ms
sin+cos 45 ms38 ms
native sin+cos45 ms18 ms
Running times for the radix-16 kernel, N=224. The kernel is called 6 times. The timings reported here are real time in the host thread. The cumulated kernel profiling time returned by the ATI OpenCL driver can reach 16.7 ms. Timings without profiling are slightly better (see graph below).

Things are becoming quite interesting here: the HD5870 beats the performance of CUBLAS! Strangely, the native sin+cos is now faster than sincos. Each kernel continues to execute in 3 ms despite the increased computation load. Juste to see how far it can go before the number of registers becomes a limiting factor, let's implement a radix-32 kernel.

The problem with the radix-32 kernel is how we can test it: we can't allocate larger buffers on the HD5870, so we are limited to N=224, and 24 is not a multiple of 5. So we will have to call 4x radix-32 and 1x radix-16 to reach 24.

The code for the radix-32 kernel is a trivial evolution from the previous ones. We get the following running times: