OpenCL Fast Fourier Transform

Eric Bainville - May 2010

Benchmarking the radix-r kernels

Let's benchmark our OpenCL kernels using the best variants of the code for each board. On the GTX285, we use a mix of radix-2, radix-4 variant C, and radix-8 variant B, all with native sin+cos. On the HD5870, we use a mix of radix-2, radix-4 variant A, radix-8 variant A, and optionally radix-16, all with native sin+cos. OpenCL kernel profiling was disabled to get the best running times. For each N, I tried several combinations (maximum radix, code variant, sin+cos) and report the lowest time reached.
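The "lowest time over several tries" measurement can be sketched in host code as follows. This is only an illustration, not the harness actually used for these benchmarks: the names `run_fn`, `lowest_time` and `count_call` are hypothetical, and the callback is assumed to return only once the transform has completed (in an OpenCL program, that would typically mean ending it with a call to `clFinish` on the command queue).

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: runs one complete transform and
   returns only when the work is finished. */
typedef void (*run_fn)(void *ctx);

/* Wall-clock time in seconds, read in the host thread (C11 timespec_get). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Call run(ctx) `tries` times and return the lowest wall-clock time observed. */
double lowest_time(run_fn run, void *ctx, int tries)
{
    assert(run != 0 && tries > 0);
    double best = 0.0;
    for (int i = 0; i < tries; i++) {
        double t0 = wall_seconds();
        run(ctx);
        double dt = wall_seconds() - t0;
        if (i == 0 || dt < best)
            best = dt;
    }
    return best;
}

/* Stand-in workload for illustration only: counts how often it is called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

Keeping the minimum rather than the average filters out runs disturbed by other activity on the machine, which matches the "lowest time reached" reported here. The same function can be called once per candidate configuration, keeping the smallest result for each N.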

Benchmarks of the OpenCL kernels on both GTX285 and HD5870, and of CUFFT on the GTX285. We report the Gflop/s as a function of log2(N), assuming 5·N·log2(N) flop for a DFT of size N. The measured time is the wall-clock time in the host thread (not kernel profiling times), and data transfers are not counted.
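As a reminder of how the Gflop/s figures follow from the measured times, here is a minimal sketch of the conversion under the 5·N·log2(N) convention stated above. The function names are hypothetical and chosen for this example only.

```c
#include <assert.h>

/* Nominal flop count of one DFT of size N = 2^log2n,
   using the 5*N*log2(N) convention. */
double dft_nominal_flop(int log2n)
{
    assert(log2n >= 0 && log2n < 62);
    double n = (double)(1ULL << log2n);  /* N = 2^log2n */
    return 5.0 * n * (double)log2n;
}

/* Gflop/s for one DFT of size 2^log2n completed in `seconds` of wall-clock time. */
double dft_gflops(int log2n, double seconds)
{
    assert(seconds > 0.0);
    return dft_nominal_flop(log2n) / seconds * 1e-9;
}
```

For example, with a made-up time of 1 ms (not a measurement from these benchmarks) for N = 2^20, `dft_gflops(20, 1e-3)` gives about 104.9 Gflop/s. Note that 5·N·log2(N) is a nominal count used to compare implementations, not the number of operations a given kernel really executes.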

The multi-kernel approach without shared memory performs well on the ATI HD5870, matching CUFFT on the GTX285 for large sizes.

Even more interesting is the performance measured including host-device transfers. This is what matters to the user, unless more pre- or post-processing of the data is also executed on the GPU.

Benchmarks of the OpenCL kernels on both GTX285 and HD5870, CUFFT on the GTX285, and Intel MKL FFT on the Core i7. The measured time is the wall-clock time in the host thread, including host-device data transfers.

On the next page, we will allocate all threads of a single work-group to the computation of a larger radix transform: One work-group per DFT (1).