OpenCL Fast Fourier Transform

Eric Bainville - May 2010


In this article, we will experiment various OpenCL implementations of one-dimensional Fast Fourier Transform (FFT) algorithms.

This article is becoming rather long now. It was written in several sessions, interrupted by other higher priority tasks (i.e. paid contracts). The objective is to reach an FFT implementation in OpenCL with performances at least matching NVidia CUFFT, and running on both ATI and NVidia GPU's.

This objective is reached in the Radix 4,8,16,32 kernels page, using several chained calls of simple kernels not using shared memory. The experiments on kernels using shared memory are in progress, and the last two pages of the articles will be updated.


  1. Introduction - Introduction.
  2. DFT - Discrete Fourier Transform.
  3. FFT - Fast Fourier Transform algorithms.
  4. Reference implementations - FFTW, Intel MKL, and NVidia CUFFT.
  5. Radix-2 kernel - Simple radix-2 OpenCL kernel.
  6. Radix 4,8,16,32 kernels - Extension to radix-4,8,16, and 32 kernels.
  7. Radix-r kernels benchmarks - Benchmarks of the radix-r kernels.
  8. One work-group per DFT (1) - One DFT2r per work-group of size r, values in local memory.
  9. One work-group per DFT (2) - One DFT2r per work-group of size r, values in registers, transferts using local memory.



Execution times are measured on the following hardware:

  Intel Core 2 Quad Q9550 (4 cores) @2.83 GHz (stock speed)
  Chipset Intel P45
  12GB of DDR2 @800 MHz
  Linux 64-bit kernel-2.6.32
  glibc-2.10.1 gcc-4.3.4 fftw-3.2.2 mkl-

Core i7:
  Intel Core i7 920 (4 cores, 8 threads) @3.33 GHz (overclocked)
  Chipset Intel X58
  6GB of DDR3 @1.33 GHz
  Windows 7 Ultimate 64-bit
  vs-2010 mkl-

  NVidia GTX285 1GB
  Chipset Intel P45
  Linux 64-bit kernel-2.6.32
  NVidia driver 195.36.24 + CUDA toolkit 3.0
  OpenCL compiler option -cl-fast-relaxed-math

  ATI Radeon HD5870 1GB
  Chipset Intel X58
  Windows 7 Ultimate 64-bit
  ATI Catalyst 10.5 + Stream SDK 2.1