Implementing FFTs in Practice

Steven G. Johnson; Matteo Frigo

TL;DR

A high-level overview of some of the engineering considerations that arise in high-performance implementations of fast Fourier transforms (FFTs), and of why optimized FFTs are very different from textbook "radix-2 Cooley-Tukey" FFT algorithms.

Abstract

This review article was first published in 2008 as chapter 11 in the book "Fast Fourier Transforms," edited by C. S. Burrus, for the Connexions project at Rice University, which is sadly no longer online. It gives a high-level overview of some of the engineering considerations that arise in high-performance implementations of fast Fourier transforms (FFTs). It explains why optimized FFTs are very different from textbook "radix-2 Cooley-Tukey" FFT algorithms: they must compensate for the memory hierarchy and exploit the large register sets and deep pipelines of modern CPUs. Using the FFTW library as a case study, it discusses tradeoffs in the use of recursion, generation of twiddle factors, code generation, and other algorithmic choices.

Paper Structure (24 sections, 8 equations, 3 figures, 1 algorithm)

Introduction
Review of the Cooley-Tukey FFT
FFTs and the Memory Hierarchy
Understanding FFTs with an ideal cache
Cache-obliviousness in practice
Memory strategies in FFTW
Adaptive Composition of FFT Algorithms
The problem to be solved
DFT problem examples
The space of plans in FFTW
Rank-0 plans
Rank-1 plans
Direct plans
Cooley-Tukey plans
Plans for higher vector ranks
...and 9 more sections

Figures (3)

Figure 1: The ratio of speed (1/time) between a highly optimized FFT (FFTW 3.1.2 [FFTWweb, FFTW05]) and a typical textbook radix-2 implementation (Numerical Recipes in C [PressFlaTeu92]) on a 3 GHz Intel Core Duo with the Intel C compiler 9.1.043, for single-precision complex-data DFTs of size $n$, plotted versus $\log_2 n$. Top line (squares) shows FFTW with SSE SIMD instructions enabled, which perform multiple arithmetic operations at once (see section \ref{sec:genfft:simd}); bottom line (circles) shows FFTW with SSE disabled, which thus requires a similar number of arithmetic instructions to the textbook code. (This is not intended as a criticism of Numerical Recipes, since simple radix-2 implementations are reasonable for pedagogy, but it illustrates the radical differences between straightforward and optimized implementations of FFT algorithms, even with similar arithmetic costs.) For $n \gtrsim 2^{19}$, the ratio increases because the textbook code becomes much slower (this happens when the DFT size exceeds the level-2 cache).
Figure 2: Schematic of traditional breadth-first (left) vs. recursive depth-first (right) ordering for a radix-2 FFT of size 8: the computations for each nested box are completed before doing anything else in the surrounding box. Breadth-first computation performs all butterflies of a given size at once, while depth-first computation completes one subtransform entirely before moving on to the next (as in algorithm \ref{alg:radix2-rec}).
Figure 3: Two possible decompositions for a size-30 DFT, both for the arbitrary choice of DIT radices 3 then 2 then 5, and prime-size codelets. Items grouped by a "$\{$" result from the plan for a single sub-problem. In the depth-first case, the vector rank was reduced to 0 as per section \ref{sec:struct:vrankr} before decomposing sub-problems, and vice versa in the breadth-first case.
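
The depth-first ordering described in the caption of Figure 2 can be illustrated with a minimal sketch. This is not the paper's algorithm listing and not FFTW code; it is a plain-Python radix-2 decimation-in-time recursion written for illustration, assuming the forward-transform sign convention $\omega_n = e^{-2\pi i/n}$. The names `fft_depth_first` and `naive_dft` are hypothetical. Each recursive call finishes one half-size subtransform completely before the other begins, which is the nested-box behavior the figure depicts.

```python
import cmath


def naive_dft(x):
    """O(n^2) reference DFT, used only to check the recursive version."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_depth_first(x):
    """Recursive radix-2 DIT FFT (illustrative sketch, power-of-two sizes only)."""
    n = len(x)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(x)

    # Depth-first: the even-index subtransform is computed to completion
    # before the odd-index subtransform is started.
    even = fft_depth_first(x[0::2])
    odd = fft_depth_first(x[1::2])

    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Twiddle factor omega_n^k applied to the odd-index subtransform.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t          # butterfly, upper output
        out[k + half] = even[k] - t   # butterfly, lower output
    return out
```

A breadth-first implementation would instead loop over butterfly sizes and sweep the whole array once per size. The sketch above allocates new lists at every level and recomputes twiddle factors, so it shows only the ordering of the computation, not the performance techniques the paper discusses.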
