
Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search

Mohamed Amine Bergach

Abstract

An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the \emph{context-free} model, nodes represent computation stages and edge weights are independently measured instruction costs. In the \emph{context-aware} model, nodes are expanded to encode the \emph{predecessor edge type}, so that edge weights capture inter-operation correlations such as cache warming -- the cost of operation~B depends on which operation~A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW \citep{FrigoJohnson1998}: that optimal-substructure assumptions break down ``because of the different states of the cache.'' On Apple M1 NEON, the context-free Dijkstra finds an arrangement running at 22.1~GFLOPS (74\% of optimal). The context-aware Dijkstra discovers $\text{R4} \to \text{R2} \to \text{R4} \to \text{R4} \to \text{Fused-8}$ at 29.8~GFLOPS -- a $5.2\times$ improvement over pure radix-2 and 34\% faster than the context-free result. This arrangement includes a radix-2 pass \emph{sandwiched between} radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.
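As a minimal sketch of the context-free model, the search below runs Dijkstra over the stage DAG for $N = 2^L$: a node is a stage index, a radix-$2^k$ pass is an edge advancing the stage by $k$, and a path from stage 0 to stage $L$ is a complete FFT. The per-pass costs are hypothetical placeholders chosen only to make the example run, not the measured M1 NEON figures from the paper.

```python
import heapq

# Context-free model: a node is a stage index s in 0..L; a radix-2^k butterfly
# pass is an edge s -> s+k with a fixed, context-independent cost. A path from
# 0 to L is a complete N = 2^L point FFT; the shortest path is the fastest
# arrangement. Costs below are HYPOTHETICAL placeholders, not measured numbers.
L = 10                      # N = 1024
EDGES = {
    "R2": (1, 3.0),         # (stage advance, assumed cost per pass)
    "R4": (2, 4.5),
    "R8": (3, 7.0),
}

def shortest_path(L, edges):
    """Dijkstra from stage 0 to stage L; returns (total cost, radix sequence)."""
    dist = {0: 0.0}
    pq = [(0.0, 0, [])]     # (cost so far, stage, path taken)
    while pq:
        d, s, path = heapq.heappop(pq)
        if s == L:
            return d, path
        if d > dist.get(s, float("inf")):
            continue        # stale heap entry
        for name, (adv, cost) in edges.items():
            t, nd = s + adv, d + cost
            if t <= L and nd < dist.get(t, float("inf")):
                dist[t] = nd
                heapq.heappush(pq, (nd, t, path + [name]))
    raise ValueError("stage L unreachable")

cost, plan = shortest_path(L, EDGES)   # with these costs: five radix-4 passes
```

Because every edge weight here is fixed, the classic optimal-substructure argument holds and plain Dijkstra is exact for this model; the abstract's point is that this exactness is lost once real caches make edge costs history-dependent.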

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Context-free computation graph for $N = 1024$ ($L = 10$). Edges: radix-2 (blue), radix-4 (orange), radix-8 (red), fused register blocks (green). A path from 0 to 10 is a complete FFT; the shortest path is the fastest. Subset of 30+ edges shown.
  • Figure 2: Context-aware graph (partial view). Each node is expanded by predecessor type: $(3, \text{R2})$ means "stage 3, reached via R2." The edge from $(2, \text{R4})$ to $(3, \text{R2})$ captures R2's cost after R4's cache residual. Optimal path (red): R4 $\to$ R2 $\to$ R4 $\to$ R4 $\to$ F8.
  • Figure 3: Three decompositions for $N = 1024$. Top: pure radix-2 (10 passes, 5.7 GF). Middle: context-free Dijkstra (R4+F8+F32, 22.1 GF). Bottom: context-aware Dijkstra (R4$\to$R2$\to$R4$\to$R4$\to$F8, 29.8 GF). Dashed box: the radix-2 pass exploiting cache residuals from the preceding R4.
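The mechanism behind Figures 2 and 3 can be sketched directly: expand each node to (stage, predecessor op) and let the edge-cost function inspect the predecessor. All numbers below are invented for illustration, including the simplifying assumption that the fused-8 block is usable only as the final edge; the only structural point is a discounted "R2 after R4" edge mimicking the cache-residual effect.

```python
import heapq

def dijkstra(L, ops, cost_fn):
    """Shortest path from stage 0 to stage L over expanded nodes (stage, pred-op).

    ops maps op name -> stage advance; cost_fn(pred, op) may depend on the
    predecessor, which is exactly what the context-aware model encodes.
    """
    start = (0, None)
    dist = {start: 0.0}
    prev = {}                      # node -> (previous node, op), for path recovery
    pq = [(0.0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue               # stale heap entry
        stage, pred = node
        if stage == L:             # first goal node popped is the cheapest
            plan = []
            while node in prev:
                node, op = prev[node]
                plan.append(op)
            return d, plan[::-1]
        for op, adv in ops.items():
            if stage + adv > L:
                continue
            if op == "F8" and stage + adv != L:
                continue           # ASSUMPTION: fused-8 register block only at the end
            nxt = (stage + adv, op)
            nd = d + cost_fn(pred, op)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = ((stage, pred), op)
                heapq.heappush(pq, (nd, nxt))
    raise ValueError("stage L unreachable")

OPS = {"R2": 1, "R4": 2, "F8": 3}          # stage advance per pass
BASE = {"R2": 3.5, "R4": 4.0, "F8": 5.0}   # hypothetical per-pass costs

def cost_cf(pred, op):                     # context-free: predecessor ignored
    return BASE[op]

def cost_ca(pred, op):                     # context-aware: R2 is cheap after R4
    if op == "R2" and pred == "R4":
        return 1.5                         # hypothetical cache-residual discount
    return BASE[op]

cf_cost, cf_plan = dijkstra(10, OPS, cost_cf)
ca_cost, ca_plan = dijkstra(10, OPS, cost_ca)
```

Under the context-free costs, an R2 pass is never worth its price and the optimum degenerates to pure radix-4; with the same Dijkstra on the expanded graph, the discounted edge makes a plan containing an R2 pass strictly cheaper. The R2 pass is optimal only in context, which is the effect the paper's measured costs exhibit on the M1.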