
SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts

Jacob Fein-Ashley, Neelesh Gupta, Rajgopal Kannan, Viktor Prasanna

TL;DR

SPECTRE tackles the long-context bottleneck of transformers by replacing self-attention with a real FFT-based token mixer that uses content-adaptive spectral gates, achieving per-layer cost $O(n\log n)$. A Prefix–FFT cache enables efficient autoregressive decoding, while an optional Wavelet Refinement Module restores local detail without compromising the asymptotic efficiency. The approach preserves the original architecture and requires fewer than 6% additional parameters, enabling hundred-kilotoken contexts on commodity GPUs with strong throughput gains (up to $7\times$ faster than FlashAttention-2 at 128k tokens) and minimal loss in accuracy on PG‑19 and ImageNet‑1k. With persistent memory options and easy fine-tuning, SPECTRE provides an immediate, drop-in upgrade path for long-context models, potentially broadening practical use of large transformers in real-world tasks.

Abstract

Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications, from multi-turn dialogue to high-resolution vision, require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $\mathcal{O}(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to $7\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.
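
The transform, gate, inverse-transform pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `rdft`, `irdft`, and `spectral_mix` are hypothetical, the gate is a fixed list of per-frequency scalars rather than the content-adaptive gate of the paper, and a naive $\mathcal{O}(n^{2})$ DFT stands in for a real FFT so that the example needs only the standard library.

```python
import cmath

def rdft(x):
    """Naive real DFT: the n//2 + 1 non-redundant coefficients of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def irdft(X, n):
    """Rebuild the length-n real sequence from its half spectrum."""
    out = []
    for t in range(n):
        acc = X[0].real  # DC bin
        for k in range(1, len(X)):
            term = (X[k] * cmath.exp(2j * cmath.pi * k * t / n)).real
            # Every bin except DC and (for even n) Nyquist represents a conjugate pair.
            is_nyquist = (n % 2 == 0 and k == n // 2)
            acc += term if is_nyquist else 2.0 * term
        out.append(acc / n)
    return out

def spectral_mix(x, gate):
    """Mix a real sequence by scaling each frequency bin (hypothetical fixed gate)."""
    X = rdft(x)
    gated = [g * c for g, c in zip(gate, X)]
    return irdft(gated, len(x))

x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
identity = spectral_mix(x, [1.0] * 5)                 # all-pass gate: output equals input
dc_only = spectral_mix(x, [1.0, 0.0, 0.0, 0.0, 0.0])  # keep only the mean of the sequence
```

With an all-pass gate the mixer is the identity, and with only the DC bin kept every output position holds the sequence mean, which shows how a single diagonal gate in frequency space mixes information across all token positions at once.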

Paper Structure

This paper contains 52 sections, 2 theorems, 11 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Theorem B.1

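Let $x=(x_{0},\dots,x_{n-1})\in\mathbb{R}^{n}$ be a real-valued sequence and define its discrete Fourier transform (DFT) as $X_{k}=\sum_{t=0}^{n-1}x_{t}\,e^{-2\pi i k t/n}$ for $k=0,\dots,n-1$. Then the spectrum satisfies the Hermitian symmetry $X_{n-k}=X_{k}^{*}$ for $k=1,\dots,n-1$, where $(\cdot)^{*}$ denotes complex conjugation.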
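
A quick numerical check of the Hermitian symmetry may help make the statement concrete. The sketch below assumes the common DFT convention with a negative exponent and no normalization; the helper names `dft` and `max_hermitian_error` are hypothetical. It also shows that the symmetry fails once the input is no longer real.

```python
import cmath

def dft(x):
    """Naive full DFT, assuming the convention X_k = sum_t x_t * exp(-2*pi*i*k*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def max_hermitian_error(x):
    """Largest |X[(n-k) % n] - conj(X[k])| over all bins k."""
    X = dft(x)
    n = len(x)
    return max(abs(X[(n - k) % n] - X[k].conjugate()) for k in range(n))

real_seq = [0.3, -1.2, 2.5, 0.0, 4.1, -0.7, 1.9, 0.6]
complex_seq = [0.3 + 1j] + real_seq[1:]  # same sequence with one non-real entry

err_real = max_hermitian_error(real_seq)        # rounding error only
err_complex = max_hermitian_error(complex_seq)  # clearly non-zero
```

For the real sequence the error is at the level of floating-point rounding, while the single imaginary entry in `complex_seq` breaks the symmetry in every bin.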

Figures (6)

  • Figure 1: Inference scaling of a Llama-3.2-1B model equipped with three different attention kernels. We fine-tune an identical backbone with (i) standard softmax-dot-product attention (SDPA, blue), (ii) FlashAttention-2 [Dao, 2023] (grey), and (iii) the proposed SPECTRE mixer (red). After training, we measure tokens-per-second throughput (left) and single-batch latency (right) on an NVIDIA A100 (80 GB) for sequence lengths $L\!\in\!\{512,\,1\mathrm{k},\,4\mathrm{k},\,8\mathrm{k},\,32\mathrm{k},\,128\mathrm{k}\}$. Dashed black lines show the ideal $\mathcal{O}(n^{2})$ and $\mathcal{O}(n\log n)$ slopes. Higher throughput and lower latency are better (green arrows). SPECTRE retains the accuracy of the backbone yet delivers near-$\mathcal{O}(n\log n)$ runtime, remaining flat up to $32$k tokens and sustaining a $7\times$ speed-up over FlashAttention-2 at the extreme $128$k-token setting.
  • Figure 2: Real FFT: an 8-sample real sequence maps to $(n/2{+}1)$ unique coefficients; the shaded half is redundant.
  • Figure 3: Drop-in SPECTRE layer. The SPECTRE mixing block (pink) and the optional Wavelet Refinement Module (WRM) can be inserted between the embedding layer and the feed-forward network (FFN) of a standard Transformer. A Prefix–FFT cache (green, dashed) mirrors the attention KV-cache, enabling efficient autoregressive decoding without altering residual pathways or layer normalization placements. Existing checkpoints therefore require only minimal fine-tuning to upgrade from attention to SPECTRE.
  • Figure 4: SPECTRE’s frequency-domain token mixing. Token embeddings are projected, transformed via a real FFT, gated per frequency by a content-adaptive diagonal mask (with positional phase), and returned to token space using an inverse FFT. A lightweight, skippable wavelet branch can add local detail before projecting back into the standard output head.
  • Figure 5: End-to-end efficiency at two sequence lengths. Left: Single-batch latency (ms; $\downarrow$ lower is better) for $L{=}4$k and $L{=}32$k tokens. Right: Throughput in tokens per second ($\uparrow$ higher is better) for the same lengths. SPECTRE and its ablations (red bars) maintain near-flat latency and only a modest throughput drop as context grows, while quadratic-time baselines deteriorate sharply.
  • ...and 1 more figure
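
To give a feel for the gap between the $\mathcal{O}(n^{2})$ and $\mathcal{O}(n\log n)$ reference slopes in Figure 1, the following sketch computes the idealized operation-count ratio $n^{2}/(n\log_{2}n)=n/\log_{2}n$ at the sequence lengths listed in that caption. This is plain arithmetic, not a measurement: it ignores constant factors, memory traffic, and every other part of the model, and it assumes that 1k means 1024 tokens. The function name `op_count_ratio` is hypothetical.

```python
import math

def op_count_ratio(n):
    """Idealized ratio n^2 / (n * log2 n); not a measured speed-up."""
    return (n * n) / (n * math.log2(n))

# Sequence lengths from the Figure 1 caption, assuming 1k = 1024 tokens.
lengths = [512, 1024, 4096, 8192, 32768, 131072]
ratios = {n: op_count_ratio(n) for n in lengths}

for n, r in ratios.items():
    print(f"L={n:>6}: n^2 / (n log2 n) ~ {r:,.1f}")
```

The idealized ratio grows from roughly 57 at 512 tokens to several thousand at 128k tokens, far above the reported $7\times$ end-to-end speed-up over FlashAttention-2. One plausible reading is that the measured figure also reflects constant factors and the cost of the unchanged parts of the network, which these counts leave out.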

Theorems & Definitions (5)

  • Theorem B.1: Hermitian symmetry of the DFT
  • proof
  • Corollary B.2: Sufficient statistics of the half spectrum
  • proof
  • Remark 1: Odd $n$
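
The entries above concern the half spectrum of a real sequence and the odd-length case. As a hedged illustration based on standard DFT facts rather than on the paper's own wording, the sketch below counts the independent real numbers in the $\lfloor n/2\rfloor+1$ retained coefficients for an even and an odd length. The helper names `half_spectrum` and `real_degrees_of_freedom` are hypothetical.

```python
import cmath

def half_spectrum(x):
    """The n//2 + 1 retained DFT coefficients of a real sequence (naive DFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def real_degrees_of_freedom(n):
    """Count independent real numbers in the half spectrum of a length-n real sequence."""
    has_nyquist = (n % 2 == 0)      # a Nyquist bin k = n/2 exists only for even n
    bins = n // 2 + 1
    complex_bins = bins - 1 - (1 if has_nyquist else 0)
    # DC and Nyquist are real (one number each); every other bin is complex (two numbers).
    return 1 + (1 if has_nyquist else 0) + 2 * complex_bins

x8 = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]  # even length
x7 = x8[:7]                                      # odd length

H8 = half_spectrum(x8)  # 5 coefficients; the last one is the real Nyquist bin
H7 = half_spectrum(x7)  # 4 coefficients; the last one is genuinely complex
```

In both cases the count of independent real numbers equals $n$, which is consistent with the half spectrum carrying exactly the information of the original sequence. For odd $n$ there is no Nyquist bin, so every retained coefficient other than DC is complex.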