SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts

Jacob Fein-Ashley; Neelesh Gupta; Rajgopal Kannan; Viktor Prasanna

TL;DR

SPECTRE tackles the long-context bottleneck of transformers by replacing self-attention with a real-FFT-based token mixer that uses content-adaptive spectral gates, achieving a per-layer cost of $\mathcal{O}(L\log L)$ in the sequence length $L$. A Prefix-FFT cache enables efficient autoregressive decoding, while an optional Wavelet Refinement Module restores local detail without compromising the asymptotic efficiency. The approach preserves the original architecture and requires fewer than 6% additional parameters, enabling hundred-kilotoken contexts on commodity GPUs with strong throughput gains (up to $7\times$ faster than FlashAttention-2 at 128k tokens) and minimal loss in accuracy on PG-19 and ImageNet-1k. With persistent memory options and easy fine-tuning, SPECTRE provides a drop-in upgrade path for long-context models, potentially broadening practical use of large transformers in real-world tasks.

Abstract

Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications, from multi-turn dialogue to high-resolution vision, require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $\mathcal{O}(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to $7\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.
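To make the FFT, spectral gate, inverse FFT pipeline described in the abstract concrete, here is a minimal NumPy sketch of a single non-causal mixing head. It is an illustration under stated assumptions, not the authors' implementation: the function name `spectral_mix`, the mean-pooled content summary, and the sigmoid gate parameterized by `w_gate` and `b_gate` are all hypothetical choices made for this example. The Prefix-FFT cache and the wavelet module are not modeled.

```python
import numpy as np


def spectral_mix(x, w_gate, b_gate):
    """Mix tokens along the sequence axis in the frequency domain.

    x      : (L, d) real token features for one head.
    w_gate : (d, F) hypothetical gate weights, with F = L // 2 + 1.
    b_gate : (F,)   hypothetical gate bias.
    Returns a real (L, d) array.
    """
    L, _ = x.shape
    # Real FFT over the sequence axis: O(L log L) per feature channel.
    X = np.fft.rfft(x, axis=0)                      # (F, d), complex
    # Content-adaptive gate (assumed form): pool the tokens, then
    # project the summary to one scalar gate per frequency bin.
    pooled = x.mean(axis=0)                         # (d,)
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_gate + b_gate)))  # (F,), in (0, 1)
    # Scale each frequency bin, then return to the token domain.
    Y = X * gate[:, None]
    return np.fft.irfft(Y, n=L, axis=0)             # (L, d), real


# Example call with random inputs; shapes are arbitrary.
rng = np.random.default_rng(0)
L, d = 16, 8
x = rng.standard_normal((L, d))
w_gate = 0.1 * rng.standard_normal((d, L // 2 + 1))
b_gate = np.zeros(L // 2 + 1)
y = spectral_mix(x, w_gate, b_gate)                 # y has shape (L, d)
```

Because the gate depends on the input through `pooled`, the filter changes with content, which is the property the abstract calls content-adaptive. Passing `n=L` to `irfft` keeps the output length correct for odd `L` as well. Note that a global FFT mixes every position with every other, so this sketch is not causal on its own.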
