The Condensate Theorem: Transformers are $O(n)$, Not $O(n^2)$
Jorge L. Ruiz Williams
TL;DR
The Condensate Theorem states that attention in trained transformers is effectively sparse: most attention mass falls on a finite Condensate Set consisting of an Anchor, a Local Window, and a dynamic Top-$k$ subset selected via $QK^T$ scores. Computing softmax over only this set for each query reduces attention from $O(n^2)$ to $O(n)$, and the reduction is reported as lossless: outputs are bit-identical in IEEE 754 float32 under greedy decoding across 12 architectures. The authors report exact token equivalence, applicability to RoPE and GQA models, measured speedups of 159x at 131K tokens (with larger projected speedups at 1M tokens), and substantial KV cache compression. The work argues that the quadratic bottleneck is an artifact of naive implementation rather than something intrinsic to the model's intelligence, and that it can be avoided with learned, topology-aware sparsity. In practice, this would make long-context inference scalable without retraining, cutting cost on long-sequence tasks while preserving model behavior.
Abstract
We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold, and that this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-$k$) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation; it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94 ms vs 628 ms) and a projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not a requirement of intelligence.
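To make the Anchor + Window + Dynamic Top-$k$ construction concrete, the following is a minimal NumPy sketch of single-query attention restricted to such a set. It is an illustration under stated assumptions, not the authors' kernel: the function names, the choice of position 0 as the Anchor, the "most recent positions" reading of the Local Window, and the default sizes are all hypothetical. The sketch also computes every $QK^T$ score before selecting the Top-$k$, so it shows only the structure of the computation and demonstrates neither the reported speedup nor the bit-exactness claim.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def full_attention(q, K, V):
    # Standard scaled dot-product attention for one query over all n keys.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def condensate_indices(scores, n_anchor=1, window=4, top_k=4):
    # Assumed layout: first n_anchor positions are the Anchor, the last
    # `window` positions are the Local Window, and Top-k is taken over
    # the remaining positions by raw attention score.
    n = scores.shape[0]
    keep = np.zeros(n, dtype=bool)
    keep[:n_anchor] = True
    keep[max(0, n - window):] = True
    rest = np.flatnonzero(~keep)
    if rest.size > 0 and top_k > 0:
        k = min(top_k, rest.size)
        best = rest[np.argpartition(scores[rest], -k)[-k:]]
        keep[best] = True
    return np.flatnonzero(keep)  # sorted position indices

def condensate_attention(q, K, V, **kwargs):
    # Softmax is renormalized over the selected positions only.
    # Note: all scores are computed here purely to pick the Top-k.
    scores = K @ q / np.sqrt(q.shape[0])
    idx = condensate_indices(scores, **kwargs)
    return softmax(scores[idx]) @ V[idx], idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 64, 16
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    q = rng.standard_normal(d)
    out, idx = condensate_attention(q, K, V)
    gap = np.max(np.abs(out - full_attention(q, K, V)))
    print("selected positions:", idx)
    print("max abs difference from full attention:", gap)
```

On random inputs like these the difference from full attention is generally not zero, because random scores do not concentrate their mass on a small set. The theorem's claim concerns trained models, where the mass left outside the set is said to be small enough that the outputs coincide.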
