The Condensate Theorem: Transformers are O(n), Not $O(n^2)$

Jorge L. Ruiz Williams

TL;DR

The Condensate Theorem shows that transformer attention is effectively sparse in trained models, with most mass focused on a finite Condensate Set defined by an Anchor, a Local Window, and a dynamic Top-$k$ subset selected via $QK^T$ scores. This allows a lossless reduction from $O(n^2)$ to $O(n)$ attention by computing softmax over the condensate per query, yielding exact bit-identical outputs in IEEE 754 float32 for greedy decoding across 12 architectures. The authors validate exact token equivalence, universal applicability to RoPE/GQA, and dramatic speedups (e.g., 159x at 131K tokens, with projected improvements at 1M tokens), along with substantial KV cache compression. The work suggests quadratic bottlenecks are artifacts of naive implementation rather than intrinsic intelligence, enabling practical, scalable long-context inference without retraining. Practically, this unlocks massive efficiency gains and cost reductions for long-sequence tasks while preserving model behavior. $O(n^2)$ is thus not a fundamental limit of transformer attention; it is an artifact that can be avoided with learned, topology-aware sparsity.
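To make the $O(n^2)$ versus $O(n)$ contrast concrete, here is a small counting sketch. It assumes causal attention and a per-query budget of one Anchor, `window` local positions, and `top_k` selected positions. The function names and the example values of `window` and `top_k` are hypothetical, and the count covers only the softmax terms, not the cost of selecting the Top-$k$ positions.

```python
def full_score_count(n: int) -> int:
    """Softmax terms for causal full attention: query i sees i + 1 keys."""
    return n * (n + 1) // 2


def condensate_score_count(n: int, window: int, top_k: int) -> int:
    """Softmax terms when each query sees at most 1 + window + top_k keys.

    Assumption: the budget is a fixed upper bound per query; early queries
    with fewer than `budget` visible keys simply use all of them.
    """
    budget = 1 + window + top_k  # anchor + local window + top-k
    return sum(min(i + 1, budget) for i in range(n))


if __name__ == "__main__":
    for n in (1_024, 8_192, 131_072):
        full = full_score_count(n)
        sparse = condensate_score_count(n, window=64, top_k=64)
        print(f"n={n}: full={full}, condensate={sparse}, ratio={full / sparse:.1f}")
```

With a fixed budget the condensate count grows linearly in `n` while the full count grows quadratically, so the ratio itself grows roughly linearly with sequence length.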

Abstract

We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold -- and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation -- it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.
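As a rough illustration of the projection described above, the following NumPy sketch builds a condensate set for a single causal query and computes softmax only over that set. The function names, the choice of position 0 as the Anchor, and the default `window` and `top_k` values are assumptions made for this example; this is not the paper's kernel. The sketch also scores every key in order to pick the Top-k, so it illustrates the reduced softmax support only, not the dynamic identification that avoids checking every position.

```python
import numpy as np


def condensate_indices(scores, window=4, top_k=4, anchor=0):
    """Sorted indices of the anchor, the last `window` keys, and the top-k keys."""
    n = scores.shape[0]
    idx = {anchor}
    idx.update(range(max(0, n - window), n))  # local window before the query
    k = min(top_k, n)
    if k > 0:
        # argpartition places the k largest scores in the last k slots
        idx.update(np.argpartition(scores, -k)[-k:].tolist())
    return np.array(sorted(idx))


def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


def full_attention(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V


def condensate_attention(q, K, V, window=4, top_k=4):
    scores = K @ q / np.sqrt(q.shape[0])
    c = condensate_indices(scores, window, top_k)
    return softmax(scores[c]) @ V[c]  # softmax restricted to the condensate set


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.normal(size=(256, 32))
    V = rng.normal(size=(256, 32))
    q = rng.normal(size=32)
    diff = np.abs(full_attention(q, K, V) - condensate_attention(q, K, V, 16, 16))
    print("max abs difference on random (untrained) data:", diff.max())
```

When the window covers the whole sequence the two functions agree trivially. For smaller sets, how closely they agree depends on how concentrated the attention mass is. The random data above is not a trained model, so it says nothing about the paper's claims.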

Paper Structure (34 sections, 3 theorems, 5 equations, 1 figure, 10 tables, 1 algorithm)

Introduction
The Core Observation
Implications
Prior Work and Our Contribution
Summary of Contributions
The Condensate Theorem
Definitions
The Theorem (Empirical Law)
Empirical Validation
Methodology
Exact Token Equivalence (GPT-2)
Cross-Architecture Coverage (RoPE + GQA)
Attention Mass Distribution
Scaling with Sequence Length
Extreme Scaling: $O(N)$ vs $O(N^2)$
...and 19 more sections

Key Result

Theorem 1

For trained autoregressive language models, attention is effectively sparse. There exists a set $\mathcal{C}$ with $|\mathcal{C}| \ll n$ such that the sparse attention output computed over $\mathcal{C}$ matches the full attention output for all queries tested across the GPT-2, Pythia, Qwen2, TinyLlama, and Mistral families (12 architectures, 1,500+ tokens). Consequently, $\arg\max$ predictions are identical under greedy decoding. This set $\mathcal{C}$ is identified by the union of the Anchor, the Local Window, and the dynamic Top-$k$ subset.

Figures (1)

Figure 1: Log-log plot of inference time vs sequence length. SDPA follows quadratic scaling (slope $\approx 2$), while our Sparse kernel follows linear scaling (slope $\approx 1$). At 131K tokens: 628ms vs 3.94ms = 159$\times$ speedup.
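The slopes quoted in the caption are what a least-squares line fit in log-log space would report. The sketch below shows that fit and checks the speedup arithmetic from the two reported timings. The timing arrays are synthetic (exact power laws), not measurements from the paper, and `loglog_slope` is a hypothetical helper name.

```python
import numpy as np


def loglog_slope(seq_lens, times_ms):
    """Slope of log(time) against log(sequence length) via a degree-1 fit."""
    slope, _intercept = np.polyfit(np.log(seq_lens), np.log(times_ms), 1)
    return slope


# Speedup from the two timings reported at 131K tokens.
speedup = 628.0 / 3.94

if __name__ == "__main__":
    n = np.array([1024, 4096, 16384, 65536, 131072], dtype=float)
    quadratic_ms = 1e-8 * n**2  # synthetic quadratic scaling
    linear_ms = 1e-5 * n        # synthetic linear scaling
    print("quadratic slope:", loglog_slope(n, quadratic_ms))
    print("linear slope:", loglog_slope(n, linear_ms))
    print("speedup at 131K tokens:", round(speedup, 1))
```

A slope near 2 indicates quadratic scaling and a slope near 1 indicates linear scaling, independent of the constant factor in front.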

Theorems & Definitions (5)

Definition 1: The Condensate Set
Definition 2: Sparse Attention Output
Theorem 1: Condensate Theorem
Corollary 2: The Finite Support Principle
Corollary 3: Practical Adaptive Rule
