LUCID: Attention with Preconditioned Representations

Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon

TL;DR

LUCID tackles the degradation of softmax attention in long contexts by introducing an RKHS-based preconditioner that decorrelates keys, enabling precise retrieval without sacrificing gradient flow. Derived from a quadratic retrieval objective, LUCID yields a triangular-solve preconditioner that preserves $\mathcal{O}(N^2D)$ complexity while sharpening attention. Empirically, LUCID and its PaTH-enhanced variant achieve significant gains on long-context retrieval tasks (e.g., BABILong and RULER) with modest training overhead and near-identical inference latency. The approach performs robustly across needle-in-a-haystack benchmarks, long-document reasoning, and multi-document understanding, highlighting its practical value for scalable long-context LLMs.

Abstract

Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space (RKHS), allowing the query to focus accurately on the important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for a low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models and evaluating them on contexts of up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS, and LongBench. For instance, LUCID achieves up to an 18% improvement on BABILong and a 14% improvement on RULER multi-needle performance compared to standard attention.
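
The two softmax failure modes described in the abstract can be illustrated with a toy calculation. The sketch below is not from the paper: the single "needle" logit, the logit gap of 4, and the temperatures are illustrative assumptions. It shows that a fixed logit gap buys less attention weight as the context grows, and that a low temperature restores sharpness while shrinking the softmax Jacobian.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def needle_weight(n_tokens, gap=4.0, temperature=1.0):
    """Weight on one 'needle' token whose logit exceeds all others by `gap` (toy setup)."""
    logits = np.zeros(n_tokens)
    logits[0] = gap
    return softmax(logits, temperature)[0]

def softmax_jacobian_norm(logits, temperature=1.0):
    """Frobenius norm of d softmax / d logits = (diag(a) - a a^T) / temperature."""
    a = softmax(logits, temperature)
    return np.linalg.norm((np.diag(a) - np.outer(a, a)) / temperature)

# Diffusion: the same logit gap yields a smaller needle weight in a longer context.
for n in (1_000, 10_000, 100_000):
    print(n, needle_weight(n))

# Low temperature sharpens attention but collapses the Jacobian (vanishing gradients).
logits = np.zeros(1_000)
logits[0] = 4.0
for t in (1.0, 0.1):
    print(t, needle_weight(1_000, temperature=t), softmax_jacobian_norm(logits, t))
```

In this toy setting the needle weight is $e^{g}/(e^{g} + N - 1)$ for gap $g$, so it decays roughly as $1/N$ once $N$ dominates $e^{g}$.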

Paper Structure (47 sections, 1 theorem, 26 equations, 7 figures, 7 tables, 9 algorithms)

Key Result

Theorem 1

Let ${\mathbf{o}}$ be the LUCID attention output (before multiplying by $V$): Assume $K \neq 0$ and at least one column of $\text{diag}({\mathbf{a}}) - {\mathbf{a}}{\mathbf{a}}^\top$ is not in the null-space of $K^\top$, where ${\mathbf{a}} = \text{softmax}({\mathbf{q}} K^\top / \sqrt{d})$. Then ${\partial {\mathbf{o}}}/{\partial {\mathbf{q}}} \neq 0$.
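
The assumption in Theorem 1 can be unpacked with the standard softmax Jacobian. The identity below covers the attention weights ${\mathbf{a}}$ only, treating ${\mathbf{a}}$ as a column vector and writing ${\mathbf{k}}_l$ for the $l$-th row of $K \in \mathbb{R}^{N \times d}$ (shape conventions assumed here). It is not the paper's proof, and it leaves out the additional factor contributed by the LUCID preconditioner.

$$
\frac{\partial a_j}{\partial {\mathbf{q}}} = \frac{1}{\sqrt{d}} \sum_{l} a_j\,(\delta_{jl} - a_l)\,{\mathbf{k}}_l
\quad\Longrightarrow\quad
\frac{\partial {\mathbf{a}}}{\partial {\mathbf{q}}} = \frac{1}{\sqrt{d}} \left(\text{diag}({\mathbf{a}}) - {\mathbf{a}}{\mathbf{a}}^\top\right) K
$$

Since $\text{diag}({\mathbf{a}}) - {\mathbf{a}}{\mathbf{a}}^\top$ is symmetric, the transpose of this Jacobian is $K^\top\left(\text{diag}({\mathbf{a}}) - {\mathbf{a}}{\mathbf{a}}^\top\right)/\sqrt{d}$, which is nonzero exactly when at least one column of $\text{diag}({\mathbf{a}}) - {\mathbf{a}}{\mathbf{a}}^\top$ lies outside the null-space of $K^\top$.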

Figures (7)

  • Figure 1: Top: Challenges with softmax attention. The attention entropy must lie in a narrow operating zone: high entropy leads to uniform aggregation and representation collapse, while low entropy causes vanishing gradients. Even within this zone, correlated keys create attention noise that hinders retrieval of relevant tokens (needles) from irrelevant context. The condition number $\kappa(\text{tril}(\exp(KK^\top)))$ grows with sequence length, indicating increasing key correlation. Bottom: LUCID addresses this by constructing a preconditioner $P = (M \circ \exp(KK^\top))^{-1}$ that decorrelates keys in RKHS, sharpening attention on relevant tokens. The resulting attention mechanism (right) combines standard attention weights with the preconditioner using causal masking. The product $PV = (M \circ \exp(KK^\top))^{-1}V$ is computed efficiently via torch.linalg.solve_triangular (cuBLAS TRSM kernel), exploiting the lower-triangular structure of $M \circ \exp(KK^\top)$.
  • Figure 2: Condition number $\kappa$ of the LUCID preconditioner matrix grows with sequence length. Higher $\kappa$ indicates stronger key correlations, where LUCID's correction becomes more essential.
  • Figure 3: Sequential task learning reveals the learnability-retrieval tradeoff. Left: Training loss across two phases. Both methods solve Phase 1 (self-retrieval), but only LUCID adapts to Phase 2 (cumulative averaging). Right: Off-diagonal Jacobian magnitude (log scale). Standard softmax reduces its Jacobian by ${\sim}10^3\times$ during Phase 1 to achieve sharpness, blocking gradient flow in Phase 2. LUCID maintains higher Jacobian values throughout, enabling rapid adaptation.
  • Figure 4: Performance comparison on MNIAH across varying numbers of needles and sequence lengths. Left: Standard Attention accuracy degrades sharply as task difficulty increases (more needles, longer sequences), dropping to 11.4% in the hardest configuration. Middle: LUCID Attention maintains substantially higher accuracy across all settings. Right: The difference highlights consistent improvements of 10--26%, with LUCID providing the largest gains at longer sequence lengths where Standard Attention struggles most.
  • Figure 5: Multi-needle retrieval accuracy improves with longer finetuning for LUCID. Models finetuned at 32K and 64K sequence lengths are evaluated on multi-needle tasks with contexts averaged over 2K-8K and 2K-32K. The performance gap between LUCID and Standard Attention increases from +19.8% (32K finetuning) to +47.3% (64K finetuning).
  • ...and 2 more figures
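
The mechanism described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the output is (causal softmax weights) × $P$ × $V$ with $P = (M \circ \exp(KK^\top))^{-1}$ exactly as written in the caption, takes $M$ to be the lower-triangular causal mask, uses unit-norm keys to keep $\exp(KK^\top)$ bounded, and omits any scaling or normalization the full paper may apply. The function names are hypothetical, and a plain forward substitution stands in for the torch.linalg.solve_triangular call named in the caption.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(causal, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

def forward_substitution(L, B):
    """Solve L X = B for lower-triangular L without forming L^{-1}."""
    n = L.shape[0]
    X = np.zeros(B.shape, dtype=float)
    for i in range(n):
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X

def preconditioned_attention(Q, K, V):
    """Sketch: softmax(Q K^T / sqrt(d)) @ (M ∘ exp(K K^T))^{-1} @ V (assumed form)."""
    d = Q.shape[1]
    A = causal_softmax(Q @ K.T / np.sqrt(d))
    G = np.tril(np.exp(K @ K.T))     # M ∘ exp(K K^T), lower triangular
    PV = forward_substitution(G, V)  # P V, obtained by a triangular solve
    return A @ PV

rng = np.random.default_rng(0)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys (assumption)
V = rng.standard_normal((n, d))

out = preconditioned_attention(Q, K, V)
kappa = np.linalg.cond(np.tril(np.exp(K @ K.T)))  # the quantity tracked in Figures 1 and 2
print(out.shape, kappa)
```

Solving the triangular system rather than inverting the matrix keeps the extra work at $\mathcal{O}(N^2D)$, the same order as the attention product itself.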

Theorems & Definitions (2)

  • Theorem 1: Gradient Preservation in LUCID
  • proof