LUCID: Attention with Preconditioned Representations
Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon
TL;DR
LUCID tackles the degradation of softmax attention in long contexts by introducing an RKHS-based preconditioner that decorrelates keys, enabling precise retrieval without sacrificing gradient flow. Derived from a quadratic retrieval objective, LUCID yields a preconditioner that is applied through a triangular solve, preserving the $\mathcal{O}(N^2D)$ complexity of standard attention while sharpening attention. Empirically, LUCID and its PaTH-enhanced variant achieve significant gains on long-context retrieval tasks (e.g., BABILong and RULER) with modest training overhead and near-identical inference latency. The approach performs robustly across needle-in-a-haystack benchmarks, long-document reasoning, and multi-document understanding, underscoring its practical value for scalable long-context LLMs.
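The triangular-solve structure can be illustrated with a minimal sketch. It is not the paper's exact formulation: it assumes the preconditioner is the causally masked matrix G with G[i][j] = exp(k̂_i · k̂_j − 1) for j ≤ i, where k̂ denotes a unit-normalized key (the normalization and the −1 shift are assumptions chosen so that G has a unit diagonal). The function names `key_gram`, `forward_substitution`, and `preconditioned_attention` are hypothetical. The output is P G⁻¹ V, computed by first solving G Y = V and then applying the causal softmax probabilities P to Y.

```python
import math

def causal_softmax(Q, K):
    """Causal softmax of Q K^T / sqrt(D), returned as an N x N list of lists."""
    n, d = len(Q), len(Q[0])
    P = []
    for i in range(n):
        logits = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(logits)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        P.append([x / z for x in w] + [0.0] * (n - i - 1))
    return P

def key_gram(K):
    """ASSUMED form: G[i][j] = exp(k_i_hat . k_j_hat - 1) for j <= i, else 0.

    Keys are unit-normalized (all keys must be non-zero), so G is lower
    triangular with a unit diagonal and is always invertible.
    """
    unit = []
    for k in K:
        norm = math.sqrt(sum(x * x for x in k))
        unit.append([x / norm for x in k])
    n = len(K)
    return [[math.exp(sum(a * b for a, b in zip(unit[i], unit[j])) - 1.0)
             if j <= i else 0.0
             for j in range(n)] for i in range(n)]

def forward_substitution(G, V):
    """Solve G Y = V for lower-triangular G; costs O(N^2 D), not O(N^3)."""
    n, d = len(V), len(V[0])
    Y = []
    for i in range(n):
        row = []
        for c in range(d):
            acc = V[i][c] - sum(G[i][j] * Y[j][c] for j in range(i))
            row.append(acc / G[i][i])
        Y.append(row)
    return Y

def preconditioned_attention(Q, K, V):
    """Compute P (G^{-1} V): decorrelate the values, then attend as usual."""
    P = causal_softmax(Q, K)
    Y = forward_substitution(key_gram(K), V)
    n, d = len(V), len(V[0])
    return [[sum(P[i][j] * Y[j][c] for j in range(i + 1)) for c in range(d)]
            for i in range(n)]
```

Both building G and the forward substitution cost O(N^2 D), the same order as the attention product itself, which is the sense in which a triangular solve keeps the overall complexity unchanged in this sketch.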
Abstract
Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on the important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for a low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models and evaluating them on contexts of up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS, and LongBench. For instance, LUCID achieves up to 18% improvement on BABILong and 14% improvement on RULER multi-needle performance compared to standard attention.
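The two softmax limitations described above can be seen in a toy setting, sketched below under stated assumptions: a single relevant key whose logit exceeds those of N − 1 equally scored distractors by a fixed margin. The setup, the margin of 8, the temperature of 0.25, and the function names `target_mass` and `target_mass_grad` are illustrative choices, not taken from the paper. In this setting the mass on the relevant key is p = 1 / (1 + (N − 1) e^{−margin/τ}), and its derivative with respect to the margin is p (1 − p) / τ.

```python
import math

def target_mass(n_keys, margin, temperature=1.0):
    """Softmax mass on one key that beats n_keys - 1 equal distractors by `margin`."""
    return 1.0 / (1.0 + (n_keys - 1) * math.exp(-margin / temperature))

def target_mass_grad(n_keys, margin, temperature=1.0):
    """Derivative of target_mass with respect to margin: p * (1 - p) / temperature."""
    p = target_mass(n_keys, margin, temperature)
    return p * (1.0 - p) / temperature

# Toy values (assumptions): margin 8, temperatures 1.0 and 0.25.
for n in (1_000, 10_000, 100_000):
    print(
        f"N={n:>7}  "
        f"mass(T=1.0)={target_mass(n, 8.0):.4f}  "
        f"mass(T=0.25)={target_mass(n, 8.0, 0.25):.6f}  "
        f"grad(T=0.25)={target_mass_grad(n, 8.0, 0.25):.2e}"
    )
```

At temperature 1 the mass on the relevant key shrinks as N grows, which is the diffusion effect. At the lower temperature the mass stays close to 1, but the gradient with respect to the margin becomes very small because the factor (1 − p) collapses, which is the learnability problem that a low temperature introduces.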
