ReAttention: Training-Free Infinite Context with Finite Attention Scope
Xiaoran Liu, Ruixiao Li, Qipeng Guo, Zhigeng Liu, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu
TL;DR
ReAttention addresses the context-length bottleneck of Transformer-based LLMs by introducing a training-free cache-selection stage that precedes standard self-attention. It first performs position-agnostic top-$k$ selection over the KV cache and then applies position-aware self-attention to the concatenated, reduced cache. This yields an infinite context with a fixed attention window, and a Triton-accelerated kernel preserves compatibility with existing accelerators. Empirical evaluation across multiple models and benchmarks shows that ReAttention matches or surpasses full attention on long-context tasks, significantly outperforms StreamingLLM, and extrapolates in practice up to $1\mathrm{M}$ tokens (and beyond in some cases) without retraining. The work also presents a detailed efficiency analysis and hyper-parameter study, highlighting strong memory savings and speed-ups that make long-context LLM deployment more feasible in real-world settings.
Abstract
The long-context capability of Large Language Models (LLMs) has seen significant breakthroughs, but the maximum context length supported in length extrapolation remains a critical bottleneck limiting their practical applications. The context-length constraint in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture semantic relationships within infinitely long contexts given the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach that enables LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope, given sufficient memory resources. ReAttention performs position-agnostic top-$k$ attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M, and even expand the context length of LLaMA3.2-3B-chat by 128$\times$ to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve efficient extrapolation without additional overhead. The code is available at https://github.com/OpenMOSS/ReAttention.
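The two-stage idea described above (position-agnostic top-$k$ selection over the KV cache, followed by position-aware self-attention on the reduced cache) can be illustrated with a minimal single-head NumPy sketch. This is not the released implementation: the function names, the recent-token window `n_local`, the use of rotary position embeddings, and the choice to renumber positions inside the reduced window are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rope(x, positions, base=10000.0):
    """Rotary position embedding (assumed encoding) for rows of x; last dim must be even."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def reattention_step(q, k_cache, v_cache, top_k, n_local):
    """One decoding step for a single head (hypothetical helper).

    q: (d,) query; k_cache, v_cache: (n, d) cached keys/values stored
    WITHOUT positional encoding. Returns (output, kept cache indices).
    """
    n, d = k_cache.shape
    local_start = max(n - n_local, 0)

    # Stage 1: position-agnostic top-k over the cache outside the recent window.
    selected = np.empty(0, dtype=int)
    if local_start > 0 and top_k > 0:
        scores = k_cache[:local_start] @ q          # raw dot products, no positions
        k_eff = min(top_k, local_start)
        selected = np.sort(np.argpartition(-scores, k_eff - 1)[:k_eff])

    # Concatenate the selected entries with the recent window (assumed layout).
    keep = np.concatenate([selected, np.arange(local_start, n)])
    k_sel, v_sel = k_cache[keep], v_cache[keep]

    # Stage 2: position-aware attention; positions are renumbered 0..m-1 so they
    # never exceed the finite attention scope (an assumption of this sketch).
    m = keep.size
    k_rot = rope(k_sel, np.arange(m, dtype=float))
    q_rot = rope(q[None, :], np.array([float(m)]))[0]
    weights = softmax(k_rot @ q_rot / np.sqrt(d))
    return weights @ v_sel, keep
```

In this sketch the attention cost per step depends on `top_k + n_local` rather than on the full cache length, which is the sense in which the attention scope stays finite while the cache keeps growing.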
