ReAttention: Training-Free Infinite Context with Finite Attention Scope

Xiaoran Liu, Ruixiao Li, Qipeng Guo, Zhigeng Liu, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu

TL;DR

ReAttention tackles the bottleneck of infinite context in Transformer-based LLMs by introducing a training-free cache-selection stage that precedes standard self-attention. By performing position-agnostic top-k selection on a KV cache and then applying position-aware self-attention to the concatenated, reduced cache, it achieves infinite context with a fixed attention window and preserves compatibility with existing accelerators via a Triton-accelerated kernel. Empirical evaluation across multiple models and benchmarks shows ReAttention matches or surpasses full attention on long-context tasks, significantly outperforms StreamingLLM, and demonstrates practical extrapolation up to $1\mathrm{M}$ tokens (and beyond in some cases) without retraining. The work also presents a detailed efficiency analysis and hyper-parameter study, highlighting strong memory savings and speed-ups, making long-context LLM deployment more feasible in real-world settings.

Abstract

The long-context capability of Large Language Models (LLMs) has seen significant breakthroughs, but the maximum supported context length in length extrapolation remains a critical bottleneck limiting their practical applications. The constraint on context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach enabling LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs position-agnostic top-$k$ attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we also apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M and even expanding the context length of LLaMA3.2-3B-chat by 128$\times$ to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve efficient extrapolation without additional overhead. The code is available at https://github.com/OpenMOSS/ReAttention.
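
The two-stage idea described in the abstract (position-agnostic top-$k$ selection over the KV cache, followed by ordinary position-aware attention over the reduced cache) can be illustrated with a minimal NumPy sketch for a single head and a single decoding step. This is not the authors' implementation. The split of the cache into always-kept initial and recent entries plus a middle pool, the rotary-style position encoding, and every name and parameter (`select_then_attend`, `n_global`, `n_local`, `top_k`, `rope`) are assumptions made for illustration only.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary-style position encoding (assumed here as the position-aware step)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.asarray(pos, dtype=float)[..., None] * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang), x1 * np.sin(ang) + x2 * np.cos(ang)],
        axis=-1,
    )

def select_then_attend(q, keys, values, n_global=2, n_local=4, top_k=3):
    """One decoding step: choose cache entries without positions, then attend with positions."""
    n, d = keys.shape
    # Assumption: the first n_global and last n_local entries are always kept.
    fixed = np.zeros(n, dtype=bool)
    fixed[:n_global] = True
    fixed[max(n - n_local, 0):] = True
    middle = np.flatnonzero(~fixed)

    # Stage 1: position-agnostic scoring, i.e. raw dot products with no position encoding.
    scores = keys[middle] @ q
    k = min(top_k, middle.size)
    picked = middle[np.argsort(scores)[::-1][:k]]
    selected = np.sort(np.concatenate([np.flatnonzero(fixed), picked]))

    # Stage 2: position-aware attention over the reduced cache. Positions are
    # re-assigned as 0..m-1, so they stay bounded however long the full cache is.
    m = selected.size
    k_rot = rope(keys[selected], np.arange(m))
    q_rot = rope(q, m - 1)  # assumption: the query sits at the newest position
    logits = k_rot @ q_rot / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[selected], selected

# Usage: a cache of 64 entries is reduced to at most 2 + 4 + 3 attended entries.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
values = rng.normal(size=(64, 8))
query = rng.normal(size=8)
output, kept = select_then_attend(query, keys, values)
```

Because the attended set has a fixed maximum size, the position indices seen by the second stage never exceed that size, which is the property the abstract relies on to avoid length extrapolation.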

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Overview of ReAttention.
  • Figure 2: Results of existing mainstream LLMs enhanced with ReAttention, including LLaMA3-8B-8K and Mistral-v0.3-7B-32K, on Needle-In-A-Haystack [needle_in_a_haystack], as implemented in OpenCompass [2023opencompass].
  • Figure 3: Results of Multi-Needle-In-A-Haystack [reid2024gemini] and of Single NIAH at longer context lengths, as implemented in OpenCompass [2023opencompass].
  • Figure 4: Overview of the kernel fusion in our customized top-$k$ attention kernel. The performance measurements reflect the execution time of the corresponding kernel functions, with an input length of 8K for Llama3.1-8B inference.
  • Figure 4: The performance of Dynamic NTK, InfLLM, and ReAttention on the RULER benchmark at 8K and 16K context lengths. S3 and MK3 are short for NIAH-Single3 and NIAH-Multikey3, respectively.
  • ...and 7 more figures