
Pay Attention to What You Need

Yifei Gao, Shaohong Chen, Lei Wang, Ruiting Dai, Ziyun Zhang, Kerui Ren, Jiaji Wu, Jun Cheng

TL;DR

This paper tackles the challenge of long-context understanding in large language models without resorting to costly retraining. It introduces Scaled ReAttention (SRA), an inference-time, plug-and-play technique that discards low-impact attention weights and reallocates their influence toward informative tokens (Hidden Gems), trading a small stability cost for notable gains in retrieval and long-text comprehension. Through comprehensive experiments on retrieval, summarization, and open benchmarks across multiple RoPE-based models, SRA demonstrates robust improvements with minimal or no fine-tuning, highlighting its practical value for industry deployment. The work also provides guidelines for configuring SRA and discusses limitations related to variability and inference speed, arguing that the gains in understanding and retrieval substantially outweigh these drawbacks in many applications.
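The discard-and-reallocate idea described above can be illustrated with a toy sketch on a single attention row. This is not the authors' implementation. The function name `eliminate_and_redistribute`, the `threshold` default, the caller-supplied `gem_idx`, and the choice to redistribute the freed mass proportionally so that the row sum is preserved are all assumptions made for illustration. How SRA actually selects Hidden Gems and how strongly it scales them is not specified in this summary.

```python
import numpy as np

def eliminate_and_redistribute(attn_row, gem_idx, threshold=0.01):
    """Toy sketch (hypothetical helper, not the paper's algorithm).

    Zeroes attention weights below `threshold` and hands the removed mass
    to the positions in `gem_idx`, proportionally to their current weight.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    gem_idx = np.asarray(gem_idx, dtype=int)
    if gem_idx.size == 0:
        # Nothing to redistribute to: leave the row untouched.
        return attn_row.copy()

    low = attn_row < threshold
    low[gem_idx] = False          # never eliminate the chosen targets
    freed = attn_row[low].sum()   # total mass removed from low-impact tokens

    out = attn_row.copy()
    out[low] = 0.0
    gem_mass = out[gem_idx].sum()
    if gem_mass > 0:
        # Assumption: split the freed mass in proportion to existing weights.
        out[gem_idx] += freed * out[gem_idx] / gem_mass
    else:
        out[gem_idx] += freed / gem_idx.size
    return out

row = [0.60, 0.005, 0.03, 0.005, 0.02, 0.34]   # already sums to 1
new_row = eliminate_and_redistribute(row, gem_idx=[2, 4])
```

In this toy row, the two weights of 0.005 are removed and their combined mass is added to positions 2 and 4, so the row still sums to 1. A real inference-time variant would apply this per head to the post-softmax attention matrix.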

Abstract

Although large language models (LLMs) have achieved significant success in natural language processing, they still struggle with long-context comprehension. Traditional approaches to mitigating this issue typically rely on fine-tuning or retraining, which is both resource-intensive and challenging to deploy in lightweight industrial settings. In this paper, we investigate whether long-context comprehension can be improved without any additional resources. Through an in-depth study of the attention mechanism in LLMs, we propose a method called Scaled ReAttention (SRA) that strengthens LLMs' ability to interpret and retrieve information by strategically manipulating their attention scores during inference. Through extensive experiments, we demonstrate that integrating SRA significantly boosts LLMs' performance on a variety of downstream tasks, highlighting its practical potential for enhancing language understanding without incurring the overhead of traditional training.

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Characteristics of attention in LLaMA-3-8B: (a) The sparsity level of attention in each layer, with the sparsity threshold set at 0.001 on text length 2048. (b) The perplexity of WikiText2 on text length 2048 after attention elimination. (c) Averaged performance on downstream tasks (ARC, PIQA, HellaSwag, WinoGrande) after attention elimination. Even with 25% of attention weights eliminated (threshold 2e-2), the performance remains nearly unchanged. The black dashed line represents the original performance.
  • Figure 2: Performance degradation of LLaMA-2-7B-Chat after attention elimination on five LongBench tasks. Tokens with attention scores exceeding 0.05 are classified as Linchpins, those with scores in the 0.01–0.05 range (depending on their position) as Context Fillers or Hidden Gems, and those below 0.01 as Small Potatoes.
  • Figure 3: RoPE upper boundary alongside its averaged counterpart, shown at intervals of 100 word indices.
  • Figure 4: (a) Normalized distribution of $\mathbf{A}$. (b) Normalized distribution of $\mathbf{D} \mathbf{P} \mathbf{A}$. The goal is to identify the Hidden Gems (blue boxes), which show high similarity at distant positions.
  • Figure 5: Overall Pipeline. SRA first identifies heads where the inter/outer loop contains Hidden Gems and then extracts them for attention elimination. The eliminated attention scores will be amplified (Scaled) and redistributed to these Hidden Gems (ReAttention).
  • ...and 2 more figures
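
The thresholds quoted in the captions of Figures 1 and 2 can be made concrete with a small sketch. It assumes that "sparsity" means the fraction of attention weights below the threshold, and it leaves the 0.01–0.05 band as a single `filler_or_gem` label because the position rule that separates Context Fillers from Hidden Gems is not given in the captions. The function names and label strings are hypothetical.

```python
import numpy as np

def sparsity(attn, threshold=1e-3):
    """Fraction of the given attention weights below `threshold`.

    Assumption: this is what the Figure 1 caption calls the sparsity level.
    """
    attn = np.asarray(attn, dtype=float)
    return float((attn < threshold).mean())

def classify(attn_row, hi=0.05, lo=0.01):
    """Label each weight using the Figure 2 thresholds.

    Scores above `hi` are Linchpins, scores below `lo` are Small Potatoes.
    The band in between is left ambiguous on purpose (position rule unknown).
    """
    attn_row = np.asarray(attn_row, dtype=float)
    labels = np.full(attn_row.shape, "filler_or_gem", dtype=object)
    labels[attn_row > hi] = "linchpin"
    labels[attn_row < lo] = "small_potato"
    return labels

row = [0.60, 0.005, 0.03, 0.005, 0.02, 0.34]
labels = classify(row)
frac_small = sparsity(row, threshold=0.01)
```

For this toy row, two of the six weights fall below 0.01, two exceed 0.05, and the remaining two land in the ambiguous middle band.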