Pay Attention to What You Need
Yifei Gao, Shaohong Chen, Lei Wang, Ruiting Dai, Ziyun Zhang, Kerui Ren, Jiaji Wu, Jun Cheng
TL;DR
This paper tackles the challenge of long-context understanding in large language models without resorting to costly retraining. It introduces Scaled ReAttention (SRA), an inference-time, plug-and-play technique that discards low-impact attention weights and reallocates their influence toward informative tokens (Hidden Gems), trading a small stability cost for notable gains in retrieval and long-text comprehension. Through comprehensive experiments on retrieval, summarization, and open benchmarks across multiple RoPE-based models, SRA demonstrates robust improvements with minimal or no fine-tuning, highlighting its practical value for industry deployment. The work also provides guidelines for configuring SRA and discusses limitations related to variability and inference speed, arguing that the gains in understanding and retrieval substantially outweigh these drawbacks in many applications.
Abstract
Although large language models (LLMs) have achieved significant success in natural language processing, they still struggle with long-context comprehension. Traditional approaches to mitigating this issue typically rely on fine-tuning or retraining, both of which are resource-intensive and challenging to deploy in lightweight industrial settings. In this paper, we investigate whether long-context comprehension can be improved without any additional training resources. Through an in-depth study of the attention mechanism in LLMs, we propose a method called Scaled ReAttention (SRA) to strengthen LLMs' ability to interpret and retrieve information by strategically manipulating their attention scores during inference. Through extensive experiments, we demonstrate that integrating SRA significantly boosts LLMs' performance on a variety of downstream tasks, highlighting its practical potential for enhancing language understanding without incurring the overhead of traditional training.
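The general idea summarized above, dropping low-impact attention weights at inference time and shifting their mass toward the remaining tokens, can be illustrated with a small sketch. This is not the SRA algorithm itself: this section does not specify how SRA selects or rescales weights, so the fixed `threshold`, the renormalization step, and the function names `softmax` and `reattend` are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reattend(scores, threshold=0.05):
    """Hypothetical illustration, not the paper's rule.

    Zero out attention weights below `threshold` (an assumed cutoff),
    then renormalize the surviving weights so they sum to 1 again.
    The discarded mass is thereby reallocated to the kept tokens.
    """
    weights = softmax(scores)
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Every weight fell below the cutoff; leave the row unchanged.
        return weights
    return [w / total for w in kept]


if __name__ == "__main__":
    row = [4.0, 0.0, 0.0, 3.0]  # raw scores for one query over four keys
    print("before:", [round(w, 3) for w in softmax(row)])
    print("after: ", [round(w, 3) for w in reattend(row)])
```

In this toy row the two low-scoring keys fall below the assumed cutoff and are discarded, so the two remaining keys each receive a larger share of the attention than they had under plain softmax.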
