Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang

TL;DR

This paper examines how RoPE extensions enable long-context processing in decoder-only LLMs, analyzing their inner workings from an attention perspective. It evaluates three extensions (Position Interpolation (PI), YaRN, and NTK-Aware interpolation) on perplexity and Needle-in-a-Haystack tasks across LLaMa variants. The key findings are that keeping attention patterns close to those at the pretrained length improves extrapolation, that large attention uncertainty correlates with retrieval errors, and that longer continual pretraining reduces uncertainty and improves extrapolation (NTK up to $128K$, PI/YaRN around $62K$). The work offers practical guidance for deploying RoPE extensions in long-context settings and motivates further validation across architectures using attention-centric diagnostics such as entropy and distribution similarity.
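To make the difference between two of the named extensions concrete, here is a minimal sketch of how PI and NTK-Aware interpolation are commonly formulated. It is not the paper's code: the function names, the head dimension of 128, the base of 10000, and the scale factor of 4 are illustrative assumptions, and YaRN is left out because it adds a per-frequency ramp and an attention temperature that this summary does not describe.

```python
import math


def rope_inv_freq(dim, base=10000.0):
    """Vanilla RoPE rotation frequencies, one per dimension pair: base^(-2i/dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def pi_angles(position, dim, scale, base=10000.0):
    """Position Interpolation: shrink the position index by the scale factor,
    so every frequency sees positions inside the pretrained range."""
    return [(position / scale) * f for f in rope_inv_freq(dim, base)]


def ntk_angles(position, dim, scale, base=10000.0):
    """NTK-Aware interpolation (common formulation, assumed here): keep the
    position index and enlarge the base by scale^(dim / (dim - 2))."""
    new_base = base * scale ** (dim / (dim - 2))
    return [position * f for f in rope_inv_freq(dim, new_base)]


if __name__ == "__main__":
    dim, scale, position = 128, 4.0, 8192  # hypothetical settings
    pi = pi_angles(position, dim, scale)
    ntk = ntk_angles(position, dim, scale)
    # Highest-frequency pair: PI compresses it, NTK-Aware leaves it untouched.
    print("highest frequency:", pi[0], ntk[0])
    # Lowest-frequency pair: both methods agree (up to rounding).
    print("lowest frequency: ", pi[-1], ntk[-1])
```

Under these assumptions, PI rescales every frequency uniformly, while the NTK-Aware variant leaves the fastest-rotating pair unchanged and matches PI only at the slowest-rotating pair.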

Abstract

Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate RoPE trained on comparatively short texts to far longer texts. Considerable effort has been devoted to improving extrapolation by extending the RoPE formulation; however, few of these works have attempted to explain their inner workings comprehensively. In this paper, we aim to offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Keeping attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.

Paper Structure (11 sections, 1 equation, 7 figures, 1 table)


Figures (7)

  • Figure 1: Perplexity on Proof-pile (lower is better).
  • Figure 2: Attention distributions of RoPE, PI, YaRN, and NTK methods on 2K and 8K sequences. The red line represents the mean attention scores across all heads, layers, and examples. The other lines indicate the attention scores for each head in each layer.
  • Figure 3: Attention distributions of RoPE, PI, YaRN, and NTK methods on 2K and 8K sequences on MiniMA-2-3B.
  • Figure 4: Performance comparison for the Needle-in-a-Haystack Test. The x-axis represents the length of the document, while the y-axis indicates the depth percentage, showing the needle's position within the document. For instance, a position of 50% signifies that the needle is placed in the middle of the document. A red cell indicates that the model fails to recall the information in the needle, whereas a green cell indicates success. A white dashed line denotes the model's continual pretraining length. Each value in the cells is the mean attention entropy, with higher values reflecting more dispersed attention.
  • Figure 5: Performance comparison for the Needle-in-a-Haystack Test on MiniMA-2-3B.
  • ...and 2 more figures
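
The Figure 4 caption reports a mean attention entropy per cell, with higher values meaning more dispersed attention. As a rough illustration, the sketch below computes the Shannon entropy of a softmax attention distribution and averages it over several distributions. The exact definition used in the paper (logarithm base, which heads, layers, and query positions are averaged) is not given in this summary, so the natural logarithm and the plain average here are assumptions, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one query's attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(distributions):
    """Plain average of per-distribution entropies (assumed aggregation)."""
    return sum(attention_entropy(d) for d in distributions) / len(distributions)


if __name__ == "__main__":
    focused = softmax([8.0, 0.0, 0.0, 0.0])    # attention locked on one token
    dispersed = softmax([0.0, 0.0, 0.0, 0.0])  # attention spread uniformly
    print("focused:  ", attention_entropy(focused))
    print("dispersed:", attention_entropy(dispersed))
    print("mean:     ", mean_attention_entropy([focused, dispersed]))
```

A uniform distribution over n tokens reaches the maximum entropy of log n, so entropy values are only comparable between cells once context length is taken into account.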