Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
TL;DR
This paper examines how RoPE extensions enable long-context processing in decoder-only LLMs and aims to reveal their inner workings from an attention perspective. It evaluates three extensions, Position Interpolation (PI), YaRN, and NTK-Aware interpolation, on perplexity and Needle-in-a-Haystack tasks across LLaMa variants. Key findings are that these extensions preserve the attention patterns seen at the pretrained length, that large attention uncertainty correlates with retrieval errors, and that longer continual pretraining reduces uncertainty and improves extrapolation (NTK up to $128K$, PI/YaRN around $62K$). The work provides practical guidance for deploying RoPE extensions in long-context settings and motivates further validation across architectures using attention-centric diagnostics such as entropy and distribution similarity.
Abstract
Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate RoPE trained on comparatively short texts to far longer texts. Considerable effort has been dedicated to boosting extrapolation by extending the formulation of RoPE; however, few of these works have attempted to showcase their inner workings comprehensively. In this paper, we offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Keeping attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
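To make "attention uncertainty" concrete, the sketch below assumes it can be measured as the Shannon entropy of a single query's attention distribution; the exact definition used in the paper is not given in this excerpt, so this is an illustrative assumption. The helper names `softmax` and `attention_entropy` and the toy logits are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution.

    Assumption: entropy is used here as a stand-in for attention uncertainty.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy logits for one query attending over 8 keys (made-up values).
focused = softmax([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # one dominant key
diffuse = softmax([0.0] * 8)                                  # uniform attention

h_focused = attention_entropy(focused)
h_diffuse = attention_entropy(diffuse)

print(f"focused entropy: {h_focused:.4f} nats")
print(f"diffuse entropy: {h_diffuse:.4f} nats")
```

Under this assumption, a head that concentrates on one key has low entropy, while a head that spreads its weight evenly reaches the maximum of log(n) for n keys. The second finding would then correspond to retrieval errors appearing when entropy is high.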
