Base of RoPE Bounds Context Length
Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen
TL;DR
This work challenges the prevailing OOD-based explanations for long-context extrapolation in RoPE by establishing an absolute lower bound on the RoPE base required to support a given context length. It introduces a long-term decay property via B_{m,θ}, showing that the ability to attend to similar tokens diminishes with relative distance, which in turn bounds the effective context length. Through experiments in fine-tuning and pre-training across multiple LLMs, the authors demonstrate that the RoPE base indeed constrains practical context length and that superficially long contexts from small bases fail to retrieve long-range information. The findings offer a principled framework for understanding and designing RoPE-based long-context strategies and highlight the limitations of relying solely on perplexity as a proxy for long-context capability.
Abstract
Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has also been used to extend long-context capability, typically by adjusting the \textit{base} parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain only a superficial long-context ability under the OOD theory. We revisit the role of RoPE in LLMs, propose a novel long-term decay property, and derive that the \textit{base of RoPE bounds context length}: there is an absolute lower bound on the base value required to obtain a given context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long-context training.
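
To make the long-term decay property concrete, the following is a minimal Python sketch, not the authors' code. It assumes the standard RoPE frequencies θ_i = base^(-2i/d) and takes B_{m,θ} to be the sum of cos(m·θ_i) over the d/2 frequency pairs; neither formula is spelled out in this summary, so both are labeled assumptions here. The function names are hypothetical. Under these assumptions, the first relative distance m at which B_{m,θ} turns negative gives a rough indication of the context length a given base can support.

```python
import math


def rope_frequencies(base: float, dim: int) -> list[float]:
    """Assumed standard RoPE frequencies: theta_i = base^(-2i/dim), i = 0..dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def similarity_sum(m: int, freqs: list[float]) -> float:
    """Assumed form of B_{m,theta}: sum over frequency pairs of cos(m * theta_i)."""
    return sum(math.cos(m * f) for f in freqs)


def first_negative_distance(base: float, dim: int, max_distance: int):
    """Return the smallest relative distance m in [0, max_distance] at which
    the assumed B_{m,theta} is negative, or None if it stays non-negative."""
    freqs = rope_frequencies(base, dim)
    for m in range(max_distance + 1):
        if similarity_sum(m, freqs) < 0:
            return m
    return None


if __name__ == "__main__":
    # Illustrative values only; head dimension 128 is an assumption.
    for base in (1e4, 1e5, 1e6):
        print(base, first_negative_distance(base, dim=128, max_distance=200_000))
```

Scanning every distance up to max_distance is the simplest choice for a sketch; it costs O(max_distance · d) and could be vectorized for larger sweeps.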
