Base of RoPE Bounds Context Length

Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen

TL;DR

This work challenges the prevailing OOD-based explanations for long-context extrapolation in RoPE by establishing an absolute lower bound on the RoPE base required to support a given context length. It introduces a long-term decay property via B_{m,θ}, showing that the ability to attend to similar tokens diminishes with relative distance, which in turn bounds the effective context length. Through experiments in fine-tuning and pre-training across multiple LLMs, the authors demonstrate that the RoPE base indeed constrains practical context length and that superficially long contexts from small bases fail to retrieve long-range information. The findings offer a principled framework for understanding and designing RoPE-based long-context strategies and highlight the limitations of relying solely on perplexity as a proxy for long-context capability.

Abstract

Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has also been used to extend long-context capability, mostly by adjusting the base parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain only a superficial long-context ability when following the OOD theory. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay, from which we derive that the base of RoPE bounds context length: there is an absolute lower bound on the base value required to obtain a given context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long-context training.

Paper Structure (26 sections, 1 theorem, 13 equations, 14 figures, 5 tables)

Key Result

Theorem 1

Assuming that the components of query $q\in R^d$ and key $k\in R^d$ are independent and identically distributed, their standard deviations are denoted as $\sigma \in R$. The key $k^* = q+\epsilon$ is a token similar to the query, where $\epsilon$ is a random variable with a mean of 0. Then we have:

Figures (14)

  • Figure 1: Context length and its corresponding lower bound of RoPE's base value.
  • Figure 2: An illustration of OOD in RoPE when we extend the context length from 4k to 32k, and two solutions that avoid the OOD. We show the last dimension because it is the lowest-frequency part of RoPE and suffers most from OOD in extrapolation. (a) For a 4k context-length model with a base value of 1e4, when we extend the context length to 32k without changing the base value, the context range from 4k to 32k is OOD for RoPE (red area in the figure). (b) OOD can be avoided with a small base value such as 500 [liu2024scaling], since the full period has been fitted during the fine-tuning stage. (c) We set the base to $b \cdot s^{\frac{d}{d-2}}$, following NTK [peng2023yarn]. The blue line denotes the pre-training stage (base=1e4) and the red dashed line denotes the fine-tuning stage (base=$b \cdot s^{\frac{d}{d-2}}$). We can observe that the RoPE rotation angle of the extended positions is in-distribution.
  • Figure 3: The superficial long-context capability obtained by avoiding OOD with a smaller base. Following the recent work [liu2024scaling], we fine-tune Llama2-7B with a small base (500) to a context length of 32k.
  • Figure 4: The upper bound of attention score with respect to the relative distance.
  • Figure 5: The ability to attend more to similar tokens than random tokens.
  • ...and 9 more figures
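
The three cases in the Figure 2 caption can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code. It assumes the usual RoPE frequency convention $\theta_i = \text{base}^{-2i/d}$, a per-head dimension of $d = 128$, and reads "4k" and "32k" as 4096 and 32768 tokens. None of these values appear in the caption, and the helper names `rope_theta` and `ntk_base` are hypothetical.

```python
import math

HEAD_DIM = 128                  # assumption: per-head dimension d (not given in the caption)
LAST_PAIR = HEAD_DIM // 2 - 1   # index of the lowest-frequency dimension pair
TRAIN_LEN = 4096                # "4k", read here as 4096 tokens
TARGET_LEN = 32768              # "32k", read here as 32768 tokens
PRETRAIN_BASE = 1e4


def rope_theta(base: float, pair_index: int, head_dim: int = HEAD_DIM) -> float:
    """Per-position rotation step of one dimension pair: base^(-2i/d)."""
    return base ** (-2.0 * pair_index / head_dim)


def ntk_base(base: float, scale: float, head_dim: int = HEAD_DIM) -> float:
    """Base from panel (c) of the caption: b * s^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


# Largest angle the last dimension reaches during 4k pre-training.
trained_max = TRAIN_LEN * rope_theta(PRETRAIN_BASE, LAST_PAIR)

# (a) Same base, longer context: angles above trained_max were never seen.
angle_a = TARGET_LEN * rope_theta(PRETRAIN_BASE, LAST_PAIR)

# (b) Small base: period of the last dimension, measured in tokens.
period_b = 2 * math.pi / rope_theta(500, LAST_PAIR)

# (c) NTK base with s = 32k / 4k: angle reached at the new maximum position.
scale = TARGET_LEN / TRAIN_LEN
angle_c = TARGET_LEN * rope_theta(ntk_base(PRETRAIN_BASE, scale), LAST_PAIR)

print(f"trained max angle (4k, base 1e4): {trained_max:.4f} rad")
print(f"(a) angle at 32k, base 1e4:       {angle_a:.4f} rad")
print(f"(b) period with base 500:         {period_b:.1f} tokens")
print(f"(c) angle at 32k, NTK base:       {angle_c:.4f} rad")
```

Under these assumptions, case (a) reaches angles well beyond anything seen in pre-training, case (b) completes a full period within the context window, and case (c) reaches the same maximum angle as pre-training, because the exponent $\frac{d}{d-2}$ cancels against the last dimension's exponent $-\frac{d-2}{d}$ and leaves a plain division by $s$.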

Theorems & Definitions (1)

  • Theorem 1