CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang

TL;DR

CoPE addresses the core challenge of long-context generalization in RoPE-based LLMs by identifying low-frequency RoPE components as the common source of both OOD extrapolation and semantic decay. It introduces a soft clipping mechanism that gradually attenuates these low frequencies, preventing spectral leakage and preserving long-range semantic signals. The method is a plug-in replacement for RoPE and works in tandem with existing long-context training recipes like ABF and YaRN. Empirical results on HELMET up to 256k context show consistent gains across real-world tasks and benchmarks, establishing CoPE as a scalable, practical enhancement for length generalization in LLMs.

Abstract

Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping of the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents the spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.

Paper Structure (24 sections, 2 theorems, 18 equations, 4 figures, 6 tables)

Key Result

Theorem 3.2

Assume query $\mathbf{q} \in \mathbb{R}^d$ and key $\mathbf{k} \in \mathbb{R}^d$ have distance $\Delta t$ and i.i.d. components with standard deviation $\sigma$. Let $\mathbf{k}^{\prime}=\mathbf{q}+\mathbf{\epsilon}$ denote a similar key to the query, where $\mathbf{\epsilon}$ is a zero-mean perturbation. [...] where $\sum_{i=0}^{d/2-1} \cos(\Delta t \theta_i)$ decays as $\Delta t$ increases.
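
The sum named in the statement, $\sum_{i=0}^{d/2-1} \cos(\Delta t \theta_i)$, can be evaluated numerically. The sketch below assumes the standard RoPE schedule $\theta_i = b^{-2i/d}$ with illustrative values $d = 128$ and $b = 10000$. Neither value is taken from this page, and the function names are hypothetical.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base**(-2i/d), i = 0..d/2-1.

    Assumption: the usual RoPE schedule; d and base are illustrative values.
    """
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def cosine_sum(delta_t, thetas):
    """Sum of cos(delta_t * theta_i) over all frequency pairs."""
    return sum(math.cos(delta_t * theta) for theta in thetas)

thetas = rope_frequencies(128)
for delta_t in (0, 1, 10, 100, 1000):
    # Print the sum at increasing relative distances.
    print(delta_t, round(cosine_sum(delta_t, thetas), 3))
```

At $\Delta t = 0$ every term equals 1, so the sum is $d/2$. At larger distances the terms fall out of phase and the sum shrinks, though not necessarily monotonically from one distance to the next.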

Figures (4)

  • Figure 1: Performance comparison between CoPE and RoPE. With a simple soft clipping strategy, CoPE effectively improves RoPE's performance both within the training range and during extrapolation. The training context length here is $64$k.
  • Figure 2: (a) Visualization of RoPE frequencies. Low-frequency components in higher dimensions possess longer periods. The region shaded in red marks where the period exceeds the pre-training context window, leading to OOD extrapolation. (b) Spectral comparison. Unlike RoPE, which keeps unstable low frequencies (blue), or Hard Clipping, which causes an abrupt cut-off and spectral leakage, CoPE implements a soft decay strategy starting from the clipping onset, simultaneously eliminating OOD outliers and refining semantic signals.
  • Figure 3: Long-term decay of semantic attention. As relative distance increases, the model’s ability to prefer semantically similar tokens over random ones diminishes. Applying soft clipping to the low-frequency components (CoPE) effectively alleviates this decay, preserving semantic information over long contexts.
  • Figure 4: Ringing artifacts caused by hard clipping. Applying hard clipping directly to the low-frequency components introduces an abrupt spectral cutoff, which causes spectral leakage and manifests as long-range oscillatory ringing in the attention signal (Gibbs phenomenon).
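
The shaded region described for Figure 2(a) can be located with simple arithmetic: a frequency pair with angular frequency $\theta_i$ has period $2\pi/\theta_i$, and any pair whose period exceeds the pre-training window never completes a full rotation during training. The following is a minimal sketch, assuming the standard RoPE schedule $\theta_i = b^{-2i/d}$. The values $d = 128$, $b = 10000$, and a 4096-token window are illustrative assumptions rather than settings reported on this page, and the function names are hypothetical.

```python
import math

def rope_periods(d, base=10000.0):
    """Period (in tokens) of each RoPE frequency pair: 2*pi / theta_i.

    Assumption: theta_i = base**(-2i/d), the usual RoPE schedule.
    """
    return [2.0 * math.pi * base ** (2.0 * i / d) for i in range(d // 2)]

def first_ood_pair(d, train_len, base=10000.0):
    """Index of the first frequency pair whose period exceeds train_len.

    Returns None when every pair completes at least one full period
    within the training window.
    """
    for i, period in enumerate(rope_periods(d, base)):
        if period > train_len:
            return i
    return None

d, train_len = 128, 4096  # illustrative values, not from the source
onset = first_ood_pair(d, train_len)
print(f"pairs {onset}..{d // 2 - 1} have periods longer than {train_len} tokens")
```

Because periods grow geometrically with the pair index, every pair after the first out-of-range one is also out of range, so a single onset index is enough to describe the region.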

Theorems & Definitions (5)

  • Definition 3.1: Critical Dimensions in Extrapolation
  • Theorem 3.2: Long-term Decay of Semantic Attention
  • Theorem 4.1: Spectral Leakage from Hard Clipping
  • proof
  • proof
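
The link between an abrupt spectral cutoff and ringing (Theorem 4.1, Figure 4) is the classical Gibbs phenomenon, which can be reproduced without any attention mechanism. The sketch below is a generic signal-processing illustration, not the paper's construction: it sums the Fourier series of a unit square wave once with a hard cutoff and once with a linear Fejér taper. The Fejér taper is used here only as a familiar soft window; the taper CoPE actually applies is not described on this page. All function names are hypothetical.

```python
import math

def square_wave_partial_sum(x, n_max, weight):
    """Weighted partial Fourier sum of a unit square wave (odd harmonics only)."""
    total = 0.0
    for k in range(1, n_max + 1, 2):
        total += weight(k, n_max) * math.sin(k * x) / k
    return 4.0 / math.pi * total

def hard_cutoff(k, n_max):
    """Keep every harmonic up to n_max at full strength, drop the rest."""
    return 1.0

def fejer_taper(k, n_max):
    """Linear (Fejer) taper: weights fall smoothly toward zero at the cutoff."""
    return 1.0 - k / (n_max + 1)

def peak(weight, n_max=99, samples=2000):
    """Largest value of the partial sum on a grid over (0, pi)."""
    xs = [math.pi * j / samples for j in range(1, samples)]
    return max(square_wave_partial_sum(x, n_max, weight) for x in xs)

print("hard cutoff peak:", round(peak(hard_cutoff), 3))
print("Fejer taper peak:", round(peak(fejer_taper), 3))
```

The hard cutoff overshoots the square wave's amplitude of 1 near the discontinuity, while the tapered sum stays within it at the cost of a more gradual transition. This is the same trade-off that motivates a soft decay over an abrupt clip.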