Frayed RoPE and Long Inputs: A Geometric Perspective

Davis Wertheimer, Aozhong Zhang, Derrick Liu, Penghang Yin, Naigang Wang

Abstract

Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models, which, while effective, causes performance breakdown when input length exceeds training length. Prior analyses assert (rightly) that long inputs cause channels to rotate "out of distribution," but it is not clear how extra rotation relates to or causes pathological behavior. Through empirical and theoretical analysis we advance a unified geometric understanding of attention behavior with RoPE. We find that attention induces tight clustering of separated key and query latent point clouds, allowing for creation of sink tokens: placeholders that allow attention heads to avoid token mixing when not required. RoPE applied to longer inputs damages this key/query cluster separation, producing pathological behavior by inhibiting sink token functionality. From this geometric perspective, we propose RoPE-ID (In Distribution), a straightforward modification that allows attention layers to generalize to longer inputs out of the box: apply RoPE with high frequency to a subset of channels. We demonstrate the effectiveness of RoPE-ID for extended inputs using 1B and 3B parameter Transformers on the LongBench and RULER information retrieval benchmarks.

Paper Structure (25 sections, 6 theorems, 20 equations, 42 figures, 8 tables)

This paper contains 25 sections, 6 theorems, 20 equations, 42 figures, 8 tables.

Key Result

Lemma 1

Assume the key/query matrix $\boldsymbol{X} = \boldsymbol{u}\boldsymbol{v}^\top$, where $\boldsymbol{u}\in\mathbb{R}^n$, $\boldsymbol{v}\in\mathbb{R}^d$ with $\|\boldsymbol{v}\|_2 = 1$. Then as the inference sequence length $n\to \infty$, under mild conditions on $\boldsymbol{u}$, applying RoPE to $

Figures (42)

  • Figure 1: A 2D diagram of our observed latent geometry. Left: Keys/queries cluster tightly into opposing point clouds with negative dot products. The sink token has low norm and thus the greatest dot product. A matched key/query pair $q,k$ aligns orthogonally, letting its dot product approach and exceed the sink's. Right: RoPE on long inputs makes keys/queries disperse and overlap, causing spurious alignment. The sink token stops functioning.
  • Figure 2: Effect of RoPE across context length on pairwise angular distances within heads for Llama3-8B, Gemma-7B and OLMo-7B.
  • Figure 3: 2D PCA projections of Llama3 representations under different context lengths and RoPE settings (3rd key head of layer 21 and its queries). RoPE at long contexts destroys cluster separation, and increases stable rank (in parentheses) with sequence length, consistent with Theorem 1.
  • Figure 4: Left: Histogram across layers and heads showing the percentage of variance (relative to origin) explained by the first principal component of latent key/query clusters. Middle: Ratio of first singular value (FSV) after and before RoPE. Blue lines plot individual key/query heads, red plots the average trend. RoPE shrinks the FSV, and the shrinkage accelerates beyond the training length, producing cluster dispersal. Right: Mean stable rank of key clusters by input length, showing monotonic increase with RoPE, indicating cluster dispersal. Error bars denote standard deviation.
  • Figure 5: Left: Key $\ell_2$ norm as a function of position in sequence. Sink token is consistently small. Right: Keys have low dot product against subsequent queries in expectation, except for the first and most recent tokens. Scores are normalized by the highest value, in this case always the sink.
  • ...and 37 more figures
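
The stable-rank growth described in the captions of Figures 3 and 4 can be illustrated with a small numerical sketch. It uses the rank-one setup from Lemma 1, $\boldsymbol{X} = \boldsymbol{u}\boldsymbol{v}^\top$ with $\|\boldsymbol{v}\|_2 = 1$, and compares stable rank before and after RoPE at two sequence lengths. Several choices here are assumptions made for illustration and are not taken from the paper: stable rank is computed as $\|\boldsymbol{X}\|_F^2 / \|\boldsymbol{X}\|_2^2$, $\boldsymbol{u}$ is all ones, $\boldsymbol{v}$ is uniform, the head dimension is 64, the RoPE base is 10000, and adjacent channels are paired for rotation. The helper names are hypothetical.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each adjacent channel pair of x (shape (n, d), d even) by position * frequency.

    Assumption: interleaved pairing (0,1), (2,3), ... and frequencies base**(-2i/d).
    """
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)          # shape (d/2,)
    angles = np.arange(n)[:, None] * freqs[None, :]    # shape (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def stable_rank(x):
    """Squared Frobenius norm over squared spectral norm (assumed definition)."""
    return np.linalg.norm(x, "fro") ** 2 / np.linalg.norm(x, 2) ** 2

def rank_one_stable_ranks(n, d=64):
    """Return (stable rank before RoPE, stable rank after RoPE) for X = u v^T."""
    u = np.ones(n)                  # illustrative choice of u
    v = np.ones(d) / np.sqrt(d)     # unit-norm v, as in Lemma 1
    x = np.outer(u, v)
    return stable_rank(x), stable_rank(rope(x))

if __name__ == "__main__":
    for n in (64, 4096):
        before, after = rank_one_stable_ranks(n)
        print(f"n={n}: stable rank before RoPE = {before:.3f}, after RoPE = {after:.3f}")
```

Because each rotation preserves row norms, the Frobenius norm is unchanged by RoPE in this sketch, so any rise in stable rank comes from a smaller top singular value, which is the dispersal effect the captions describe.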

Theorems & Definitions (10)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Lemma 1: formal
  • Remark 1
  • proof : Proof of Lemma 1
  • Lemma 2
  • proof : Proof of Lemma 2
  • Theorem 1: formal
  • proof : Proof of Theorem 1