Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković

TL;DR

The study challenges the conventional view that RoPE’s utility stems from decay of attention with distance, showing instead that different RoPE frequencies serve distinct roles: high frequencies enable robust positional heads, while low frequencies form semantic channels that are not robust over very long contexts. The authors provide theoretical constructions and empirical evidence from Gemma 7B (and corroborating work with Llama3.1 8B) to demonstrate these mechanisms, and they introduce p-RoPE, a practical modification that removes a fraction of the lowest frequencies to improve long-context performance and perplexity. They further show that NoPE cannot replicate the positional-head constructions RoPE enables, and they offer insights into how base wavelength adjustments and frequency filtering influence long-context generalization. Overall, the work advances a mechanistic, frequency-aware understanding of RoPE that can guide future scaling and long-context modeling of LLMs.
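As a rough illustration of the frequency filtering behind p-RoPE, the sketch below builds the usual RoPE frequency schedule and then disables the lowest frequencies. The function names, the `keep_fraction` parameter, the base of 10,000, and the choice to implement "removal" by setting a frequency to zero (so that chunk is never rotated) are assumptions made for this example, not details confirmed by the summary above.

```python
def rope_frequencies(head_dim, base=10_000.0):
    """Standard RoPE angular frequencies, one per 2-dimensional chunk.

    Index 0 is the highest frequency; the last index is the lowest.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * k / head_dim) for k in range(head_dim // 2)]


def p_rope_frequencies(head_dim, keep_fraction, base=10_000.0):
    """Keep the highest `keep_fraction` of frequencies and zero out the rest.

    Assumption: a zeroed frequency means the chunk is left unrotated, so it
    carries no positional signal. keep_fraction=1.0 gives plain RoPE and
    keep_fraction=0.0 gives no rotation at all.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    freqs = rope_frequencies(head_dim, base)
    n_keep = int(keep_fraction * len(freqs))  # truncates towards zero
    return [f if k < n_keep else 0.0 for k, f in enumerate(freqs)]
```

For example, `p_rope_frequencies(16, 0.75)` keeps the six highest of the eight frequencies and sets the two lowest to zero.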

Abstract

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used in LLMs today is Rotary Positional Encodings (RoPE), which rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some of the highlighted issues and improves performance. We believe that this work represents an interesting step towards better understanding PEs in LLMs, which holds crucial value for scaling LLMs to large sizes and context lengths.
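To make the rotation concrete, here is a minimal sketch of RoPE applied to a single query and key. It assumes the common convention of pairing adjacent coordinates into 2-dimensional chunks and a base of 10,000; the helper names are hypothetical and some implementations pair coordinates differently. The point it illustrates is that the resulting attention logit depends only on the difference between the two positions.

```python
import math


def rope_rotate(vec, position, base=10_000.0):
    """Rotate each 2-dimensional chunk of `vec` by position * theta_k."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)  # chunk 0 has the highest frequency
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_logit(query, key, q_pos, k_pos, base=10_000.0):
    """Dot product of the rotated query and key (the pre-softmax logit)."""
    q = rope_rotate(query, q_pos, base)
    k = rope_rotate(key, k_pos, base)
    return sum(a * b for a, b in zip(q, k))
```

Shifting both positions by the same offset, for instance comparing positions (5, 2) with (105, 102), leaves `rope_logit` unchanged up to floating-point error, because each chunk sees only the angle difference.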

Paper Structure

This paper contains 31 sections, 8 theorems, 16 equations, 19 figures, 4 tables.

Key Result

Proposition 3.1

Given any query $\mathbf{q}$ and any relative distance $r \in \mathbb{Z}$, we can find a key $\mathbf{k}$ such that the softmax value is largest at distance $r$ with RoPE.
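One natural way to see why such a key exists is to take the query itself and rotate it by $r$ steps. The sketch below does this and checks where the logit peaks. It is an illustration under stated assumptions (adjacent-coordinate chunks, base 10,000, hypothetical helper names, and a query whose highest-frequency chunk is non-zero) and is not necessarily the construction used in the paper's proof. Since softmax is monotone in the logits, the largest logit also gives the largest softmax value.

```python
import math


def rotate_chunks(vec, steps, base=10_000.0):
    """Rotate each 2-dimensional chunk of `vec` by steps * theta_k."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        angle = steps * base ** (-2.0 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def key_peaking_at(query, r, base=10_000.0):
    """Candidate key: the query rotated forward by r steps."""
    return rotate_chunks(query, r, base)


def logits_by_distance(query, key, max_distance, base=10_000.0):
    """Logit for a query at position i and a key at position i - dist.

    With RoPE this depends only on dist and equals <query, R(-dist) key>.
    """
    return [
        sum(a * b for a, b in zip(query, rotate_chunks(key, -dist, base)))
        for dist in range(max_distance + 1)
    ]
```

With `key = key_peaking_at(query, r)`, the logit at distance `dist` is the sum over chunks of the squared chunk norm times `cos((r - dist) * theta_k)`, which reaches the full squared norm of the query exactly when `dist == r`.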

Figures (19)

  • Figure 1: Depiction of our construction which allows Transformers to obtain positional attention heads using RoPE -- zooming in on a single RoPE frequency for clarity. On the left we depict key and query vectors for each position $i$, where keys are all identical, and queries are just a rotated version of the key, in a way that matches one of RoPE's highest frequencies. The center depicts how keys get rotated by RoPE, making the key at $i-1$ perfectly align with the query. Due to the high frequency of the rotation, all other keys will lead to a smaller attention weight. On the right we show the resulting attention weights, which in this case form an off-diagonal positional attention pattern. See the section on high frequencies for more details.
  • Figure 2: RoPE applied to either (a) constant 'all-ones' queries and keys or (b) queries and keys with entries sampled IID from a Gaussian. The decay of the activations is present when the queries and keys are constant all-ones vectors, but not when they are Gaussian random vectors.
  • Figure 3: $2$-norm plotted over $2$-dimensional chunks of queries (a) and keys (b) for each layer in Gemma 7B, corresponding to different RoPE frequencies. A mean is taken over $10$ different Shakespeare quotes and the $16$ attention heads at each layer.
  • Figure 4: $2$-norm plotted over $2$-dimensional chunks of queries (a) and keys (b) for each attention head of the first layer in Gemma 7B, corresponding to different RoPE frequencies. A mean is taken over $10$ different Shakespeare quotes. The high-frequency behaviour in Head 5 and Head 8 is explained in the section on high frequencies.
  • Figure 5: Examples of purely positional heads occurring in Gemma 7B, showcasing a diagonal head at the last layer (a) and a previous-token head at the first layer (b).
  • ...and 14 more figures
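The comparison described in the Figure 2 caption can be approximated with a small experiment. The sketch below is not the authors' code: the head dimension, base, seed, and the choice to draw a fresh Gaussian key at every distance are assumptions made for illustration. It computes the RoPE logit as a function of relative distance for all-ones vectors and for Gaussian vectors.

```python
import math
import random


def rope_rotate(vec, position, base=10_000.0):
    """Rotate each 2-dimensional chunk of `vec` by position * theta_k."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        angle = position * base ** (-2.0 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def logit_at_distance(query, key, distance, base=10_000.0):
    """Logit with the query at position `distance` and the key at position 0."""
    q = rope_rotate(query, distance, base)
    return sum(a * b for a, b in zip(q, key))


def ones_profile(head_dim, max_distance):
    """Logits over distance when query and key are both all-ones."""
    ones = [1.0] * head_dim
    return [logit_at_distance(ones, ones, d) for d in range(max_distance)]


def gaussian_profile(head_dim, max_distance, seed=0):
    """Logits over distance for a Gaussian query and a fresh Gaussian key each time."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
    profile = []
    for d in range(max_distance):
        key = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
        profile.append(logit_at_distance(query, key, d))
    return profile
```

For the all-ones case each chunk contributes `2 * cos(distance * theta_k)`, so the logit starts at `head_dim` and shrinks on average as more frequencies drift out of phase. Plotting `ones_profile(128, 256)` next to `gaussian_profile(128, 256)` gives a quick way to compare the two settings from the caption.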

Theorems & Definitions (15)

  • Proposition 3.1: RoPE can be maximal at arbitrary distance
  • Proposition 3.2: Gaussian queries and keys do not decay.
  • Definition 5.1
  • Proposition 5.2
  • Theorem 5.3
  • Theorem 6.1
  • Lemma A.1
  • proof
  • proof
  • proof
  • ...and 5 more