DoPE: Denoising Rotary Position Embedding

Jing Xiong; Liyang Fan; Hui Shen; Zunhai Su; Min Yang; Lingpeng Kong; Ngai Wong

TL;DR

DoPE identifies an inherent weakness in Rotary Position Embedding (RoPE): low-frequency RoPE bands can produce coherent spikes that cause attention sinks and degrade length extrapolation. Treating the attention map with RoPE as a noisy feature map, DoPE uses truncated matrix entropy to locate low-entropy (near-rank-one) bands and denoises them selectively at the head level, either by removing them (DoPE-by-parts) or by replacing them with stochastic or masked components (DoPE-by-all, DoPE-by-Gaussian). The approach is training-free and uses a parameter-free Gaussian reparameterization to stabilize extrapolation. It reports gains on needle-in-a-haystack and many-shot in-context learning (ICL) tasks at context lengths up to 64K tokens. The work offers both a theoretical account of attention sinks and simple denoising strategies that restore balanced attention and improve long-context performance in RoPE-based transformers.

Abstract

Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Project page: https://The-physical-picture-of-LLMs.github.io
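
To make the idea of a low-entropy, near-rank-one frequency band concrete, here is a minimal Python sketch. It is not the authors' implementation. The summary above does not give the exact definition of truncated matrix entropy, so the sketch assumes one plausible form: the Shannon entropy of the top-r eigenvalues of a band's Gram matrix, normalized to sum to one. The function names, the toy key vector, and the two frequencies are hypothetical and chosen only for illustration.

```python
import numpy as np

def truncated_matrix_entropy(X, r):
    """Entropy of the top-r normalized eigenvalues of X^T X (assumed definition)."""
    gram = X.T @ X
    eig = np.linalg.eigvalsh(gram)            # real eigenvalues, ascending order
    eig = np.clip(eig, 0.0, None)[::-1][:r]   # keep the r largest, clip round-off
    p = eig / eig.sum()                       # normalize to a distribution
    p = p[p > 0]                              # avoid log(0)
    return float(-(p * np.log(p)).sum())

def rope_band(vec2, theta, positions):
    """Rotate one 2-D band of a vector by the angle position * theta."""
    ang = positions * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x, y = vec2
    return np.stack([x * cos - y * sin, x * sin + y * cos], axis=1)

positions = np.arange(4096)
vec = np.array([1.0, 0.0])                    # toy key component, same at every position

# Illustrative frequencies, not taken from the paper.
low = rope_band(vec, 1e-4, positions)         # rotates by well under one radian in total
high = rope_band(vec, 1.0, positions)         # wraps around the circle many times

h_low = truncated_matrix_entropy(low, r=2)    # close to 0: rows are nearly collinear
h_high = truncated_matrix_entropy(high, r=2)  # close to ln 2: rows spread isotropically
print(h_low, h_high)
```

Under these assumptions the low-frequency band has entropy near zero because its rotated vectors stay almost aligned across the whole context, which is the kind of coherent, near-rank-one structure the abstract associates with outlier bands. A detector in this spirit would flag bands whose entropy falls below some threshold; how DoPE sets that threshold and what it substitutes for flagged bands is not specified here.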
