The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
Ting-Rui Chiang, Dani Yogatama
TL;DR
The paper investigates whether Rotary Position Embedding (RoPE) causes dimension inefficiency in attention heads for long-distance retrieval. Through a controlled toy experiment and analyses of three real LLMs, it shows that RoPE tends to reduce the utility of the initial dimensions in attention heads, while later dimensions remain more informative, particularly for long-context retrieval. It also links retrieval heads to dimension utilization: masking early dimensions often has little impact, whereas masking later dimensions harms performance, especially for distant documents. The findings suggest that RoPE may waste computation on dimensions the model barely uses, and they motivate exploring RoPE alternatives or selective application of RoPE to improve long-context efficiency and performance.
Abstract
The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLMs). It rotates dimensions in the query and key vectors by different angles according to their positions in the input sequence. In long-context modeling, the range of positions can be very wide, and thus RoPE rotates some dimensions through a wide range of angles. We hypothesize that this wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses of three LLMs also indicate that these dimensions do not help LLMs perform long-context question answering.
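To illustrate the rotation described above, here is a minimal NumPy sketch. It assumes the commonly used RoPE formulation, in which dimension pair i of a d-dimensional head is rotated by the angle position × base^(-2i/d), with base 10000 and interleaved pairs (2i, 2i+1). These specifics, the head dimension of 64, the context length of 4096, and the function names `rope_angles` and `apply_rope` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angle (radians) per position and per 2-D dimension pair.

    Assumed convention: pair i rotates at frequency base ** (-2 * i / head_dim),
    so the first pairs rotate fastest and the last pairs rotate slowest.
    """
    pair_idx = np.arange(head_dim // 2)
    inv_freq = base ** (-2.0 * pair_idx / head_dim)
    return np.outer(positions, inv_freq)  # shape: (num_positions, head_dim // 2)

def apply_rope(x, positions, base=10000.0):
    """Rotate each pair (2i, 2i+1) of x by its position-dependent angle.

    x has shape (num_positions, head_dim); head_dim must be even.
    """
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Compare how far the first and last dimension pairs rotate across a
# hypothetical 4096-token context with a hypothetical head dimension of 64.
positions = np.arange(4096)
angles = rope_angles(positions, head_dim=64)
span_first = angles[-1, 0] - angles[0, 0]    # fastest-rotating pair
span_last = angles[-1, -1] - angles[0, -1]   # slowest-rotating pair
print(f"angle span, first pair: {span_first:.1f} rad")
print(f"angle span, last pair:  {span_last:.4f} rad")
```

Under these assumptions, the first dimension pairs sweep through many full turns across the context while the last pairs barely move, which is the disparity in rotation range that the hypothesis concerns.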
