The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval

Ting-Rui Chiang, Dani Yogatama

TL;DR

The paper investigates whether Rotary Position Embedding (RoPE) causes dimension inefficiency in attention heads for long-distance retrieval. Through a controlled toy experiment and analyses of three real LLMs, it shows RoPE tends to reduce the utility of the initial dimensions in attention heads, while later dimensions remain more informative, particularly for long-context retrieval. It also links retrieval heads to dimension utilization, demonstrating that masking early dimensions often has little impact, whereas removing later dimensions harms performance, especially for distant documents. The findings suggest RoPE may waste computation on unused dimensions and motivate exploring RoPE alternatives or selective application to improve long-context efficiency and performance.

Abstract

The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLMs). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. In long-context modeling, positions span a wide range, and thus RoPE rotates some dimensions through a wide range of angles. We hypothesize that this wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses of three LLMs also indicate that these dimensions do not help LLMs perform long-context question answering.
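
To make the rotation described in the abstract concrete, here is a minimal NumPy sketch of RoPE applied to a single attention head. It is an illustration under stated assumptions, not the authors' code: the function names `rope_angles` and `apply_rope` are hypothetical, the frequency schedule `base ** (-2i / head_dim)` with `base = 10000` is the commonly used default, and dimensions are paired as `(2i, 2i+1)`; real implementations may pair dimensions differently.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angle for every (position, dimension pair).

    Assumes the common schedule theta_i = base ** (-2i / head_dim).
    Returns an array of shape (len(positions), head_dim // 2).
    """
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate each pair (2i, 2i+1) of x by its position-dependent angle.

    x: array of shape (seq_len, head_dim), head_dim even.
    positions: array of shape (seq_len,).
    """
    x = np.asarray(x, dtype=float)
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2-D rotation applied independently to each pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Under this schedule the first pair has the largest per-position angle, so over a long context it sweeps through many full turns, while the last pair rotates slowly. The query-key dot product after rotation depends only on the relative offset between the two positions.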

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Analysis of the attention-head dimensions of the models (with and without RoPE) in §\ref{sec:toy}.
  • Figure 2: The average importance of each dimension in the query vectors of the attention heads, measured by the L1 norm of rows in the query weight matrices (left) and by the utility score in §\ref{sec:utilization} (right). We visualize all heads in Figure \ref{fig:all-weight} and Figure \ref{fig:all-mask}.
  • Figure 3: The relationship between the retrieval-head indicator score (x-axis) and the utility score of the first 16 or last dimensions (y-axis). Each dot represents an attention head. Lighter dots represent deeper layers. The red line is the linear regression fit.
  • Figure 4: The relationship between the L1 norm of rows in the query projection matrices (x-axis) and the utility scores of the first or last dimensions (y-axis). Each dot represents an attention head. Lighter dots represent deeper layers. The red line is the linear regression fit.
  • Figure 5: Visualizing the L1 norm of the rows in the query projection matrices.
  • ...and 1 more figure
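
The weight-based importance measure named in the Figure 2 caption (the L1 norm of rows in the query weight matrices) can be sketched as follows. This is a hypothetical illustration, not the paper's code: `query_dim_l1` is an invented name, and the sketch assumes the query projection is stored with one row per output dimension and with heads laid out contiguously, i.e. shape `(n_heads * head_dim, d_model)`.

```python
import numpy as np

def query_dim_l1(w_q, n_heads):
    """Per-head, per-dimension L1 norm of the query projection rows.

    w_q: array of shape (n_heads * head_dim, d_model); this layout
         (one row per output dimension, heads contiguous) is an assumption.
    Returns an array of shape (n_heads, head_dim).
    """
    head_dim = w_q.shape[0] // n_heads
    per_head = w_q.reshape(n_heads, head_dim, -1)
    # Sum of absolute weights feeding each query dimension.
    return np.abs(per_head).sum(axis=-1)
```

A dimension whose row has a small L1 norm contributes little to the query vector regardless of the input. Which indices count as the "first" dimensions depends on how a given implementation pairs dimensions for rotation, so the mapping from row index to rotation frequency should be checked per model.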