Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu

TL;DR

RoPE++ observes that RoPE discards the imaginary component of its complex-valued attention score and reintroduces that component as a parallel imaginary attention head. By sharing QKV projections, RoPE++ preserves a unified absolute–relative position embedding and offers two configurations: RoPE++_EH, which keeps the number of attention heads equal to RoPE while halving the KV cache, and RoPE++_EC, which keeps the KV cache equal while doubling the attention heads. Theoretical analysis and extensive experiments show improved long-context dependency modeling, with RoPE++ outperforming vanilla RoPE on long-context benchmarks; RoPE++_EH additionally lowers memory cost and accelerates decoding. The code is open source, and RoPE++ remains compatible with existing long-context techniques.

Abstract

Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
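
To make the real/imaginary split concrete, the following is a minimal NumPy sketch, not the authors' implementation. It treats each adjacent pair of vector entries as one complex number, rotates it by a position-dependent angle, and forms the complex-valued score between a query and a key. The real part is the score that standard RoPE attention uses; the imaginary part is the component the abstract describes as discarded. The function names, the frequency base of 10000, the pairing of adjacent dimensions, and the sign convention of the imaginary part are all assumptions made for illustration; how RoPE++ turns the imaginary part into an attention head is defined in the paper and the linked repository, not here.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """View a real vector of even length d as d/2 complex numbers
    and rotate pair j by the angle pos * theta_j (assumed frequencies)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # one frequency per pair
    xc = x[0::2] + 1j * x[1::2]                     # adjacent entries -> complex
    return xc * np.exp(1j * pos * theta)

def complex_score(q, k, m, n):
    """Complex-valued score between a query at position m and a key at position n."""
    return np.sum(rope_rotate(q, m) * np.conj(rope_rotate(k, n)))

def to_real(xc):
    """Interleave real and imaginary parts back into a real vector."""
    out = np.empty(2 * xc.shape[0])
    out[0::2], out[1::2] = xc.real, xc.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

s = complex_score(q, k, m=5, n=2)
real_score = s.real  # the score standard RoPE attention keeps
imag_score = s.imag  # the component that is normally thrown away

# The real part equals the ordinary dot product of the rotated real vectors.
dot_score = to_real(rope_rotate(q, 5)) @ to_real(rope_rotate(k, 2))

# Both parts depend only on the relative offset m - n.
shifted = complex_score(q, k, m=105, n=102)
```

In this sketch `real_score` matches `dot_score`, and `shifted` matches `s`, so the imaginary part carries relative-position information in the same way the real part does.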

Paper Structure

This paper contains 24 sections, 13 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of RoPE++. RoPE retains only the real part of the complex-valued attention score, whereas RoPE++ exploits the full complex representation to produce both real and imaginary attention. The real attention exhibits stronger semantic locality, while the imaginary attention preferentially captures long-context dependencies. RoPE++ combines the two, yielding multiple advantages.
  • Figure 2: Visualization of GQA with different RoPE schemes. RoPE++$_\text{EC}$ uses the same KV cache as RoPE with twice as many attention heads, while RoPE++$_\text{EH}$ uses the same number of attention heads with half the KV cache.
  • Figure 3: Comparison of trained position embedding intervals between RoPE and RoPE++. The area within the dashed line represents trained relative positions, and the area beyond it corresponds to length extrapolation. Learned position embedding values are colored yellow and unlearned values gray.
  • Figure 4: Efficiency comparison between RoPE and RoPE++$_\text{EH}$ in the 376M and 776M models. RoPE++$_\text{EH}$ lowers memory cost and accelerates decoding, and the margin widens as context grows.
  • Figure 5: Attention-score patterns and long-context performance in 376M and 776M RoPE++ models. Imaginary heads attend markedly to global information, whereas real heads focus more on local context. Adding Gaussian noise to imaginary attention degrades long-context performance more severely, over 8 points, than the same perturbation applied to real attention.
  • ...and 1 more figure