Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

TL;DR

This work tackles cross-modal positional biases that arise when applying Rotary Position Embedding to vision-language models. It introduces Circle-RoPE, which decouples text and image positional encodings by projecting image token indices onto a circle in 3D space, forming a cone-like structure that preserves intra-image spatial relations; a Per-Token Distance metric is used to quantify cross-modal independence. The method comprises Circular Image Token Projection (CIP) with coordinate centralization, mixed-angle circular mapping, and target plane rotation, plus Alternating Geometry Encoding (AGE) to alternate Circle-RoPE and M-RoPE across layers; it also encodes temporal order in multi-image sequences. Empirical results show reduced cross-modal bias, preserved spatial information, and improved performance across LVLMs and architectures, with attention visualizations illustrating more focused cross-modal reasoning. The code is available at https://github.com/lose4578/CircleRoPE, and the method is presented as a practical, modular enhancement for multimodal understanding.

Abstract

Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a ring that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image ring), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.
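
The cone geometry described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the text axis is the x axis, that the ring is centered at the origin in the y-z plane, and that image tokens are spread evenly around it. The helper names `ring_points` and `text_point` are hypothetical and chosen only for illustration.

```python
import math

def ring_points(n, radius=1.0):
    """Place n image-token indices evenly on a circle in the y-z plane.

    Assumption for illustration: the circle is centered at the origin and
    lies in the plane x = 0, so it is orthogonal to the x axis.
    """
    return [
        (0.0,
         radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

def text_point(t):
    """Place text-token index t on the x axis (the assumed text axis)."""
    return (float(t), 0.0, 0.0)

image = ring_points(6, radius=2.0)

# Each text token is the apex of a cone: its distance to every image
# token is sqrt(t**2 + radius**2), independent of the position on the ring.
for t in (1, 5, 16):
    dists = [math.dist(text_point(t), p) for p in image]
    spread = max(dists) - min(dists)
    print(f"text index {t}: distance {dists[0]:.4f}, spread below 1e-9: {spread < 1e-9}")

# Distances between image tokens still differ, so intra-image structure
# is not collapsed by the projection.
print(f"adjacent: {math.dist(image[0], image[1]):.4f}, "
      f"opposite: {math.dist(image[0], image[3]):.4f}")
```

Under these assumptions, the distance from a text token depends only on its index and the ring radius, while image tokens remain distinguishable from one another by their position on the ring.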

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Text (yellow) and image (green) tokens are labeled with their position indices under different RoPE-based encoding schemes. (a) hard embedding method, which encodes image tokens by their flattened sequence; (b) unordered embedding method, assigning the same index to all image tokens within an image; (c) spatial embedding method, where image tokens are indexed according to their 2D positions in the original image; (d) our method, which remaps image token index onto a circle orthogonal to the text index direction, achieving a decoupled encoding.
  • Figure 2: A VQA example where image and text tokens are sequentially concatenated. The image token at index 8 exhibits the smallest RoPE distance to all text tokens, despite semantically closer image tokens being located elsewhere. The text token at index 16 exhibits varying distances to the six image patches that correspond to the same semantic content. These misalignments highlight how conventional RoPE methods introduce unintended relative positional biases.
  • Figure 3: Transformation steps for Circular Image Token Index Projection (CIP): (i) coordinate centralization, (ii) mixed-angle circular mapping, and (iii) target plane rotation, as described in the CIP section. For clarity, the starting points of text and image indices are aligned in the figure, preserving their relative positional distances without loss of generality. (a) Initial M-RoPE [wang2024qwen2] index in step (i); (b) 2D circular structure after steps (i) and (ii); (c) 3D circular structure after step (iii); (d) Grid-index angle (GA) in step (ii); (e) Spatial-origin angle (SA) in step (ii).
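
The three baseline indexing schemes named in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: the function names, the `start` offset, and the exact way 2D positions are offset are assumptions, not details taken from the paper or its code.

```python
def flattened_indices(start, height, width):
    """(a) Hard embedding: image tokens follow their flattened raster order."""
    return [start + r * width + c for r in range(height) for c in range(width)]

def unordered_indices(start, height, width):
    """(b) Unordered embedding: every token of the image shares one index."""
    return [start] * (height * width)

def spatial_indices(start, height, width):
    """(c) Spatial embedding: each token is indexed by its (row, column).

    Assumption: both axes are offset by the same start index.
    """
    return [(start + r, start + c) for r in range(height) for c in range(width)]

# A 2 x 3 image grid that begins after four text tokens (indices 0..3).
print(flattened_indices(4, 2, 3))  # [4, 5, 6, 7, 8, 9]
print(unordered_indices(4, 2, 3))  # [4, 4, 4, 4, 4, 4]
print(spatial_indices(4, 2, 3))    # [(4, 4), (4, 5), (4, 6), (5, 4), (5, 5), (5, 6)]
```

In scheme (a) later image tokens sit closer to the following text than earlier ones, and in scheme (b) all spatial layout is lost; scheme (d) in the caption is the circular remapping that the paper proposes instead.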