Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
TL;DR
This work tackles cross-modal positional biases that arise when applying Rotary Position Embedding to vision-language models. It introduces Circle-RoPE, which decouples text and image positional encodings by projecting image token indices onto a circle in 3D space, forming a cone-like structure that preserves intra-image spatial relations; a Per-Token Distance metric is used to quantify cross-modal independence. The method comprises Circular Image Token Projection (CIP) with coordinate centralization, mixed-angle circular mapping, and target plane rotation, plus Alternating Geometry Encoding (AGE) to alternate Circle-RoPE and M-RoPE across layers; it also encodes temporal order in multi-image sequences. Empirical results show reduced cross-modal bias, preserved spatial information, and improved performance across LVLMs and architectures, with attention visualizations illustrating more focused cross-modal reasoning. The code is released at https://github.com/lose4578/CircleRoPE, offering a practical, modular enhancement for robust multimodal understanding.
Abstract
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a ring that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image ring), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.
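The cone geometry described in the abstract can be checked numerically with a small sketch. This is an illustration only, not the released implementation. It assumes the text axis is the z-axis and that image tokens are spaced evenly on a ring in the plane z = 0 centered on that axis. The helper names `ring_positions`, `text_position`, and `distances` are hypothetical, and the actual method's coordinate centralization, mixed-angle mapping, and plane rotation are not reproduced here.

```python
import math

def ring_positions(num_image_tokens, radius):
    """Place image token indices evenly on a circle in the plane z = 0.

    Assumption for illustration: the ring is centered on the text axis
    (the z-axis) and lies in a plane orthogonal to it.
    """
    positions = []
    for k in range(num_image_tokens):
        angle = 2.0 * math.pi * k / num_image_tokens
        positions.append((radius * math.cos(angle), radius * math.sin(angle), 0.0))
    return positions

def text_position(index):
    """Place a text token index on the linear text axis (assumed: the z-axis)."""
    return (0.0, 0.0, float(index))

def distances(text_pos, ring):
    """Euclidean distance from one text token to every image token on the ring."""
    return [math.dist(text_pos, p) for p in ring]

if __name__ == "__main__":
    ring = ring_positions(num_image_tokens=8, radius=2.0)
    for t in (1, 3, 10):
        d = distances(text_position(t), ring)
        # Each text token is a cone apex: the spread across image tokens is ~0.
        print(f"text index {t}: spread of distances = {max(d) - min(d):.2e}")
    # Intra-image structure is kept: neighbours on the ring are closer than
    # tokens on opposite sides of it.
    print("adjacent:", math.dist(ring[0], ring[1]), "opposite:", math.dist(ring[0], ring[4]))
```

Under these assumptions, a text token at height t is at distance sqrt(r^2 + t^2) from every point on a ring of radius r, so the cross-modal distance carries no information about where an image token sits in the image, while distances between image tokens still vary with their positions on the ring.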
