
VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

TL;DR

This work tackles the problem of extending Rotary Position Embedding to video by identifying four essential properties for effective spatiotemporal encoding. It introduces VideoRoPE, a 3D RoPE design with Low-frequency Temporal Allocation (LTA), Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to preserve spatio-temporal relationships and reduce temporal oscillations. A challenging V-NIAH-D task is proposed to reveal distractor sensitivity in existing RoPE variants, motivating the new design. Empirical results across long video understanding, retrieval, and hallucination benchmarks show VideoRoPE consistently outperforms prior RoPE variants, including M-RoPE, demonstrating improved robustness and long-context modeling for video-language tasks.

Abstract

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at https://github.com/Wiselnn570/VideoRoPE.
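To make the diagonal layout and adjustable temporal spacing more concrete, here is a minimal sketch of how (t, x, y) indices could be assigned to the tokens of a video block. This is an illustration built only from the abstract's description, not the paper's exact formula. The function name `video_position_ids`, the parameters `start` and `delta`, and the choice to center spatial offsets at (W-1)/2 and (H-1)/2 are all assumptions.

```python
def video_position_ids(num_frames, height, width, start=0, delta=2.0):
    """Hypothetical (t, x, y) index assignment for a block of video tokens.

    Assumptions (not taken from the paper's equations):
      - `delta` scales the step between consecutive frames, so temporal
        spacing is set independently of the spatial grid size.
      - Spatial indices are centered on the frame's temporal index, so the
        center of each frame lies on the diagonal t == x == y.
    """
    ids = []
    for f in range(num_frames):
        t = start + delta * f  # adjustable temporal spacing
        for h in range(height):
            for w in range(width):
                x = t + w - (width - 1) / 2   # symmetric horizontal offset
                y = t + h - (height - 1) / 2  # symmetric vertical offset
                ids.append((t, x, y))
    return ids


if __name__ == "__main__":
    # Two 3x3 frames that follow 5 text tokens (indices 0..4).
    for triple in video_position_ids(2, 3, 3, start=5, delta=2.0):
        print(triple)
```

In this sketch the center token of every frame has equal t, x, and y, and the remaining tokens are offset symmetrically around it. Changing `delta` moves frames further apart in index space without altering the spatial offsets within a frame.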

Paper Structure

This paper contains 22 sections, 7 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: VideoRoPE outperforms RoPE variants on benchmarks.
  • Figure 2: Left: To demonstrate the importance of frequency allocation, we build on V-NIAH (a) to present a more challenging V-NIAH-D task (b), in which similar images are inserted as distractors. Right: Compared to M-RoPE, our VideoRoPE is more robust in retrieval and is less affected by distractors. See the V-NIAH and V-NIAH-D figure in the Experiments section for details on the horizontal and vertical axes.
  • Figure 3: Attention-based frequency allocation analysis. Middle: M-RoPE's temporal dimension ($t$) is limited to local information, resulting in a diagonal layout. Bottom: VideoRoPE effectively retrieves the needle using the temporal dimension. The x and y coordinates represent the video frame number, e.g., 50 for 50 frames. For more details, see the attention analysis in the Appendix.
  • Figure 4: (a) M-RoPE (Wang et al., 2024) models temporal dependencies using the first 16 rotary angles, which exhibit higher frequencies and more pronounced oscillations. (b) In contrast, VideoRoPE models temporal dependencies using the last 16 rotary angles, characterized by significantly wider, monotonic intervals. Our frequency allocation effectively mitigates the misleading influence of distractors in V-NIAH-D. For a more detailed analysis, please refer to the supplementary explanation of the modules in the Appendix.
  • Figure 5: The position embeddings of adjacent text tokens for Vanilla RoPE (top row), and of the corresponding visual tokens in adjacent frames for M-RoPE (middle row) and our VideoRoPE (bottom row), which uses an interleaved-spatial, temporal-last design.
  • ...and 4 more figures
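
The Figure 4 caption contrasts using the first 16 rotary angles for time (higher frequencies) with using the last 16 (lower frequencies). The following sketch compares the oscillation periods of those two groups under the standard RoPE frequency schedule. The head dimension of 128 (64 rotary angles) and the base of 10000 are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math


def rotary_frequencies(head_dim=128, base=10000.0):
    """Standard RoPE angular frequencies: theta_i = base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def periods(freqs):
    """Number of position steps needed for each rotary angle to complete one turn."""
    return [2.0 * math.pi / f for f in freqs]


freqs = rotary_frequencies()

# Temporal axis on the first 16 angles (high-frequency allocation).
first_16 = periods(freqs[:16])
# Temporal axis on the last 16 angles (low-frequency allocation).
last_16 = periods(freqs[-16:])

shortest_high = min(first_16)
shortest_low = min(last_16)

if __name__ == "__main__":
    print(f"shortest period, first 16 angles: {shortest_high:.2f} positions")
    print(f"shortest period, last 16 angles:  {shortest_low:.2f} positions")
```

Under these assumed settings, the fastest angle in the first group repeats about every 2π ≈ 6.28 positions, while the fastest angle in the last group takes roughly 6283 positions to complete one turn. Over a video of a few thousand frames, the low-frequency group therefore never wraps around, which is one way to read the "wider, monotonic intervals" described in the caption.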