VRoPE: Rotary Position Embedding for Video Large Language Models

Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu

TL;DR

This work tackles the challenge of encoding spatiotemporal positions in Video-LLMs, where vanilla RoPE and prior RoPE-3D adaptations exhibit positional bias and cross-modal discontinuities. It introduces Video Rotary Position Embedding (VRoPE), which combines Symmetric Bias Mitigation and Temporal Centered Arrangement to balance spatial attention and align video frames with the textual axis without adding learnable parameters. Across multiple backbones and benchmarks, VRoPE consistently surpasses RoPE and RoPE-3D in video understanding, temporal reasoning, and long-video retrieval, and it extrapolates well to very long sequences. The result is a practical positional encoding for Video-LLMs that holds up across model scales and datasets.

Abstract

Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
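
To make the positional-bias issue more concrete, the following is a minimal sketch of how vanilla 1D RoPE sees video tokens. It is not the authors' code. It assumes frames are flattened in raster order (row by row, frame by frame), and the grid size, the toy query/key values, and the helper names `rope_rotate`, `rope_score`, and `raster_index` are hypothetical choices made only for illustration. The sketch checks two things: the RoPE attention logit depends only on the index gap between two tokens, and under raster flattening that gap differs sharply between horizontal neighbours, vertical neighbours, and the same cell in the next frame.

```python
import cmath

def rope_rotate(pairs, pos, base=10000.0):
    """Rotate each complex pair by pos * theta_i, with theta_i = base ** (-i / n_pairs)."""
    n_pairs = len(pairs)
    return [x * cmath.exp(1j * pos * base ** (-i / n_pairs))
            for i, x in enumerate(pairs)]

def rope_score(q, k, q_pos, k_pos):
    """Attention logit (real inner product) between a rotated query and key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))

def raster_index(t, h, w, height, width):
    """1D position of the token at frame t, row h, column w after raster flattening."""
    return (t * height + h) * width + w

# Hypothetical toy values, chosen only for illustration.
HEIGHT, WIDTH = 4, 6
q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1 + 0.2j]

# The logit depends only on the offset between positions (3 in both calls).
same_offset_gap = abs(rope_score(q, k, 5, 2) - rope_score(q, k, 105, 102))

# Index gaps seen by 1D RoPE for three kinds of neighbouring video tokens.
here = raster_index(0, 1, 2, HEIGHT, WIDTH)
gap_right = raster_index(0, 1, 3, HEIGHT, WIDTH) - here       # next column
gap_below = raster_index(0, 2, 2, HEIGHT, WIDTH) - here       # next row
gap_next_frame = raster_index(1, 1, 2, HEIGHT, WIDTH) - here  # same cell, next frame

print(gap_right, gap_below, gap_next_frame)  # 1 6 24
```

In this toy 4x6 grid, a token's right-hand neighbour is 1 position away, the token directly below it is 6 positions away, and the same cell in the next frame is 24 positions away. Since the logit is a function of that gap alone, equally close spatial neighbours are treated unequally, which is consistent with the row-wise attention pattern reported for RoPE in Figure 2(a).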

Paper Structure

This paper contains 39 sections, 9 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Comparison of RoPE, RoPE-3D, and our VRoPE in video positional encoding. (a) Positional Unbiasedness: RoPE and RoPE-3D exhibit spatially biased attention, particularly towards later tokens or specific frame regions, while VRoPE ensures more uniform attention. (b) Seamless Video-Text Transition: RoPE-3D causes a discontinuity when transitioning from video to text tokens, which VRoPE smooths for better cross-modal dependency modeling.
  • Figure 2: Attention weight visualization of RoPE, RoPE-3D, and VRoPE. We compute average text-to-video-frame attention weights on the VideoMME benchmark [fu2024video] (lighter color indicates higher attention). (a) RoPE exhibits row-wise attention decay within frames. (b) RoPE-3D shows a similar decay from the bottom-right to the top-left, introducing positional bias that skews attention toward spatially closer frame tokens. (c) VRoPE mitigates this bias, leading to a more balanced attention distribution.
  • Figure 3: Left: the overall architecture of a typical Video-LLM. In this work, our improvements primarily target the positional embedding component of the LLM to enhance its video understanding capability. Right: method illustration of VRoPE. (a) We first apply symmetric arrangement to mitigate positional bias in video frames. The RoPE frequencies are uniformly allocated to the four dimensions. (b) We propose to use temporal centered arrangement in video frames to form a seamless video-text transition, which enables video input of arbitrary length without causing discontinuity.
  • Figure 4: Visualization of long-video retrieval results on Video-NIAH [zhao2024needle]. VRoPE consistently achieves high accuracy across varying background lengths and needle depths, showing strong retrieval capability in long videos.
  • Figure 5: Attention weight visualization of RoPE, RoPE-3D, and VRoPE. The visualization reveals that VRoPE exhibits stronger attention activation within critical frames (highlighted by red boxes), demonstrating its accurate focus on pivotal spatiotemporal regions. In contrast, RoPE and RoPE-3D display attenuated attention responses in these corresponding areas, indicating insufficient awareness of key events. This attention misalignment consequently leads to erroneous predictions, as evidenced by their incorrect interpretations of the visual content.