VRoPE: Rotary Position Embedding for Video Large Language Models

Zikang Liu; Longteng Guo; Yepeng Tang; Tongtian Yue; Junxian Cai; Kai Ma; Qingbin Liu; Xi Chen; Jing Liu

TL;DR

This work addresses how to encode spatiotemporal positions in Video-LLMs, where vanilla RoPE and prior RoPE-3D adaptations exhibit positional bias and cross-modal discontinuities. It introduces Video Rotary Position Embedding (VRoPE), which combines Symmetric Bias Mitigation and Temporal Centered Arrangement to balance spatial attention and align video frames with the textual axis without adding learnable parameters. Across multiple backbones and diverse benchmarks, VRoPE consistently surpasses RoPE and RoPE-3D in video understanding, temporal reasoning, and long-video retrieval, and it extrapolates well to very long sequences. The authors present VRoPE as a robust, scalable, and practical positional encoding for Video-LLMs across model scales and datasets.

Abstract

Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
