VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng; Yijiang Li; Wanpeng Zhang; Hao Luo; Zihao Yue; Sipeng Zheng; Zongqing Lu

TL;DR

VideoOrion addresses the challenge of encoding rich video information into a compact, semantically meaningful token stream for LLMs. It introduces an explicit object-centric representation: a detect-segment-track pipeline yields object tokens $o_j \in \mathbb{R}^d$, which are used alongside context tokens $v_i \in \mathbb{R}^{N_v \times d}$. The two-branch architecture combines a Video-Centric Branch for global context with an Object-Centric Branch for object dynamics across frames. Training proceeds in three stages: Video-Centric pretraining, Object-Centric pretraining, and multi-modal instruction tuning. Experimental results on multiple video question answering (VQA) and video-based referring benchmarks show competitive performance, with significant gains attributed to the object tokens. Ablations support the design choices and data strategies, and attention visualizations illustrate the pipeline's interpretability.

Abstract

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Prior methods resort to downsampling the original video or aggregating visual tokens using resamplers, which leads to information loss and entangled semantics. In contrast, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens and achieves competitive results on both general video question answering and video-based referring benchmarks.
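
To make the idea of "aggregating spatial-temporal object features" concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that per-frame patch features and per-object binary mask tracks (the output of a detect-segment-track pipeline) are already available, and it uses plain masked average pooling as the aggregation. The function names `object_token` and `tokenize_objects`, the data layout, and the choice of averaging are all hypothetical; the actual aggregation and projection used by VideoOrion may differ.

```python
from typing import Dict, List

Feature = List[float]          # one d-dimensional patch feature
FrameFeatures = List[Feature]  # P patch features for a single frame
FrameMask = List[int]          # P binary entries; 1 = patch belongs to the object


def object_token(features: List[FrameFeatures], masks: List[FrameMask]) -> Feature:
    """Average every patch feature covered by one object's mask track.

    Assumption: simple masked mean pooling over space and time stands in
    for the aggregation step described in the abstract.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for frame_feats, frame_mask in zip(features, masks):
        for feat, inside in zip(frame_feats, frame_mask):
            if inside:
                for k in range(dim):
                    total[k] += feat[k]
                count += 1
    if count == 0:
        return total  # object never visible: fall back to a zero vector
    return [x / count for x in total]


def tokenize_objects(
    features: List[FrameFeatures], tracks: Dict[int, List[FrameMask]]
) -> Dict[int, Feature]:
    """Produce one d-dimensional token per tracked object."""
    return {obj_id: object_token(features, m) for obj_id, m in tracks.items()}


# Toy input: 2 frames, 3 patches per frame, feature dimension d = 2.
features = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
    [[7.0, 6.0], [9.0, 8.0], [0.0, 0.0]],
]
tracks = {
    0: [[1, 1, 0], [1, 0, 0]],  # object 0 covers three patches across both frames
    1: [[0, 0, 1], [0, 0, 0]],  # object 1 is visible only in the first frame
}
tokens = tokenize_objects(features, tracks)
```

In this sketch each object yields exactly one vector regardless of how many frames or patches it spans, which is what makes an object-token stream compact compared with keeping every patch token from every frame.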
