VideoOrion: Tokenizing Object Dynamics in Videos

Yicheng Feng, Yijiang Li, Wanpeng Zhang, Hao Luo, Zihao Yue, Sipeng Zheng, Zongqing Lu

TL;DR

VideoOrion addresses the problem of encoding rich video content into a compact, semantically meaningful token stream for LLMs. It introduces an explicit object-centric representation: a detect-segment-track pipeline yields object tokens $o_j \in \mathbb{R}^d$, which complement the context tokens $v_i \in \mathbb{R}^{N_v \times d}$ that carry general video information. The two-branch architecture combines a Video-Centric Branch for global context with an Object-Centric Branch for object dynamics across frames. Training proceeds in three stages: Video-Centric pretraining, Object-Centric pretraining, and multi-modal instruction tuning. Experiments on multiple video question answering and video-based referring benchmarks show competitive performance. Ablations attribute gains to the object tokens and support the design choices, and attention visualizations give an interpretable view of how the pipeline's tokens are used.

Abstract

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos - the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
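To make the idea of "aggregating spatial-temporal object features" concrete, the sketch below pools the patch features covered by one tracked object's masks, across all frames, into a single $d$-dimensional object token, and then appends the object tokens to the context tokens. It is a minimal illustration only: plain mean pooling, the function names `object_token` and `build_visual_sequence`, and the toy feature values are all assumptions made for this example, not the paper's implementation (which may use a different aggregation and a learned projector).

```python
from typing import List

Vector = List[float]


def object_token(frame_features: List[List[Vector]],
                 frame_masks: List[List[int]]) -> Vector:
    """Pool one tracked object's features over space and time.

    frame_features[t][p] is the d-dim feature of patch p in frame t.
    frame_masks[t][p] is 1 if the tracked object covers that patch.
    Mean pooling is an assumption chosen for illustration.
    """
    d = len(frame_features[0][0])
    total = [0.0] * d
    count = 0
    for feats, mask in zip(frame_features, frame_masks):
        for vec, covered in zip(feats, mask):
            if covered:
                for k in range(d):
                    total[k] += vec[k]
                count += 1
    if count == 0:
        return total  # object never visible: fall back to a zero vector
    return [x / count for x in total]


def build_visual_sequence(context_tokens: List[Vector],
                          object_tokens: List[Vector]) -> List[Vector]:
    """Concatenate context tokens and object tokens into one sequence.

    The ordering (context first, objects second) is a hypothetical choice.
    """
    return list(context_tokens) + list(object_tokens)


# Toy video: 2 frames, 3 patches per frame, feature dimension d = 2.
frame_features = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]],
    [[2.0, 2.0], [4.0, 0.0], [7.0, 7.0]],
]
# One mask track per object, as a tracker might produce.
mask_tracks = [
    [[1, 1, 0], [1, 1, 0]],  # object A covers patches 0 and 1 in both frames
    [[0, 0, 1], [0, 0, 1]],  # object B covers patch 2 in both frames
]
object_tokens = [object_token(frame_features, m) for m in mask_tracks]
context_tokens = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for Video-Centric Branch output
sequence = build_visual_sequence(context_tokens, object_tokens)
```

Each object contributes exactly one token regardless of how many frames or patches it spans, which is why the number of visual tokens in this sketch grows with the number of tracked objects rather than with video length.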

Paper Structure

This paper contains 26 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: With explicit modeling of object dynamics, VideoOrion can (a) grasp finer details and (b) understand object-related fine-grained details. (c) Comparison with prior encoding schemes: (1) spatially pooling the whole frame into a single token; (2) concatenating adjacent patch tokens into a single token; (3) aggregating patch tokens with the learnable queries of a Q-Former; (4) VideoOrion, whose object tokens provide disentangled semantics.
  • Figure 2: The overall architecture of VideoOrion. Two branches encode the video content into tokens: the Video-Centric Branch encodes general information as context tokens, while the Object-Centric Branch encodes the dynamics of objects in the video into a set of object tokens through the detect-segment-track pipeline. All tokens are fed together to the LLM, which integrates information from both branches and generates responses to the text inputs.
  • Figure 3: Case studies showing how VideoOrion utilizes object tokens to generate responses based on different instructions.
  • Figure 4: Examples of the detect-segment-track pipeline.
  • Figure 5: Qualitative examples of VideoOrion$+$, VideoOrion-Ref and VideoOrion-Ref-FT$+$.
  • ...and 3 more figures