Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao

TL;DR

Video-XL-Pro tackles extremely long video understanding with Reconstructive Token Compression (ReCoT), which combines a Dynamic Token Synthesizer (DTS) and Semantic-Guided Masking (SGM) to produce comprehensive yet compact video tokens. The approach adds a video-specific dataset pruning strategy and a Query-aware Selector that locates query-relevant tokens, making fine-tuning of the 3B-parameter model more efficient. Across multiple long-video benchmarks, Video-XL-Pro outperforms most 7B models trained on larger datasets, and it can process over 8K frames on a single A100 GPU. The work balances effectiveness and efficiency, offering a path toward resource-conscious long-form video understanding.

Abstract

Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLM fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet effective Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long-video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.
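
The abstract does not describe how the Query-aware Selector scores tokens, so the following is only an illustrative sketch of the general idea of locating query-relevant video tokens. It assumes, purely for illustration, that the query and each video token are already embedded as vectors of the same dimension and that relevance is measured by cosine similarity with a fixed top-k budget. The names `cosine` and `select_query_relevant` and the parameter `k` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_query_relevant(query: Sequence[float],
                          tokens: Sequence[Sequence[float]],
                          k: int) -> List[int]:
    """Return indices of the k tokens most similar to the query.

    Assumption (not from the paper): relevance is plain cosine similarity.
    Indices are returned in ascending order so the kept tokens stay in
    their original temporal order.
    """
    if k <= 0:
        return []
    scores = [cosine(query, t) for t in tokens]
    # Rank by descending score; ties are broken by the earlier index.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for a text query and four video tokens.
    query = [1.0, 0.0]
    tokens = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
    print(select_query_relevant(query, tokens, k=2))  # prints [0, 2]
```

Returning the indices in temporal order, rather than in score order, is a design choice in this sketch: it keeps the selected tokens in the sequence a language model would otherwise see them in.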

Paper Structure

This paper contains 21 sections, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Compared with SOTA video understanding MLLMs, Video-XL-Pro achieves better accuracy and greater efficiency simultaneously.
  • Figure 2: Overview of Video-XL-Pro. The top part shows reconstructive token compression (ReCoT), which generates comprehensive and compact tokens. The bottom part shows the MLLM training stage, which uses the video dataset pruning strategy and the query-aware selector to improve efficiency.
  • Figure 3: Training data distribution of uniform and variable sampling.
  • Figure 4: Results of the Needle-in-a-haystack evaluation on a single A100 80GB GPU. The x-axis represents the total number of frames in the video haystack. The y-axis shows the position where the needle image is located. Gray cells indicate out-of-memory ("OOM").
  • Figure 5: The training and inference efficiency of Video-XL-Pro.
  • ...and 2 more figures