Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

TL;DR

Dynamic-VLM tackles the need for scalable VideoLLMs by introducing a dynamic visual token compressor that adapts to video length and token budgets. It also builds a large synthetic video-text dataset to train and evaluate VideoLLMs across open-ended, multiple-choice, and multi-image QA tasks. Empirical results show state-of-the-art performance on VideoMME, MuirBench, and other benchmarks, with notable gains over baselines. The work suggests that video-focused training and flexible token representations can improve generalization while controlling compute, enabling longer videos and richer cross-modal reasoning.

Abstract

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Demonstration of previous works and our Dynamic-VLM. We use a flexible token compressor for visual content, enabling us to represent videos of different lengths with varying token counts. For short videos, we keep tokens uncompressed to provide detailed information, and for long videos, we use a high compression ratio to enhance temporal details. For the sake of simplicity, visual encoders are excluded from the illustration.
  • Figure 2: For each video, we independently extract visual tokens for each key frame using a ViT. These visual tokens are then compressed using dynamic compressors before being input to the LLM, along with timestamp text and instructions. We discuss three potential candidates for dynamic compressors.
  • Figure 3: Distribution of different tasks in our synthetic data.
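
The idea behind Figures 1 and 2 (short videos keep many tokens per frame, long videos are compressed harder) can be illustrated with a small sketch. This is not the authors' implementation: the captions mention three candidate compressors without naming them here, so average pooling is assumed purely for illustration. The function names, the 24x24 uncompressed grid, and the budget rule are all hypothetical, and each token is reduced to a single float standing in for a feature vector.

```python
import math


def grid_side_for_budget(num_frames, token_budget, max_side=24, min_side=1):
    """Pick a square per-frame token grid so all frames fit the budget.

    max_side=24 (576 tokens per frame) is an assumed uncompressed ViT grid,
    not a value taken from the paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    per_frame = max(token_budget // num_frames, 1)
    side = math.isqrt(per_frame)  # largest side with side * side <= per_frame
    return max(min_side, min(max_side, side))


def adaptive_avg_pool(grid, out_side):
    """Average-pool a square grid (list of rows) down to out_side x out_side.

    Bin i covers rows floor(i * n / out_side) up to ceil((i + 1) * n / out_side),
    the usual adaptive-pooling convention; columns are handled the same way.
    """
    n = len(grid)
    if not 1 <= out_side <= n:
        raise ValueError("out_side must be between 1 and the input side")

    def bounds(i):
        start = (i * n) // out_side
        end = ((i + 1) * n + out_side - 1) // out_side  # ceiling division
        return start, end

    pooled = []
    for i in range(out_side):
        r0, r1 = bounds(i)
        row = []
        for j in range(out_side):
            c0, c1 = bounds(j)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def compress_video(frames, token_budget):
    """Compress every frame's token grid to the side chosen for this video."""
    side = grid_side_for_budget(len(frames), token_budget, max_side=len(frames[0]))
    return [adaptive_avg_pool(frame, side) for frame in frames]
```

Under these assumptions, 8 frames with a budget of 4608 tokens keep the full 24x24 grid, while 256 frames with the same budget drop to a 4x4 grid per frame. The compression ratio is therefore set by the video length rather than fixed in advance.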