Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
TL;DR
Dynamic-VLM tackles the need for scalable VideoLLMs by introducing a dynamic visual token compressor that adapts to video length and token budgets. It also builds a large synthetic video-text dataset to train and evaluate VideoLLMs across open-ended, multiple-choice, and multi-image QA tasks. Empirical results show state-of-the-art performance on VideoMME, MuirBench, and other benchmarks, with notable gains over baselines. The work suggests that video-focused training and flexible token representations can improve generalization while controlling compute, enabling longer videos and richer cross-modal reasoning.
Abstract
The application of Large Vision-Language Models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. Recent years have seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but comparable datasets for video are still lacking. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset generated with proprietary models, using carefully designed prompts to cover a wide range of questions. We also explore a dynamic visual token compression architecture that balances computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM
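The abstract does not specify how the dynamic compressor works, so the following is only an illustrative sketch of the general idea of adapting the per-frame token count to video length under a fixed token budget. Everything here is an assumption for illustration: the function names `tokens_per_frame_side` and `adaptive_avg_pool`, the 24x24 maximum grid, and the choice of average pooling are hypothetical and are not taken from the paper or its code.

```python
import math


def tokens_per_frame_side(num_frames: int, token_budget: int,
                          max_side: int = 24, min_side: int = 1) -> int:
    """Pick a square grid side s so that num_frames * s * s fits the budget.

    Hypothetical helper: the result is clamped to [min_side, max_side], so for
    very long videos the min_side floor can exceed the nominal budget.
    """
    if num_frames <= 0 or token_budget <= 0:
        raise ValueError("num_frames and token_budget must be positive")
    per_frame = token_budget // num_frames      # tokens each frame may use
    side = math.isqrt(per_frame)                # largest s with s*s <= per_frame
    return max(min_side, min(max_side, side))


def adaptive_avg_pool(grid: list[list[float]], side: int) -> list[list[float]]:
    """Average-pool an H x W grid of scalar features down to side x side.

    Bin i covers rows floor(i*H/side) .. ceil((i+1)*H/side), and likewise for
    columns. Scalars stand in for feature vectors to keep the sketch short.
    """
    h, w = len(grid), len(grid[0])
    if not 1 <= side <= min(h, w):
        raise ValueError("side must be between 1 and min(H, W)")
    pooled = []
    for i in range(side):
        r0, r1 = (i * h) // side, -(-((i + 1) * h) // side)
        row = []
        for j in range(side):
            c0, c1 = (j * w) // side, -(-((j + 1) * w) // side)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    # A short clip keeps the full grid; a long video gets a coarser one.
    for frames in (16, 256):
        side = tokens_per_frame_side(frames, token_budget=16 * 576)
        print(frames, "frames ->", side * side, "tokens per frame")
```

Under these assumptions, a short clip keeps a fine grid per frame while a long video falls back to a coarser one, which is one simple way to trade spatial detail for temporal coverage at roughly constant compute.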
