LinVT: Empower Your Image-level Large Language Model to Understand Videos
Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
TL;DR
LinVT extends image-level LLMs to video understanding without training from scratch. It introduces a plug-and-play Linear Video Tokenizer with two components, the Spatio-temporal Visual token Refiner (SVR) and Text-conditioned Token Aggregation (TTA). Both follow two design principles: a linear transformation that preserves the original image-language alignment, and condensation of redundant video content into compact, representative video tokens. The model is trained in two stages on 2.9M video-text pairs plus video instruction data, achieving state-of-the-art results on open-ended video QA and strong performance on long-form benchmarks across six base LLMs. The approach offers a practical, resource-efficient route to multi-modal video understanding with existing image-focused LLMs.
Abstract
Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs once it has been trained on video data. To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation, to preserve the original visual-language alignment, and representative information condensation, to distill redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
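To make the linearity principle concrete, the following is a minimal sketch of one way a set of visual tokens could be condensed while keeping every output token a weighted average of the inputs. It is an illustration only, not the LinVT module itself: the function name `condense_tokens`, the random `scores` matrix (standing in for learned importance scores), and the token counts are all assumptions made for this example.

```python
import numpy as np

def condense_tokens(frame_tokens, scores):
    """Condense N visual tokens into K tokens with a linear map.

    frame_tokens: (N, D) array of visual tokens from an image encoder.
    scores:       (K, N) array of unnormalized importance scores
                  (hypothetical; a trained model would learn these).
    """
    # Softmax over the N input tokens, so each row of weights sums to 1.
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output token is a weighted average of the input tokens,
    # so the output stays in the span of the original visual tokens.
    return weights @ frame_tokens, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))  # assumed: 64 input tokens of dimension 16
scores = rng.normal(size=(8, 64))   # assumed: condense to 8 output tokens
video_tokens, weights = condense_tokens(tokens, scores)
print(video_tokens.shape)  # (8, 16)
```

Because the output is a fixed weighted average of the inputs, scaling the input tokens scales the output by the same factor. This is the sense in which such a map leaves the token space that the image-LLM was already aligned to.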
