LinVT: Empower Your Image-level Large Language Model to Understand Videos

Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao

TL;DR

LinVT extends image-level LLMs to video understanding without training from scratch. It introduces a plug-and-play Linear Video Tokenizer with two components: a spatio-temporal visual token refiner (SVR) and a text-conditioned token aggregator (TTA). The tokenizer operates linearly, which preserves the image-language alignment of the base model, and it condenses redundant video content into compact, representative video tokens. The model is trained in two stages on 2.9M video-text pairs plus video instruction data, achieving state-of-the-art results on open-ended video QA and strong performance on long-form benchmarks across six base LLMs. This approach provides a practical, resource-efficient route to robust multi-modal video understanding with existing image-focused LLMs.

Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
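
To make the linearity principle concrete, the following is a minimal sketch of one possible reading: every output token is a weighted average of the input visual tokens, so the outputs stay inside the feature space the image-LLM already understands. The function names (`softmax`, `aggregate_tokens`), the dot-product scoring, and the toy vectors are assumptions made for illustration; this summary does not specify how LinVT computes its weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tokens(visual_tokens, queries):
    """Hypothetical aggregator: each output is a convex combination of inputs.

    visual_tokens: list of n vectors (length d)
    queries:       list of m vectors (length d), e.g. derived from the text
    Returns (outputs, weights), where outputs has m vectors of length d.
    """
    dim = len(visual_tokens[0])
    outputs, all_weights = [], []
    for q in queries:
        # Assumed scoring rule: scaled dot product between query and token.
        scores = [
            sum(qi * ti for qi, ti in zip(q, tok)) / math.sqrt(dim)
            for tok in visual_tokens
        ]
        weights = softmax(scores)  # non-negative, sums to 1
        # Output token = weighted sum of the original tokens (no new projection).
        out = [
            sum(w * tok[j] for w, tok in zip(weights, visual_tokens))
            for j in range(dim)
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: 4 visual tokens condensed into 2 output tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
queries = [[1.0, 0.0], [0.0, 2.0]]
compact, weights = aggregate_tokens(visual_tokens, queries)
```

Because no learned projection is applied to the token values themselves, the number of tokens shrinks while each output remains a mixture of the original visual tokens.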

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: By being trained on video data, LinVT can endow an image-based LLM with the capability to handle video understanding tasks and achieve outstanding performance.
  • Figure 2: The framework of the LinVT-based video-LLM. The LinVT module takes visual tokens corresponding to individual frames of a video along with the user instruction as input and then generates compact and fixed-size visual tokens. By using LinVT, an image-LLM can be easily converted to a video-LLM. Firstly, the visual tokens undergo the spatio-temporal visual token refiner SVR (Sec. \ref{sec:method_1}) which produces multi-scale visual tokens. The multi-scale visual tokens are then fed to the text-conditioned token aggregator TTA (Sec. \ref{sec:method_2}). Finally, the LLM incorporates both the user instruction and the output visual tokens to provide a response for video understanding. The proposed LinVT operates linearly, enabling the preservation of knowledge from the image-LLM.
  • Figure 3: The left part represents single-scale token processing, while the right part contains three variants of multi-scale token processing in LinVT. For a fair comparison, all variants maintain the same output visual token size.
  • Figure 4: Visualization of the patches corresponding to the selected tokens in video frames. Each row corresponds to a video. The selection is achieved by the spatio-temporal significance scoring and top-$k$ selection. The red patches in the image represent the selected tokens. This token scoring and selection mechanism directs the attention of the model towards the most prominent objects, actions, or scenes within the video. LinVT-InternVL2-8B is used for this visualization.
  • Figure 5: Visualization of the patches corresponding to the selected tokens in video frames. The selection is achieved by the spatio-temporal significance scoring and top-$k$ selection. The red patches in the image denote the selected tokens. This token scoring and selection mechanism directs the model's attention towards the most prominent objects, actions, or scenes within the video. LinVT-InternVL2-8B is used for this visualization.
  • ...and 1 more figures
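
The top-$k$ selection step mentioned in the captions of Figures 4 and 5 can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the helper name `top_k_tokens` is hypothetical, and the L2 norm used as the score is only a placeholder for the spatio-temporal significance score, whose definition is not given in this summary.

```python
def top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, returned in their original order.

    tokens: list of n token vectors
    scores: list of n floats, one significance score per token
    Returns (kept_indices, kept_tokens).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank token indices by score, highest first, and keep the first k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore spatio-temporal order so downstream modules see a consistent layout.
    kept = sorted(ranked)
    return kept, [tokens[i] for i in kept]


# Toy example: 5 tokens; the L2 norm stands in for the real significance score.
tokens = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, -1.0], [0.3, 0.3]]
scores = [sum(x * x for x in t) ** 0.5 for t in tokens]
kept, selected = top_k_tokens(tokens, scores, k=2)
print(kept)  # [1, 3]
```

Selection copies the chosen tokens unchanged, so it is consistent with the caption's statement that LinVT operates linearly: the selected outputs are a subset of the inputs rather than newly projected features.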