Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Shimin Chen; Yitian Yuan; Shaoxiang Chen; Zequn Jie; Lin Ma

TL;DR

Fewer Tokens and Fewer Videos LVLM (FTFV-LVLM) tackles data and computation bottlenecks in video understanding by extending an image-based LVLM with a weighted token sampler. The approach reuses a shared ViT encoder and vision-language adapter, adding per-frame token compression and three training strategies that use as little as 10% of prior video data. Empirical results show competitive to state-of-the-art performance on both image and video benchmarks, with particular strength in temporal reasoning under limited data. The work highlights the importance of temporal-reasoning video data and diverse instructional content for cost-efficient video-LVLM development.

Abstract

Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data. Our innovative weighted token sampler significantly compresses the visual token numbers of each video frame, effectively cutting computational expenses. We also find that judiciously using just 10% of the video data, compared to prior video-LVLMs, yields impressive results during various training phases. Moreover, we delve into the influence of video instruction data in limited-resource settings, highlighting the significance of incorporating video training data that emphasizes temporal understanding to enhance model performance. The resulting Fewer Tokens and Fewer Videos LVLM (FTFV-LVLM) exhibits exceptional performance across video and image benchmarks, validating our model's design and training approaches.

Paper Structure (19 sections, 2 equations, 4 figures, 6 tables)

Introduction
Related work
Large Language Models
Large Vision-Language Models
FTFV-LLM
The Basic Image-LVLM
Model Architecture
Training Pipeline
The Extended Video-LVLM
The Weighted Token Sampler Module
Video-Incorporated Training
Experiments
Experimental Setup
Results on Image Benchmarks
Results on Video Benchmarks
...and 4 more sections

Figures (4)

Figure 1: Comparison of our proposed FTFV-LLM with existing video-LVLMs on ActivityNet-QA [caba2015activitynet] and Video-Bench [ning2023video]. The horizontal axis indicates the number of QA pairs used for video instruction tuning in each model, and the vertical axis shows the accuracy achieved on the two benchmarks. Colors distinguish the number of tokens used to represent each video frame in these models. Our FTFV-LLM, which uses fewer video tokens and less video training data, achieves leading results over previous video-LVLMs. FTFV-LLM variants with different per-frame token numbers (16 vs. 128) also perform similarly.
Figure 2: An overview of our proposed FTFV-LLM model. FTFV-LLM extends a basic image-LVLM architecture: a visual encoder first encodes each video frame, and a vision-language adapter then modulates the video tokens to align them with the LLM feature space. In addition, we propose a novel weighted token sampler module that substantially reduces the number of tokens per video frame, which lowers the computational cost of processing multiple video frames. Finally, the compressed video tokens and the text prompt tokens are fed to the LLM, which produces the response. During training, we finetune the model on a combination of video and image data; comprehensive details are available in the main paper.
Figure 3: Exploring the effects of varying video data proportions and training strategies. Here, S4-V$_{x\%}$ means that we use $x$% of the video data in the S4-V video instruction tuning stage, S3-IV$_{x\%}$ means that we use $x$% of the video data in the S3-IV stage, and S2-S3-IV$_{x\%-y\%}$ means that we incorporate $x$% of the pretraining video data and $y$% of the instruction tuning video data in the second and third model training stages, respectively. All stages incorporate the complete set of image data shown in Table \ref{tab:image_training_data}.
Figure 4: Qualitative examples for our FTFV-LLM.
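
The Figure 2 caption describes a weighted token sampler that compresses the tokens of each frame before they reach the LLM, and Figure 1 compares variants with 16 and 128 tokens per frame. This page does not give the sampler's formulation, so the following is only a minimal sketch of one plausible form of weighted token compression: softmax-weighted pooling against a set of query vectors. The function names, the query vectors, the input size of 256 tokens, the feature dimension, and the frame count of 8 are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_token_sampler(frame_tokens, queries):
    """Compress N frame tokens into K tokens by softmax-weighted pooling.

    frame_tokens: N vectors of dimension D (the visual tokens of one frame).
    queries:      K vectors of dimension D (hypothetical learned parameters).
    Returns K vectors of dimension D; each is a convex combination of the
    input tokens. This is an assumed formulation, not the paper's module.
    """
    dim = len(frame_tokens[0])
    compressed = []
    for query in queries:
        # Score every input token against this output slot.
        scores = [sum(q * t for q, t in zip(query, token)) for token in frame_tokens]
        weights = softmax(scores)
        # Weighted average of the input tokens, one coordinate at a time.
        pooled = [
            sum(w * token[d] for w, token in zip(weights, frame_tokens))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


def visual_token_budget(num_frames, tokens_per_frame):
    """Total visual tokens passed to the LLM for one video."""
    return num_frames * tokens_per_frame


random.seed(0)
N, K, D = 256, 16, 8  # illustrative sizes only
frame = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
compressed = weighted_token_sampler(frame, queries)

print(len(compressed), len(compressed[0]))
print(visual_token_budget(8, 16), visual_token_budget(8, 128))
```

Under these assumptions, the number of visual tokens passed to the LLM grows linearly with both the frame count and the per-frame token count, so reducing the tokens per frame reduces the visual sequence length by the same factor for a fixed number of frames.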
