Frame-Voyager: Learning to Query Frames for Video Large Language Models

Sicheng Yu; Chengkai Jin; Huanyu Wang; Zhenghao Chen; Sheng Jin; Zhongrong Zuo; Xiaolei Xu; Zhenbang Sun; Bingni Zhang; Jiawei Wu; Hao Zhang; Qianru Sun

TL;DR

Frame-Voyager addresses the token-length bottleneck of Video-LLMs by learning to select combinations of frames that maximize task-relevant information. It reframes frame selection as a supervised ranking problem: a pre-trained Video-LLM scores the $\binom{M}{T}$ possible $T$-frame combinations drawn from $M$ candidate frames, and a lightweight plug-in module is trained on the resulting ranking with a reward-based objective. Plugged into different Video-LLMs, the approach achieves state-of-the-art or competitive performance across four video QA benchmarks, with improvements on long videos and complex reasoning, while pruning and filtering reduce data-collection and computation costs. This work provides a practical, generalizable framework for efficient and effective video understanding with Video-LLMs, with potential extensions to longer videos and joint pre-training.

Abstract

Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
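The labeling step described in the abstract (traverse all $T$-frame combinations, score each with a Video-LLM, rank by loss) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `rank_frame_combinations` and `toy_loss` are hypothetical names, and `toy_loss` merely stands in for a real Video-LLM's prediction loss. The paper's filtering and pruning steps are not modeled.

```python
from itertools import combinations
from math import comb
from typing import Callable, List, Sequence, Tuple


def rank_frame_combinations(
    num_frames: int,
    subset_size: int,
    loss_fn: Callable[[Sequence[int]], float],
) -> List[Tuple[Tuple[int, ...], float]]:
    """Score every T-frame combination of M frames and sort by ascending loss."""
    scored = [
        (combo, loss_fn(combo))
        for combo in combinations(range(num_frames), subset_size)
    ]
    # Lower loss means the combination helped the model answer better.
    scored.sort(key=lambda item: item[1])
    return scored


def toy_loss(combo: Sequence[int]) -> float:
    """Hypothetical stand-in for a Video-LLM loss: counts missing key frames."""
    informative = {2, 5}  # assumed "answer-bearing" frames for this toy video
    return float(len(informative - set(combo)))


# Toy setting: M = 6 candidate frames, T = 2 frames per combination.
ranking = rank_frame_combinations(6, 2, toy_loss)
assert len(ranking) == comb(6, 2)  # the number of combinations is C(M, T)
best_combo, best_loss = ranking[0]
```

Because the number of combinations grows as $\binom{M}{T}$, exhaustive scoring like this is only feasible for small $M$ and $T$, which is why the cost of this step matters.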

Paper Structure (18 sections, 3 equations, 11 figures, 10 tables)

This paper contains 18 sections, 3 equations, 11 figures, 10 tables.

Introduction
Related Work
Frame-Voyager
Data Collection
Model Training and Inference
Experiments
Experiment Settings
Results and Analysis
Conclusions
Implementation Details of Frame Extraction Baselines in RQ1
Question type analysis of the generated datasets
Analysis of Computation Cost
Results on More Benchmarks
A Study on the Number of Candidate Frames
Dynamic Frame Selection
...and 3 more sections

Figures (11)

Figure 1: The data collection pipeline of Frame-Voyager. Given a video of $M$ frames, we traverse its $T$-frame combinations, feed them into a reference Video-LLM, and rank them based on that Video-LLM's prediction losses. Finally, we train Frame-Voyager to query the frame combinations with lower losses. Filtering steps are omitted from this figure for clarity. $\mathcal{C}(M, T)$ is the binomial coefficient representing the number of ways to choose $T$ items from $M$.
Figure 2: Training and inference processes of Frame-Voyager. During training, Frame-Voyager is fed all $M$ candidate frames and learns to rank $K$ combinations sampled from the pre-generated combination ranking data in Section \ref{ssec:data_construct}. Each combination contains $T$ frames. At inference, Frame-Voyager selects the top-$T$ frames with the highest rewards to form the predicted frame combination. No parameters are updated during inference.
Figure 3: Accuracies ($\%$) of uniform sampling and Frame-Voyager on Video-MME (without subtitles) as a function of the number of frames. Both methods use the same number of candidate frames ($128$).
Figure 4: RQ4. Performance of Frame-Voyager reusing different parts of VILA-8B on Video-MME (without subtitles).
Figure 5: RQ5. Performance of Frame-Voyager, CLIP, and uniform frame sampling on six question types of Video-MME.
...and 6 more figures
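The inference step in the Figure 2 caption (select the top-$T$ frames with the highest rewards) can be sketched as below. This is a minimal illustration, not the authors' implementation: `select_top_t_frames` is a hypothetical helper, the reward values are made up, and returning the chosen indices in temporal order is an assumption rather than something the caption states.

```python
from typing import List, Sequence


def select_top_t_frames(rewards: Sequence[float], t: int) -> List[int]:
    """Return indices of the t highest-reward frames, in temporal order."""
    if not 0 < t <= len(rewards):
        raise ValueError("t must be between 1 and the number of candidate frames")
    # Rank frame indices by reward, highest first.
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    # Assumption: restore temporal order before passing frames to the Video-LLM.
    return sorted(ranked[:t])


# Made-up per-frame rewards for M = 6 candidate frames; keep T = 3.
rewards = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
selected = select_top_t_frames(rewards, 3)
print(selected)  # [1, 3, 5]
```

Scoring each frame once and keeping the top $T$ avoids enumerating all combinations at inference time, in contrast to the exhaustive traversal used to build the training data.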
