KeyVideoLLM: Towards Large-scale Video Keyframe Selection

Hao Liang; Jiapeng Li; Tianyi Bai; Xijie Huang; Linzhuang Sun; Zhengren Wang; Conghui He; Bin Cui; Chong Chen; Wentao Zhang

TL;DR

The paper tackles the data management challenges of VideoLLMs caused by massive video data. It introduces KeyVideoLLM, a text-video frame similarity-guided, coarse-to-fine keyframe selector that leverages CLIP embeddings to extract frames relevant to queries for both training and inference. The approach achieves up to 60.9× data compression, up to 200× faster keyframe selection, and requires essentially no hyperparameter tuning, while improving QA performance across multiple benchmarks and maintaining SoTA results. The method generalizes across architectures and demonstrates the value of data-centric strategies for scalable, robust video understanding.

Abstract

Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference for VideoLLMs demand vast amounts of data, which poses significant data management challenges in efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a data compression rate of up to 60.9 times, substantially lowering disk space requirements. It also maintains a 100% selection success rate across all video formats and scales, speeds up processing by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond efficiency and robustness, KeyVideoLLM further improves model performance on video question-answering tasks during both training and inference. Notably, it consistently achieves state-of-the-art (SoTA) results on diverse datasets.
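The selection idea summarized above (score each frame by its similarity to the text query, in a coarse-to-fine manner) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text and frame embeddings have already been computed by a CLIP-style encoder and are passed in as plain lists of floats, and it assumes a fixed-stride subsampling for the coarse level followed by top-k cosine similarity for the fine level. The names `cosine`, `select_keyframes`, `coarse_stride`, and `k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_keyframes(
    text_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    coarse_stride: int = 2,
    k: int = 2,
) -> List[int]:
    """Return indices of up to k frames most similar to the text, in temporal order."""
    # Coarse level (assumption): keep every `coarse_stride`-th frame as a candidate.
    candidates = range(0, len(frame_embs), coarse_stride)
    # Fine level (assumption): rank candidates by text-frame cosine similarity.
    ranked = sorted(
        candidates,
        key=lambda i: cosine(text_emb, frame_embs[i]),
        reverse=True,
    )
    # Keep the top-k candidates and restore their original temporal order.
    return sorted(ranked[:k])


# Toy 2-D embeddings standing in for real encoder outputs (illustrative only).
query = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0],
          [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(select_keyframes(query, frames))  # [0, 4]
```

With a stride of 2 the candidates are frames 0, 2, 4, and 6; frames 0 and 4 point most nearly in the query's direction, so they are kept. Frame 3 matches the query exactly but is never scored, which shows the speed-versus-recall trade-off that any coarse pre-filter of this kind introduces.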

Paper Structure (29 sections, 2 equations, 6 figures, 6 tables)

Introduction
Related Work
Video Multimodal Models.
Keyframe Selection for Video Multimodal Models.
Data-Centric LLMs and Data Selection Methods
Method
Keyframe Selection
Cluster
Video Summarization
Text-Video Frame Similarity Based Keyframe Selection
Coarse Level Keyframe Selection
Fine Level Keyframe Selection
CLIP-based Keyframe Selection for VideoLLM Training
CLIP-based Keyframe Selection for VideoLLM Inference
Experiments
...and 14 more sections

Figures (6)

Figure 1: In (a), uniform frame selection often yields frames that lack the information required to answer the question. In contrast, KeyVideoLLM ensures the selected frames contain the necessary information. In (b), KeyVideoLLM improves the performance of VideoLLM across all benchmarks.
Figure 2: Comparison of three methods for keyframe selection. To the best of our knowledge, this is the first study to select video frames using text-video frame matching for VideoLLMs.
Figure 3: We propose a frame selection method based on text-video frame matching. The method follows a coarse-to-fine framework. We use pre-trained models to select information from text and video frames.
Figure 4: We use all the keyframe selection methods in the instruction tuning stage.
Figure 5: Compression Ratios of Various Methods on Different Datasets. The graph illustrates the compression ratios achieved by our model KeyVideoLLM compared to Katna and DSNet across five different datasets. Higher ratios indicate more efficient compression, demonstrating the superior computational and storage efficiency of our approach.
...and 1 more figure
