TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
Xiangtian Zheng, Zishuo Wang, Yuxin Peng
TL;DR
TiFRe tackles the efficiency bottleneck in Video MLLMs with a text-guided frame reduction pipeline that selects key frames relevant to the user prompt and merges information from non-key frames into them to preserve context. The framework has two components: Text-guided Frame Sampling (TFS), which uses an LLM to turn the user input into a CLIP-style prompt and then scores each frame against that prompt with CLIP, and Frame Matching and Merging (FMM), which semantically fuses non-key frames into key frames through weighted averaging. Empirical results on VNBench and MLVU show that TiFRe reduces the number of input frames (e.g., from 55.2 to 8.6) while achieving higher accuracy than fixed-FPS baselines and existing frame-reduction methods across multiple LLM backbones. TiFRe thus offers a practical, prompt-responsive approach to scalable video understanding, with potential for real-world deployment in long-form video QA and related tasks.
Abstract
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
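The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the frame and prompt embeddings have already been computed (for example by CLIP encoders) and are passed in as plain lists of floats. The function names, the top-k selection rule, the matching of each non-key frame to its most similar key frame, and the `key_weight` mixing coefficient are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_key_frames(frame_embs, text_emb, k):
    """TFS-style step (assumed form): keep the k frames most similar to the prompt."""
    scores = [cosine(f, text_emb) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order


def merge_non_key_frames(frame_feats, key_idx, key_weight=0.7):
    """FMM-style step (assumed form): fold each non-key frame into its closest key frame.

    Each key frame becomes key_weight * key + (1 - key_weight) * mean(matched frames).
    Key frames with no matched non-key frames are returned unchanged.
    """
    key_set = set(key_idx)
    groups = {k: [] for k in key_idx}
    for i, feat in enumerate(frame_feats):
        if i in key_set:
            continue
        # Match the non-key frame to the most similar key frame.
        best = max(key_idx, key=lambda k: cosine(feat, frame_feats[k]))
        groups[best].append(feat)

    merged = []
    for k in key_idx:
        key = frame_feats[k]
        members = groups[k]
        if not members:
            merged.append(list(key))
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        merged.append([key_weight * a + (1 - key_weight) * b for a, b in zip(key, mean)])
    return merged


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real CLIP embeddings would be much higher-dimensional.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    prompt = [1.0, 0.0]
    keys = select_key_frames(frames, prompt, k=2)
    print(keys, merge_non_key_frames(frames, keys))
```

In this sketch the number of frames passed on to the language model is fixed by `k`, while the merge step keeps some information from the discarded frames in the retained ones. How the paper chooses the number of key frames and the merge weights is not specified in the text above.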
