TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models

Xiangtian Zheng, Zishuo Wang, Yuxin Peng

TL;DR

TiFRe tackles the efficiency bottleneck in Video MLLMs with a text-guided frame reduction pipeline: it selects key frames that are semantically relevant to the user prompt and merges information from non-key frames into them to preserve context. The framework has two components. Text-guided Frame Sampling (TFS) uses an LLM to turn the user input into a CLIP-style prompt and then scores each frame against that prompt with CLIP. Frame Matching and Merging (FMM) fuses non-key frames into their matched key frames through weighted averaging. On VNBench and MLVU, TiFRe reduces the number of input frames (e.g., from 55.2 to 8.6) while achieving higher accuracy than fixed-FPS baselines and existing frame-reduction methods across multiple LLM backbones. TiFRe is thus a prompt-responsive approach to scalable video understanding, with potential for deployment in long-form video QA and related tasks.
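
As a rough back-of-the-envelope reading of the frame counts quoted above (an illustration, not a result reported by the authors), assume that every frame contributes the same number of visual tokens, that text tokens are negligible, and that self-attention cost grows quadratically with sequence length. Under those assumptions:

```latex
\[
\frac{55.2}{8.6} \approx 6.4,
\qquad
1 - \frac{8.6}{55.2} \approx 0.84,
\qquad
\left(\frac{55.2}{8.6}\right)^{2} \approx 41 .
\]
```

That is, roughly 6.4 times fewer frames (about 84% removed), which would correspond to an attention cost about 41 times smaller if the quadratic assumption held exactly. Actual savings depend on the model, the tokens per frame, and the prompt length.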

Abstract

With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
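
The two steps described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes frame and prompt embeddings have already been computed (for example by CLIP encoders) and uses toy vectors instead. The function names, the nearest-key-frame matching rule, and the `key_weight` parameter are hypothetical; the source only states that non-key frames are merged into key frames by weighted averaging.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_key_frames(frame_embs, text_emb, k):
    """TFS-like step: pick the k frames most similar to the prompt embedding.

    Returns the selected frame indices in temporal order.
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    top = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def merge_non_key_frames(frame_feats, key_idx, key_weight=0.5):
    """FMM-like step: match each non-key frame to its most similar key frame,
    then blend the key frame with the mean of its matched frames.

    key_weight is a hypothetical mixing coefficient (assumption, not from the paper).
    """
    key_set = set(key_idx)
    groups = {i: [] for i in key_idx}
    for j, feat in enumerate(frame_feats):
        if j in key_set:
            continue
        # Assumed matching rule: nearest key frame by cosine similarity.
        best = max(key_idx, key=lambda i: cosine(frame_feats[i], feat))
        groups[best].append(feat)

    merged = []
    for i in key_idx:
        key = frame_feats[i]
        members = groups[i]
        if not members:
            # No non-key frame matched this key frame; keep it unchanged.
            merged.append(list(key))
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        merged.append([key_weight * a + (1 - key_weight) * b for a, b in zip(key, mean)])
    return merged
```

A typical use would call `select_key_frames` on per-frame embeddings and the prompt embedding, then pass the resulting indices to `merge_non_key_frames`, so that the model receives only `k` merged frame features instead of the full sequence.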

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Performance comparison on VNBench between our proposed method TiFRe and the state-of-the-art Video MLLMs Video-XL-7B [shu2024video] and Video-LLaVA [lin2023video]. Compared with Video-XL, TiFRe uses significantly fewer input frames while achieving higher performance. Compared with Video-LLaVA, TiFRe achieves a substantial performance improvement with the same number of input frames.
  • Figure 2: Comparison of two frame selection methods used in Video MLLMs. On the left, Fixed-FPS Key Frame Selection samples a fixed number of frames evenly across the video, which may lose semantic information and introduce redundant tokens. On the right, Text-guided Frame Reduction (TiFRe) selects frames that carry significant semantic information, especially those relevant to the text input. TiFRe's frame selection strategy better preserves video information and eliminates redundant frames, leading to a more accurate and effective response.
  • Figure 3: The framework of our proposed Text-guided Video Frame Reduction (TiFRe).
  • Figure 4: Examples of key frame selection. For each example, the first row shows the raw video, with ground-truth key frames highlighted by yellow boxes. The second row shows the frames selected by Video-LLaVA and the third row those selected by TiFRe (ours); key frames are highlighted with green boxes and non-key frames with red boxes.