GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu

Abstract

Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost of processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and erroneously selects irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.

Paper Structure

This paper contains 15 sections, 5 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Average accuracy of LLaVA-Video-7B on Video-MME, LongVideoBench, and MLVU for different methods: uniform sampling, BOLT [liu2025bolt], AKS [tang2025aks], and our GIFT.
  • Figure 2: The overall framework of GIFT. Given an input video and a user query, GIFT first calculates each frame's query relevance ($r$) and directed diversity ($d$) to quantify its irreplaceability (detailed in the left panel). The core of GIFT is a two-stage selection process (detailed in the right panel). Stage 1 performs the initial selection by identifying frames with high scores ($s = r \times d$). If the sampling budget ($K$) exceeds the batch size ($B$), the Budget-Aware Refinement in Stage 2 is triggered. This stage iteratively Selects a batch of $B$ frames, Removes them, and Updates $d$ for all remaining frames until the budget is met. This iterative update continuously Releases previously suppressed frames, progressively building a rich temporal context.
  • Figure 3: Ablation study of GIFT's modules on LLaVA-Video (frame budget: 32). The y-axis shows performance relative to uniform sampling (100%). "Standard Diversity" refers to using the decoupled evaluation criteria with standard diversity instead of our proposed directed diversity. "Without BAR" refers to disabling the Budget-Aware Refinement module.
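
The select, remove, update loop described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the captions do not define how $r$ and $d$ are computed, so the sketch assumes cosine similarity to a query embedding for $r$, and assumes that $d$ is one minus a frame's highest similarity to any remaining frame that is more relevant than it. The function names `directed_diversity` and `select_frames` are hypothetical.

```python
import numpy as np


def directed_diversity(sim, r, active):
    """Assumed form of directed diversity (not taken from the paper):
    a frame is penalized only by remaining frames that are more relevant."""
    d = np.ones(len(r))
    for i in np.flatnonzero(active):
        # Frames that can suppress frame i: still active and strictly more relevant.
        stronger = active & (r > r[i])
        if stronger.any():
            d[i] = 1.0 - sim[i, stronger].max()
    return d


def select_frames(frame_emb, query_emb, K, B):
    """Pick K frame indices in batches of B using the score s = r * d."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    r = np.clip(f @ q, 0.0, None)         # assumed relevance: cosine similarity to the query
    sim = np.clip(f @ f.T, 0.0, 1.0)      # pairwise frame similarity
    active = np.ones(len(r), dtype=bool)  # frames not yet selected
    selected = []
    # The first pass plays the role of Stage 1; later passes only run when K > B (Stage 2).
    while len(selected) < K and active.any():
        d = directed_diversity(sim, r, active)       # Update d over remaining frames
        s = np.where(active, r * d, -np.inf)
        take = min(B, K - len(selected), int(active.sum()))
        batch = np.argsort(-s)[:take]                # Select the top-scoring batch
        selected.extend(batch.tolist())
        active[batch] = False                        # Remove; frames they suppressed are released
    return sorted(selected)
```

Under these assumptions, a near-duplicate of a highly relevant frame scores close to zero in the first pass, then recovers a high score once the frame that suppressed it has been removed, which mirrors the "release" behavior the caption describes.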