Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai "Helen" Li, Yiran Chen

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
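To make the contrastive training objective concrete, here is a minimal sketch of an InfoNCE-style loss for a query-conditioned evidence scorer. The dot-product scorer and the function names (`evidence_scores`, `contrastive_loss`) are illustrative stand-ins, not the paper's actual network; the paper trains a learned scoring module, assumed here to be replaceable by any differentiable score function.

```python
import math

def evidence_scores(frame_embs, query_emb):
    # Dot-product stand-in (assumption) for the paper's learned
    # query-conditioned evidence scoring network.
    return [sum(f * q for f, q in zip(frame, query_emb)) for frame in frame_embs]

def contrastive_loss(frame_embs, query_emb, positive_idx, temperature=0.1):
    # InfoNCE-style objective: the evidence (positive) frame competes
    # against the remaining frames of the same video as negatives.
    logits = [s / temperature for s in evidence_scores(frame_embs, query_emb)]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[positive_idx]  # -log softmax(positive)
```

Minimizing this loss pushes the score of the annotated evidence frame above those of non-evidential frames, which is what lets independent frame-level scores approximate the mutual-information objective at selection time.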

Paper Structure

This paper contains 23 sections, 13 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Semantic relevance vs. evidential usefulness in keyframe selection. Frames with the highest semantic similarity (blue), as predicted by CLIP, depicting poem content (left) or general match scenes (right), are not always informative for answering the query. In contrast, frames with the highest evidence scores (red) predicted by our method better capture answer-critical information. The green regions indicate the temporal segments containing the key evidence needed for answering.
  • Figure 2: Overview of the query-conditioned evidence scoring network. The input video is uniformly sampled into frames, which are encoded into frame embeddings and scored to obtain frame-level evidence scores conditioned on the query. Frames with the highest evidence scores are then selected and fed into an MLLM to generate the final response.
  • Figure S1: Comparison between uniform sampling and our keyframe sampling on long-form video examples. Red boxes highlight the query-relevant frames containing the critical visual evidence. Our method successfully captures the query-relevant frames, enabling Qwen2-VL-7B to identify the correct visual evidence and produce the correct answer, whereas uniform sampling misses these critical moments.
  • Figure S2: Comparison between uniform sampling and our keyframe sampling. Red boxes indicate the key evidence frames required to answer the query. Our method selects these informative moments, allowing Qwen2.5-VL-7B to answer correctly, while uniform sampling overlooks them.
  • Figure S3: Comparison between uniform sampling and our keyframe sampling. Red boxes denote the critical frames that contain the necessary visual cues. By capturing these brief but informative moments, our method enables LLaVA-Video-7B to arrive at the correct answer, whereas uniform sampling fails.
  • ...and 1 more figure
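The selection pipeline summarized in Figure 2 (uniformly sample, score each frame against the query, keep the top-k under the token budget) can be sketched as follows. The dot-product scorer is again a hypothetical stand-in for the trained scoring network, and `select_keyframes` is an illustrative name:

```python
def select_keyframes(frame_embs, query_emb, k):
    # The paper's decomposition reduces combinatorial subset selection
    # to independent per-frame scoring: score each frame against the
    # query, take the k highest, and restore temporal order before
    # feeding the frames to the MLLM.
    scores = [sum(f * q for f, q in zip(frame, query_emb)) for frame in frame_embs]
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(topk)  # temporal order, not score order
```

Because each frame is scored independently, selection is linear in the number of sampled frames, which is what makes the method practical under strict token budgets on long-form videos.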