Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

Minchan Kwon, Hyounguk Shon, Junmo Kim

Abstract

Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An overview of our proposed question-aware keyframe selection framework. The model processes the video with an image encoder and the question with a text encoder. The question embedding is then used in two parallel streams: (1) it is fed into a Gaussian Generator to create an initial keyframe distribution, and (2) it is used to generate synthetic keyframes via CLIP similarity and VLM ranking, which serve as weak supervision. To adapt the selection process, the LMM's own backbone is re-purposed with a prompt to determine the question type. This information guides the Question-Conditioned Coverage Regularization (QCCR) module, which refines the initial distribution. Finally, the predicted keyframes are passed to the downstream LMM to generate the final answer.
  • Figure 2: A qualitative example illustrating the effectiveness of VLM Ranking over CLIP similarity. For the question "What does the child do after ending the call?", frames with high CLIP scores are often redundant, whereas VLM Ranking correctly identifies the temporally relevant frames that occur after the call has ended.
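
As a rough illustration of the selection stage described in the Figure 1 caption, the sketch below builds a frame-level distribution from Gaussian components and then picks frames greedily under a minimum temporal gap. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the means and standard deviations are set by hand (in the paper they would presumably come from the Gaussian Generator), and the minimum-gap rule is only a simple stand-in for coverage, not the QCCR module itself.

```python
import numpy as np


def gaussian_frame_distribution(num_frames, means, stds, weights=None):
    """Mixture of Gaussians over normalized frame positions in [0, 1].

    Hypothetical helper: `means` and `stds` are given by hand here.
    Returns a length-`num_frames` array that sums to 1.
    """
    positions = np.linspace(0.0, 1.0, num_frames)        # shape (T,)
    means = np.asarray(means, dtype=float)[:, None]      # shape (K, 1)
    stds = np.asarray(stds, dtype=float)[:, None]        # shape (K, 1)
    if weights is None:
        weights = np.ones(means.shape[0])
    weights = np.asarray(weights, dtype=float)[:, None]  # shape (K, 1)
    # Unnormalized Gaussian bump per component, evaluated at every frame.
    density = weights * np.exp(-0.5 * ((positions[None, :] - means) / stds) ** 2)
    scores = density.sum(axis=0)
    return scores / scores.sum()


def select_with_min_gap(scores, k, min_gap):
    """Greedily take the highest-scoring frames, skipping any frame that is
    closer than `min_gap` indices to one already chosen.

    Assumption: this gap rule is an illustrative proxy for temporal coverage.
    May return fewer than `k` frames if the gap cannot be satisfied.
    """
    order = np.argsort(-np.asarray(scores), kind="stable")
    chosen = []
    for idx in order:
        idx = int(idx)
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(idx)
        if len(chosen) == k:
            break
    return sorted(chosen)


# One component centred mid-video, 11 frames.
dist = gaussian_frame_distribution(11, means=[0.5], stds=[0.2])
plain_topk = select_with_min_gap(dist, k=3, min_gap=1)  # no spacing constraint
spread = select_with_min_gap(dist, k=3, min_gap=3)      # enforce temporal spread
print(plain_topk, spread)
```

With these toy inputs, plain top-3 selection returns the three adjacent frames 4, 5, and 6, while a gap of 3 spreads the picks to frames 2, 5, and 8. This mirrors, in a very simplified form, the redundancy among high-scoring frames that the Figure 2 caption describes.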