End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao

TL;DR

VidF4 tackles VideoQA with an end-to-end frame-selection framework that learns to identify informative moments conditioned on the question. It introduces three frame-scoring mechanisms (Question-Frame Similarity, Question-Frame Matching, and Inter-Frame Distinctiveness) coupled with a differentiable adaptive frame sampler based on Weighted Reservoir Sampling and RelaxedTopK, so that gradients can flow from the answer generator back to the frame selector. The system integrates a cross-modal encoder (Q-former) and a large language model to generate answers from a compact, relevant subset of frames, achieving state-of-the-art results on NExT-QA, STAR, and TVQA. Extensive ablations validate each scoring component and analyze the trade-off between training and inference frame budgets, offering practical guidance for deploying VidF4 under computational constraints. Overall, the work demonstrates that end-to-end optimization of frame selection and answer generation improves both accuracy and efficiency on established VideoQA benchmarks.

Abstract

Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interaction between visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context that VideoQA requires. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and answer generator. Experimental results on three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new state of the art (SOTA) on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Furthermore, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.
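
To make the idea of a differentiable frame sampler more concrete, the following is a minimal sketch of one common relaxation of top-k selection: perturb the log-weights with Gumbel noise, then apply k successive softmaxes, down-weighting already-selected items each round. This is an illustrative assumption rather than the authors' implementation; the function names `softmax` and `relaxed_top_k`, the `temperature` parameter, and the clamping constant are all hypothetical. It is written in plain Python for readability; in training, the same arithmetic would be expressed in an autodiff framework so that gradients reach the frame scores.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_top_k(weights, k, temperature=0.5, rng=None):
    """Return a soft k-hot mask over frames (hypothetical helper).

    weights: strictly positive frame scores (assumed, not from the paper).
    The returned mask sums to k; lower temperatures push it toward a
    hard selection of k frames.
    """
    rng = rng or random.Random()
    # Gumbel-perturbed log-weights act as random selection keys.
    keys = []
    for w in weights:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        keys.append(math.log(w) - math.log(-math.log(u)))

    soft_mask = [0.0] * len(weights)
    for _ in range(k):
        probs = softmax([key / temperature for key in keys])
        soft_mask = [m + p for m, p in zip(soft_mask, probs)]
        # Suppress items that already received most of this round's mass.
        keys = [key + math.log(max(1.0 - p, 1e-12))
                for key, p in zip(keys, probs)]
    return soft_mask


frame_scores = [0.05, 0.9, 0.1, 0.8, 0.05, 0.7]
mask = relaxed_top_k(frame_scores, k=3, rng=random.Random(0))
```

Because every round contributes a softmax that sums to one, the mask always sums to k, which keeps the downstream frame budget fixed while every operation stays smooth in the scores.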

Paper Structure (24 sections, 9 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Dataflow of VideoQA models. The width of the arrow represents the amount of video frames.
  • Figure 2: The overall framework, VidF4, consists of two main components: the Answer Generator and the Frame Selector. Within the Frame Selector, three scoring mechanisms (QFS, QFM, and IFD) evaluate the relevance of each frame to the given question. In addition, a differentiable sampler uses the final score of each frame as its sampling weight. This adaptive sampler enables VidF4 to be trained end to end.
  • Figure 3: Model performance and FLOPs comparison on STAR using different numbers of frames during inference, where the same color represents the same model checkpoint. For example, the brown stars denote VidF4(8), the model trained with $k = 8$ and evaluated with various $k'$ during inference ($k' \in \{8,16,24,32\}$).
  • Figure 4: Case study of VidF4 using 8 frames. All selected frames are arranged sequentially from left to right and top to bottom. VidF4 w/o QFM misses the frames needed to answer Q1. VidF4 w/o IFD selects only the most relevant but redundant frames and is thus unable to answer Q2 to Q5. VidF4 successfully identifies the required frames (in green boxes) for all of these cases, which are drawn from the STAR test set.
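
The Figure 2 caption describes a sampler that uses each frame's final score as its sampling weight. A minimal sketch of weighted sampling without replacement, using the classic key-based reservoir scheme (each item draws a key u^(1/w) and the k largest keys are kept), might look like the following. The function name `weighted_reservoir_sample` and the example scores are hypothetical, and this hard-selection version is not differentiable on its own, so it illustrates only the sampling step, not the end-to-end training described in the paper.

```python
import heapq
import random


def weighted_reservoir_sample(weights, k, rng=None):
    """Pick up to k distinct indices, favoring larger weights.

    Processes weights as a stream, keeping a min-heap of the k
    largest random keys seen so far (hypothetical helper).
    """
    rng = rng or random.Random()
    heap = []  # entries are (key, index); heap[0] is the smallest key kept
    for idx, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight frames can never be selected
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids u == 0
        key = u ** (1.0 / w)    # larger weights push the key toward 1
        if len(heap) < k:
            heapq.heappush(heap, (key, idx))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx))
    # Return indices in temporal order so frames stay sequential.
    return sorted(idx for _, idx in heap)


frame_scores = [0.05, 0.9, 0.1, 0.8, 0.05, 0.7]
picked = weighted_reservoir_sample(frame_scores, k=3, rng=random.Random(0))
```

Because selection is random rather than a plain top-k, lower-scored frames still have some chance of being picked.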