End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao
TL;DR
VidF4 addresses VideoQA with an end-to-end frame-selection framework that learns to identify the moments in a video that are informative for a given question. It introduces three frame-scoring mechanisms (Question-Frame Similarity, Question-Frame Matching, and Inter-Frame Distinctiveness) and couples them with a differentiable adaptive frame sampler based on Weighted Reservoir Sampling and RelaxedTopK, so that gradients flow from the answer generator back to the frame selector. The system combines a cross-modal encoder (Q-former) with a large language model to generate answers from a compact, relevant subset of frames, and achieves state-of-the-art results on NExT-QA, STAR, and TVQA. Ablations validate each scoring component and analyze the trade-offs between training and inference frame budgets, offering practical guidance for deploying VidF4 under computational constraints. Overall, the work shows that jointly optimizing frame selection and answer generation improves both accuracy and efficiency on standard VideoQA benchmarks.
Abstract
Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interactions between the visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context that VideoQA requires. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame-selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame-sampling mechanism that enables end-to-end training of the frame selector and the answer generator. Experimental results on three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new state of the art on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Finally, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.
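
To make the idea of score-driven frame sampling concrete, the following is a minimal sketch, not the VidF4 implementation. It assumes that the three per-frame scores are simply summed and normalized with a softmax (the actual combination rule is not stated here), and it uses the classic Efraimidis-Spirakis form of weighted reservoir sampling to draw k distinct frames. The names `combine_scores` and `weighted_reservoir_sample`, and the example score values, are hypothetical. The sketch covers only the discrete sampling step; the differentiable relaxation (RelaxedTopK) that lets gradients reach the frame selector is not reproduced.

```python
import math
import random


def combine_scores(qfs, qfm, ifd):
    """Turn three per-frame score lists into sampling weights.

    Assumption for illustration: the scores are summed per frame and
    passed through a softmax. The paper's actual rule may differ.
    """
    summed = [a + b + c for a, b, c in zip(qfs, qfm, ifd)]
    peak = max(summed)
    exps = [math.exp(s - peak) for s in summed]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def weighted_reservoir_sample(weights, k, rng):
    """Draw up to k distinct frame indices, favoring larger weights.

    Efraimidis-Spirakis scheme: each frame i gets the key u_i ** (1 / w_i)
    with u_i uniform in [0, 1), and the k largest keys are kept.
    """
    keyed = []
    for index, weight in enumerate(weights):
        if weight <= 0:
            continue  # frames with non-positive weight are never chosen
        u = rng.random()
        keyed.append((u ** (1.0 / weight), index))
    keyed.sort(reverse=True)
    # Return the chosen frames in temporal order.
    return sorted(index for _, index in keyed[:k])


# Hypothetical scores for an 8-frame clip (values are made up).
qfs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.3, 0.2]
qfm = [0.0, 0.1, 0.8, 0.9, 0.2, 0.1, 0.2, 0.1]
ifd = [0.5, 0.1, 0.6, 0.2, 0.4, 0.1, 0.5, 0.1]

weights = combine_scores(qfs, qfm, ifd)
selected = weighted_reservoir_sample(weights, k=4, rng=random.Random(0))
print(selected)
```

Because the draw is random rather than a hard top-k, lower-scored frames still have a chance of being selected, which is one plausible reason to prefer sampling over a deterministic cut during training.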
