Towards Fine-Grained Video Question Answering
Wei Dai, Alan Luo, Zane Durante, Debadutta Dash, Arnold Milstein, Kevin Schulman, Ehsan Adeli, Li Fei-Fei
TL;DR
This work addresses the need for fine-grained, temporally and spatially grounded VideoQA by introducing the MOMA-QA dataset, which provides frame-level scene graphs and temporal interval annotations to support precise localization and relational reasoning. It also presents SGVLM, a video-language model that integrates a Motif-based scene graph predictor, an efficient frame localizer, Q-Formers, and a pre-trained large language model to achieve strong zero-shot and fine-tuned performance, outperforming prior methods on MOMA-QA and several public datasets. The framework emphasizes entity-centric reasoning in crowded scenes, enabling detailed answers about relationships, motions, and descriptions, while offering interpretability through structured scene-graph grounding. Overall, the paper demonstrates that combining fine-grained annotations with graph-augmented grounding and LLM-based reasoning yields substantial improvements in VideoQA and long-form temporal localization, with potential impact on video understanding systems requiring precise spatio-temporal reasoning.
Abstract
In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets lack temporal and spatial granularity, which limits the capabilities of current VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground-truth scene graphs and temporal interval annotations, MOMA-QA is well suited to developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.
