Towards Fine-Grained Video Question Answering

Wei Dai, Alan Luo, Zane Durante, Debadutta Dash, Arnold Milstein, Kevin Schulman, Ehsan Adeli, Li Fei-Fei

TL;DR

This work addresses the need for fine-grained, temporally and spatially grounded VideoQA by introducing the MOMA-QA dataset, which provides frame-level scene graphs and temporal interval annotations to support precise localization and relational reasoning. It also presents SGVLM, a video-language model that integrates a Motif-based scene graph predictor, an efficient frame localizer, Q-Formers, and a pre-trained large language model to achieve strong zero-shot and fine-tuned performance, outperforming prior methods on MOMA-QA and several public datasets. The framework emphasizes entity-centric reasoning in crowded scenes, enabling detailed answers about relationships, motions, and descriptions, while offering interpretability through structured scene-graph grounding. Overall, the paper demonstrates that combining fine-grained annotations with graph-augmented grounding and LLM-based reasoning yields substantial improvements in VideoQA and long-form temporal localization, with potential impact on video understanding systems requiring precise spatio-temporal reasoning.

Abstract

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.

Paper Structure

This paper contains 24 sections, 16 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Visualizations of Sample Questions from MOMA-QA. We illustrate the three distinct types of questions in our dataset, each representing a different category for video question answering. All questions in our dataset are generated from a human-annotated spatio-temporal scene graph (shown on the right). The node of interest for the relationship and motion questions is colored red in the scene graph and outlined in the video.
  • Figure 2: Statistics of MOMA-QA. (a) The distribution of the number of actors. (b) The percentage of each question type in MOMA-QA. (c) The distribution of question lengths in MOMA-QA in words. (d) The percentage of box-augmented questions.
  • Figure 3: Model Architecture of SGVLM. The model employs a frame encoder to extract frame embeddings from the input video, which are subsequently used by a Scene Graph (SG) Predictor to generate scene graph embeddings. These embeddings are then concatenated with the frame features. The combination, along with question embeddings, is processed by a transformer encoder in the Frame Localizer to produce similarity scores for identifying relevant frames. Key frame features are then processed by Frame Q-Former and SG Q-Former to align with the language query and scene graph features. An LLM finally generates answers using a structured representation of scene graph and frame data, merged with the natural language question.
  • Figure 4: Self-Attention Mask of the Transformer Encoder in Frame Localizer. To separate frame and scene graph tokens, we mask out portions of the input with $-\infty$.
  • Figure 5: Visualization Results of SGVLM Compared with the Previous SoTA (SeViLA) on MOMA-QA. Left: An example where SGVLM makes the correct prediction while SeViLA fails. Right: An example where both our model and SeViLA produce incorrect answers. We magnify the part of the frame that is relevant to the question for better readability.
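
The masking idea in the Figure 4 caption can be illustrated with a small sketch. The caption only states that portions of the self-attention input are masked with $-\infty$ to separate frame tokens from scene graph tokens; the exact mask pattern is not given here. The block-diagonal layout below (frame tokens attend only to frame tokens, scene graph tokens only to scene graph tokens), the token ordering, and the helper names `build_block_mask` and `masked_softmax` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def build_block_mask(n_frame, n_sg):
    """Additive attention mask: 0.0 keeps a query/key pair, -inf blocks it.

    Assumed layout (hypothetical): indices [0, n_frame) are frame tokens,
    indices [n_frame, n_frame + n_sg) are scene graph tokens.
    """
    n = n_frame + n_sg
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Block any pair whose two tokens belong to different groups.
            if (i < n_frame) != (j < n_frame):
                mask[i][j] = -math.inf
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over one query's scores after adding its mask row."""
    logits = [s + m for s, m in zip(scores, mask_row)]
    peak = max(logits)  # finite, since a token can always attend to itself
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = build_block_mask(n_frame=2, n_sg=3)
    # Uniform raw scores for the first frame token as the query.
    weights = masked_softmax([1.0] * 5, mask[0])
    print(weights)
```

Adding $-\infty$ before the softmax drives the blocked attention weights to exactly zero, so the remaining weights still sum to one over the permitted tokens. Under the assumed layout, a frame token distributes all of its attention over frame tokens only.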