GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos

TL;DR

GHR-VQA tackles VideoQA by enforcing explicit, human-centered relational reasoning over a video-level graph anchored on human actors. It converts each frame into a scene graph, links the human nodes of the frame graphs to a shared global root, encodes the resulting graph with a 2-layer HetEdgeGAT, and passes the embeddings to a hierarchical CRN-based reasoning network conditioned on the question. The method achieves strong performance on AGQA, with a 7.3% improvement in object-relation reasoning over the state of the art, and its explicit human-object graphs provide interpretable intermediate representations. Because it reduces reliance on raw pixel features, the approach shows potential for efficient, explainable video understanding on constrained hardware and in multi-human scenarios.

Abstract

We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
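The video-level graph described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the node identifiers, the `ROOT` constant, the `root_to_human` relation name, the edge direction, and the convention that the label `person` marks the human actor are all assumptions made for illustration. Node features and the GNN encoder are omitted.

```python
from dataclasses import dataclass, field

ROOT = "root"  # hypothetical id for the global root node


@dataclass
class VideoGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, relation, target id)


def build_video_graph(frame_graphs):
    """Merge per-frame scene graphs into one video-level graph.

    frame_graphs: one list per frame of (subject, relation, object) triplets.
    Assumption: the label "person" denotes the human actor in a frame.
    """
    graph = VideoGraph()
    graph.nodes[ROOT] = "root"
    for t, triplets in enumerate(frame_graphs):
        for subj, rel, obj in triplets:
            src, dst = f"f{t}:{subj}", f"f{t}:{obj}"
            for node_id, label in ((src, subj), (dst, obj)):
                if node_id in graph.nodes:
                    continue
                is_human = label == "person"
                graph.nodes[node_id] = "human" if is_human else "object"
                if is_human:
                    # Link each frame's human node to the shared root so that
                    # information can flow across frames through the root.
                    graph.edges.append((ROOT, "root_to_human", node_id))
            # Keep the frame-level human-object relation as a typed edge.
            graph.edges.append((src, rel, dst))
    return graph


frames = [
    [("person", "holding", "cup")],
    [("person", "holding", "cup"), ("person", "sitting_on", "chair")],
]
video_graph = build_video_graph(frames)
root_edges = [e for e in video_graph.edges if e[0] == ROOT]
```

In this toy input, each frame contributes its own human and object nodes, and the only connections between frames pass through the root, which is what makes the cross-frame reasoning human-centered.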

Paper Structure

This paper contains 23 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Our proposed architecture. The input is a question and its corresponding video. We first perform clip selection and pass the selected segments through a scene graph generation (SGG) model to extract scene graphs that represent the visual elements and their interrelationships. A GNN processes these scene graphs and generates embeddings, which are fed into a hierarchical network that integrates and contextualizes the information across different levels of abstraction with respect to the question and produces the answer.
  • Figure 2: Example of 4 frames from the video sample OCGMQ with the corresponding annotated scene graphs.