GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos
TL;DR
GHR-VQA tackles Video QA by enforcing explicit, human-centered relational reasoning through a video-level graph anchored on human actors. It converts each frame into a scene graph, links the frame graphs via a shared root, and encodes them with a 2-layer HetEdgeGAT, followed by a hierarchical CRN-based reasoning network conditioned on the question. The method achieves strong performance on AGQA, notably improving object-relational reasoning, and its explicit human-object graphs provide interpretable intermediate representations. The approach reduces reliance on raw pixel features and shows potential for efficient, explainable video understanding on constrained hardware and in multi-human scenarios.
Abstract
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph, and human nodes across frames are linked to a global root, forming a video-level graph that enables cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), which transform them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a deeper understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
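The graph construction described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the `VideoGraph` container, the `build_video_graph` function, the `"root_link"` edge label, and the assumption of exactly one human per frame are all hypothetical choices made for illustration. Each frame is given as a list of (relation, object) pairs whose implicit subject is the human in that frame.

```python
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    # Each node is (node_id, kind, label, frame_index); the root has frame_index None.
    nodes: list = field(default_factory=list)
    # Each edge is (source_id, relation, target_id).
    edges: list = field(default_factory=list)


def build_video_graph(frames):
    """Build a human-rooted video-level graph from per-frame scene graphs.

    `frames` is a list with one entry per frame; each entry is a list of
    (relation, object_label) pairs describing what the human does in that
    frame. Assumes a single human per frame (an illustrative simplification).
    """
    graph = VideoGraph()
    root_id = 0
    graph.nodes.append((root_id, "root", "video", None))

    for frame_index, pairs in enumerate(frames):
        # One human node per frame, linked to the shared global root.
        human_id = len(graph.nodes)
        graph.nodes.append((human_id, "human", "person", frame_index))
        graph.edges.append((root_id, "root_link", human_id))

        # Object nodes are created once per frame and reused across relations.
        object_ids = {}
        for relation, object_label in pairs:
            if object_label not in object_ids:
                object_ids[object_label] = len(graph.nodes)
                graph.nodes.append(
                    (object_ids[object_label], "object", object_label, frame_index)
                )
            graph.edges.append((human_id, relation, object_ids[object_label]))

    return graph


# Two toy frames: the person holds and looks at a cup, then sits on a chair.
example = build_video_graph([
    [("holding", "cup"), ("looking_at", "cup")],
    [("sitting_on", "chair")],
])
```

In this sketch the root connects only to human nodes, so any path between objects in different frames passes through the human actors, which mirrors the human-centered cross-frame structure the abstract describes. A GNN encoder would then operate on the typed nodes and labeled edges; that stage is not shown here.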
