HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu
TL;DR
HyperGLM addresses a limitation of prior Video Scene Graph Generation methods, which model only pairwise relationships, by unifying spatial entity scene graphs with a procedural graph of their transitions over time into a single Video Scene HyperGraph that supports higher-order reasoning across frames. This HyperGraph is injected into a multimodal LLM, which is evaluated on five tasks: Scene Graph Generation (SGG), Scene Graph Anticipation (SGA), Video Question Answering (VQA), Video Captioning (VC), and Relation Reasoning (RR). The authors report that HyperGLM outperforms state-of-the-art methods on all five. They also introduce the Video Scene Graph Reasoning (VSGR) dataset, with 1.9M frames from third-person, egocentric, and drone views, to benchmark complex reasoning in diverse video contexts.
Abstract
Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset, which features 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
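To make the difference between pairwise edges and hyperedges concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes per-frame relations are stored as (frame, subject, predicate, object) tuples and that procedural transitions link an earlier relation to a later one. The class name `SceneHyperGraph`, its methods, and the grouping rule (one hyperedge per chain of linked relations) are hypothetical choices made for illustration only; the abstract does not specify how the HyperGraph is constructed.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple

# One spatial relation observed in one frame: (frame, subject, predicate, object).
Relation = Tuple[int, str, str, str]


@dataclass
class SceneHyperGraph:
    """Toy container. All names are hypothetical, not taken from the paper."""

    relations: List[Relation] = field(default_factory=list)
    # Procedural edges: pairs of indices into `relations` (earlier -> later).
    transitions: List[Tuple[int, int]] = field(default_factory=list)

    def add_relation(self, frame: int, subj: str, pred: str, obj: str) -> int:
        """Store a pairwise relation and return its index."""
        self.relations.append((frame, subj, pred, obj))
        return len(self.relations) - 1

    def add_transition(self, src: int, dst: int) -> None:
        """Record that relation `src` is followed by relation `dst`."""
        self.transitions.append((src, dst))

    def hyperedges(self) -> List[FrozenSet[str]]:
        """Group relations linked by transitions (union-find). Each group
        yields one hyperedge holding every entity that appears in it."""
        parent = list(range(len(self.relations)))

        def find(i: int) -> int:
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for src, dst in self.transitions:
            parent[find(src)] = find(dst)

        groups: Dict[int, Set[str]] = {}
        for idx, (_, subj, _, obj) in enumerate(self.relations):
            groups.setdefault(find(idx), set()).update((subj, obj))
        return [frozenset(g) for g in groups.values()]


# Illustrative data only; the entities and predicates are made up.
graph = SceneHyperGraph()
r0 = graph.add_relation(0, "person", "holding", "cup")
r1 = graph.add_relation(1, "person", "drinking_from", "cup")
r2 = graph.add_relation(2, "cup", "on", "table")
r3 = graph.add_relation(2, "dog", "lying_on", "floor")
graph.add_transition(r0, r1)
graph.add_transition(r1, r2)
edges = graph.hyperedges()
```

In this toy example, the three relations chained by transitions collapse into a single hyperedge connecting person, cup, and table, a three-way link that no single pairwise edge can express, while the unrelated dog and floor relation stays in its own hyperedge.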
