HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu

TL;DR

HyperGLM addresses the limitations of traditional Video Scene Graph Generation by unifying spatial object graphs with a procedural temporal graph into a Video Scene HyperGraph, enabling higher-order reasoning across frames. This HyperGraph is injected into a multimodal LLM to perform Scene Graph Generation (SGG), Scene Graph Anticipation (SGA), Video Question Answering (VQA), Video Captioning (VC), and Relation Reasoning (RR), with state-of-the-art performance reported on all five tasks. The authors also release the Video Scene Graph Reasoning (VSGR) dataset, with 1.9M frames from multiple viewpoints, to benchmark complex reasoning in diverse video contexts. Empirically, HyperGLM models complex interactions in varied scenes better than prior methods, highlighting the practical value of combining structured hypergraphs with LLM-based reasoning for video understanding.

Abstract

Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset that features 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across the five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
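
To make the structure described in the abstract concrete, the following is a minimal sketch of a container that holds entity-graph triplets, procedural transitions between relationships, and hyperedges. It is an illustration only, not the authors' implementation: the names `Triplet`, `SceneHyperGraph`, `derive_transitions`, and `add_hyperedge` are hypothetical, and the sketch assumes at most one relationship per (subject, object) pair per frame. The example data follows the "person sitting on couch, holding, then playing guitar" case from the Figure 2 caption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One entity-graph edge: subject -relationship-> object in a given frame."""
    frame: int
    subject: str
    relationship: str
    obj: str


@dataclass
class SceneHyperGraph:
    """Toy container (hypothetical, for illustration only)."""
    triplets: list = field(default_factory=list)
    transitions: set = field(default_factory=set)   # procedural edges (rel_from, rel_to)
    hyperedges: list = field(default_factory=list)  # each hyperedge is a frozenset of node names

    def add_triplet(self, frame, subject, relationship, obj):
        self.triplets.append(Triplet(frame, subject, relationship, obj))

    def derive_transitions(self):
        """Add a procedural edge whenever a pair's relationship changes over time.

        Assumption: at most one relationship per (subject, object) pair per frame.
        """
        by_pair = {}
        for t in sorted(self.triplets, key=lambda t: t.frame):
            by_pair.setdefault((t.subject, t.obj), []).append(t.relationship)
        for rels in by_pair.values():
            for rel_from, rel_to in zip(rels, rels[1:]):
                if rel_from != rel_to:
                    self.transitions.add((rel_from, rel_to))

    def add_hyperedge(self, *nodes):
        """Group any number of entity and relationship nodes into one hyperedge."""
        self.hyperedges.append(frozenset(nodes))


# Two frames: the person stays on the couch while holding turns into playing.
g = SceneHyperGraph()
g.add_triplet(0, "person", "sitting on", "couch")
g.add_triplet(0, "person", "holding", "guitar")
g.add_triplet(1, "person", "sitting on", "couch")
g.add_triplet(1, "person", "playing", "guitar")
g.derive_transitions()
g.add_hyperedge("person", "couch", "guitar", "sitting on", "holding", "playing")
```

Unlike a pairwise edge, the single hyperedge here ties three entities and three relationships together, which is the kind of multi-way interaction the abstract says pairwise graphs cannot express.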

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our HyperGLM framework supports Video Scene Graph Generation, Anticipation, and Reasoning. HyperGLM constructs scene graphs from observed video frames and predicts relationships in unseen frames by leveraging a unified hypergraph for temporal modeling and comprehensive understanding.
  • Figure 2: (a) A simple way to model the temporal transition is to use two scene graphs $\mathbf{G_{t}}$ and $\mathbf{G_{t + 1}}$. (b) Alternatively, a procedural graph can represent this temporal modeling. (c) Our unified HyperGraph integrates both the entity scene graph, which captures spatial relationships, and the procedural graph, which models temporal evolution. A HyperEdge represents person sitting on couch, holding, then playing guitar, whereas holding $\rightarrow$ playing describes a chain of interactions. The HyperGraph is presented in 3D.
  • Figure 3: Our HyperGLM framework comprises an image encoder, MLP projector, temporal aggregator, unified HyperGraph, and language model. It processes video frames by encoding each frame with the image encoder and MLP, extracting spatio-temporal features through image patch grids to generate $N$ spatial tokens per frame. The temporal aggregator compresses the $T \times N$ embeddings over time. The MLP projector then transforms these visual embeddings into the language feature space as frame tokens, which are interleaved with language tokens and fed into the large language model.
  • Figure 4: Our Video Scene HyperGraph, including entity graphs and a procedural graph, as defined in Sec. \ref{sec:vsh}. Blue nodes represent entities, while green nodes denote relationships. The entity graph captures spatial relationships (subject$\multimap$relationship$\multimap$object), whereas the procedural graph models relationship transitions ($\rightarrow$). Hyperedges are visualized as polygons, encapsulating interactions through chains of relationships. For instance, a hyperedge illustrates a person picking up, holding, opening, and reading a book while sitting on a couch. The HyperGraph is presented in 3D; see the supplementary video.
  • Figure 5: An example of the diversified context within the streaming dialog in our VSGR dataset. Best viewed in color and zoomed in.
  • ...and 3 more figures
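
The Figure 3 caption says the temporal aggregator compresses the $T \times N$ frame embeddings over time, but this page does not say how. The sketch below only illustrates the shape bookkeeping, using window-wise average pooling as a stand-in. The pooling rule, the `window` parameter, and the function name `temporal_aggregate` are assumptions made for illustration, not details taken from the paper, and the image encoder and MLP projector are omitted.

```python
def temporal_aggregate(frame_tokens, window):
    """Average each group of `window` consecutive frames, token by token.

    frame_tokens: list of T frames, each a list of N token vectors (lists of floats).
    Returns ceil(T / window) frames, each still holding N token vectors.
    NOTE: average pooling is an assumed stand-in, not the paper's aggregator.
    """
    pooled_frames = []
    for start in range(0, len(frame_tokens), window):
        group = frame_tokens[start:start + window]
        n_tokens = len(group[0])
        dim = len(group[0][0])
        pooled = [
            [sum(frame[n][d] for frame in group) / len(group) for d in range(dim)]
            for n in range(n_tokens)
        ]
        pooled_frames.append(pooled)
    return pooled_frames


# T = 4 frames, N = 2 spatial tokens per frame, embedding dimension 3.
# Every token in frame t is filled with the value t so the pooling is easy to check.
T, N, DIM = 4, 2, 3
frames = [[[float(t)] * DIM for _ in range(N)] for t in range(T)]

compressed = temporal_aggregate(frames, window=2)
tokens_before = T * N                          # 8 visual tokens before aggregation
tokens_after = len(compressed) * N             # 4 visual tokens after aggregation
```

With `window=2`, the four frames collapse into two, so the language model would receive half as many visual tokens while each token position still summarizes the same image patch over time.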