HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

Trong-Thuan Nguyen, Pha Nguyen, Khoa Luu

TL;DR

This work tackles Visual Interactivity Understanding in video by introducing ASPIRe, a richly annotated dataset with five interactivity descriptors, and the Hierarchical Interlacement Graph (HIG), a unified hierarchical GNN framework that models time-evolving interlacements between subjects. The method combines multi-level graph representations with a message-passing scheme and level-wise Focal Loss training. It predicts interactivities for all subject pairs via $I(S_i,S_j)=\mathcal{C}(m^{(L)}_1(S_i,S_j),\mathcal{F}^{(L)}_1(S_i))$, with feature updates such as $\mathcal{F}^{(l)}_t(S_i)=\sum_{S_j\in\mathcal{N}(S_i)}\mathcal{F}^{(l-1)}_t(S_j)$. Key contributions include the ASPIRe dataset (57K+ subjects across seven sources, 1,488 videos, 4,549 interactivities) and the HIG framework with hierarchical weight sharing and sequential unfreezing. HIG achieves state-of-the-art results on ASPIRe and competitive performance on PSG, which supports both the dataset quality and the model's efficacy. The work advances practical scene understanding in videos by enabling detailed, temporally aware interactivity reasoning across five descriptor types, with potential impact on tasks such as video reasoning, grounding, and human-robot interaction.
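The two ingredients named in the summary, the neighbor-sum feature update and Focal Loss, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the 0/1 adjacency encoding of $\mathcal{N}(S_i)$, the function names, and the focal-loss hyperparameters (`gamma=2.0`, `alpha=0.25`, the common defaults from the original Focal Loss formulation) are assumptions made for illustration only.

```python
import numpy as np


def neighbor_sum_update(features, adjacency):
    """One level of F^(l)(S_i) = sum over S_j in N(S_i) of F^(l-1)(S_j).

    features:  (num_subjects, dim) array holding F^(l-1).
    adjacency: (num_subjects, num_subjects) 0/1 array; adjacency[i, j] == 1
               means S_j is in N(S_i). This encoding is an assumption.
    """
    # A matrix product with a 0/1 adjacency sums the neighbors' features.
    return adjacency @ features


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss averaged over all entries.

    gamma and alpha are assumed defaults, not values taken from the paper.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability assigned to the true class of each entry.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma down-weights entries that are already well classified.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


# Toy example: three subjects with 2-D features; S_0 neighbors S_1 and S_2.
features = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
updated = neighbor_sum_update(features, adjacency)
```

In this toy graph, the updated feature of $S_0$ is the sum of the features of $S_1$ and $S_2$, while $S_1$ and $S_2$ each receive the feature of $S_0$. A full model would apply such an update at every hierarchy level and feed the final features to the classifier $\mathcal{C}$, which is omitted here.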

Abstract

Visual interactivity understanding within visual scenes presents a significant challenge in computer vision. Existing methods focus on complex interactivities while leveraging a simple relationship model. These methods, however, struggle with a diversity of appearance, situation, position, interaction, and relation in videos. This limitation hinders the ability to fully comprehend the interplay within the complex visual dynamics of subjects. In this paper, we delve into interactivities understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects. To achieve this goal, we first present a new dataset containing Appearance-Situation-Position-Interaction-Relation predicates, named ASPIRe, offering an extensive collection of videos marked by a wide range of interactivities. Then, we propose a new approach named Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An example from our ASPIRe dataset for Visual Interactivity Understanding. The top row shows keyframes with the bounding boxes. Appearance, Situation, Position, Interaction, and Relation are attributes presented in the dataset. Best viewed in color.
  • Figure 2: Example and annotations in our ASPIRe dataset. Best viewed in color and zoomed in.
  • Figure 3: Statistics from the proposed ASPIRe dataset.
  • Figure 4: The terminologies used in our proposed ASPIRe dataset and Hierarchical Interlacement Graph.
  • Figure 5: Our proposed Hierarchical Interlacement Graph. The highlighted attributes denote the temporal changes in the graph. Then, all predicted interactivities are accumulated into the next hierarchy level. A higher-level graph cell covers a bigger portion of video frames.
  • ...and 1 more figure