HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity Scene Graph Generation in Videos

Naga VS Raviteja Chappa, Pha Nguyen, Thi Hoang Ngan Le, Khoa Luu

TL;DR

This work introduces a GASG dataset that extends the JRDB dataset with nuanced annotations covering appearance, interaction, position, relationship, and situation attributes, and proposes a Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance.

Abstract

Group Activity Scene Graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional Video Scene Graph Generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich scene understanding, we introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving Appearance, Interaction, Position, Relationship, and Situation attributes. This work also introduces an innovative approach, the Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance. Flow-Attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional "values" and "keys" are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed Flow-Attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.
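
The abstract describes Flow-Attention only in words: flow conservation, competition among sources, and allocation to sinks. The paper's own equations are not reproduced on this page, so the block below is a minimal sketch that assumes a Flowformer-style formulation of flow attention instead. The function name `flow_attention`, the sigmoid feature map, and the mapping of queries to sinks and keys/values to sources are all assumptions made for illustration and may differ from the HAtt-Flow implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def col_sum(rows):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(col) for col in zip(*rows)]

def flow_attention(queries, keys, values):
    """Non-causal flow-style attention over lists of vectors (hypothetical helper)."""
    # Assumption: a sigmoid feature map keeps every flow strictly positive.
    q = [[sigmoid(x) for x in row] for row in queries]  # sinks, n x d
    k = [[sigmoid(x) for x in row] for row in keys]     # sources, m x d

    # Raw incoming flow per sink and outgoing flow per source.
    k_sum, q_sum = col_sum(k), col_sum(q)
    incoming = [dot(qi, k_sum) for qi in q]
    outgoing = [dot(kj, q_sum) for kj in k]

    # Conservation: recompute each side after normalizing the other side's flow.
    k_norm_sum = col_sum([[x / o for x in kj] for kj, o in zip(k, outgoing)])
    q_norm_sum = col_sum([[x / i for x in qi] for qi, i in zip(q, incoming)])
    conserved_in = [dot(qi, k_norm_sum) for qi in q]
    conserved_out = [dot(kj, q_norm_sum) for kj in k]

    # Competition: sources are reweighted by a softmax over conserved outgoing flow.
    competition = softmax(conserved_out)
    v_hat = [[w * x for x in vj] for vj, w in zip(values, competition)]

    # Aggregation: build the d x d_v summary sum_j k_j^T v_hat_j once (linear in m).
    d, dv = len(k[0]), len(values[0])
    kv = [[sum(kj[a] * vj[b] for kj, vj in zip(k, v_hat)) for b in range(dv)]
          for a in range(d)]

    # Allocation: each sink gates its aggregated result by its conserved incoming flow.
    out = []
    for qi, i_raw, i_cons in zip(q, incoming, conserved_in):
        agg = [sum(qi[a] / i_raw * kv[a][b] for a in range(d)) for b in range(dv)]
        out.append([sigmoid(i_cons) * x for x in agg])
    return out

# Toy inputs: 3 sinks (queries), 2 sources (keys/values), feature size 2, value size 3.
Q = [[0.1, -0.2], [0.3, 0.4], [0.0, 1.0]]
K = [[0.5, 0.5], [-1.0, 0.2]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
result = flow_attention(Q, K, V)
```

Because the softmax competition forces sources to share a fixed budget, no source can dominate without suppressing the others, which is one way to read the abstract's claim about avoiding trivial attention. The key-value summary is computed once and reused for every sink, so the cost grows linearly with the number of sources rather than quadratically.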

Paper Structure

This paper contains 16 sections, 12 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: A sample video from our Group Activity Scene Graph (GASG) Dataset. The top row displays keyframes featuring overlaid bounding boxes, each annotated with a unique ID for consistency. Below, the timeline tubes provide a comprehensive temporal representation of scene graph annotations for distinct attributes, including Appearance, Interaction, Position, Relationship, and Situation. These annotations offer nuanced details, enhancing scene understanding and contributing to a more refined video content analysis. Best viewed in color and zoomed.
  • Figure 2: Comparison of HAtt-Flow result with other Scene Graph Generation methods. Best viewed in color and zoomed.
  • Figure 3: Statistics of the GASG dataset, number of social groups, and the attributes in the dataset. Best viewed in color and zoomed.
  • Figure 4: Overall architecture of the proposed HAtt-Flow network. The extracted visual and textual features are passed through their respective graph transformers to obtain the corresponding node features. These nodes are then passed through hierarchy-aware transformer encoders, which include a feature flow attention mechanism to enhance cross-modality learning, to obtain enriched features. Finally, we use a CLIP loss to optimize the learned features. Please refer to Figure \ref{fig:motivation} for the details of levels $\mathbf{L}_{0}$, $\mathbf{L}_{1}$, $\mathbf{L}_{2}$ and $\mathbf{L}_{3}$.
  • Figure 5: Visualization of the scene graphs generated by PSGFormer [yang2022panoptic] and by our method. PSGFormer detects only the subjects, but not accurate groups or their interactions. In contrast, HAtt-Flow is accurate in both graph generation and overall group activity prediction. Best viewed in color and zoomed.
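
The Figure 4 caption states that a CLIP loss is used to optimize the learned visual and textual features, without giving details here. As a rough illustration, the sketch below implements the standard symmetric contrastive (CLIP-style) objective over paired feature vectors. The function name `clip_loss`, the temperature of 0.07, and the assumption that the i-th visual feature is paired with the i-th textual feature are illustrative choices, not details confirmed by this page.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(xs):
    # Numerically stable log-softmax via the log-sum-exp trick.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def clip_loss(visual, textual, temperature=0.07):
    """Symmetric contrastive loss over N paired feature vectors (hypothetical helper)."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)

    # Cosine similarities scaled by the temperature: logits[i][j] = <v_i, t_j> / tau.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]

    # Visual-to-text direction: row i should pick out column i.
    loss_v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-visual direction: column j should pick out row j.
    loss_t2v = -sum(
        log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)
    ) / n

    return 0.5 * (loss_v2t + loss_t2v)

# Toy check: matched pairs should score a lower loss than mismatched pairs.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(visual_feats, [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss(visual_feats, [[0.0, 1.0], [1.0, 0.0]])
```

Averaging the two directions treats the modalities symmetrically, so neither the visual nor the textual branch is privileged as the retrieval query.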