From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos

Tanqiu Qiao, Ruochen Li, Frederick W. B. Li, Hubert P. H. Shum

TL;DR

This work addresses video-based HOI recognition with CATS, a category-to-scenery framework. CATS first learns category-specific geometric features and fuses them with the corresponding visual features. It then models inter-category interactions with a scenery interactive graph built on a Graph Attention Network, and captures temporal dependencies with Bi-GRUs. Category-aware feature learning and hierarchical fusion are intended to reduce misalignment between geometric and visual features, for both multi-person and single-person HOIs. CATS achieves state-of-the-art results on MPHOI-72 and CAD-120, with improved segmentation and labeling of sub-activities on both datasets. The design combines multi-category, multi-modality fusion, attention-based graph reasoning, and temporal sub-event modeling.

Abstract

Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects, which are essential for a comprehensive understanding of human behavior and intentions. While previous work has made significant strides, effectively integrating geometric and visual features to model dynamic relationships between humans and objects in a graph framework remains a challenge. In this work, we propose a novel end-to-end category-to-scenery framework, CATS, which first generates geometric features for each category through a separate graph and then fuses them with the corresponding visual features. Subsequently, we construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories. This methodological advance facilitates a deeper, more structured comprehension of interactions, bridging category-specific insights with broad scenery dynamics. Our method demonstrates state-of-the-art performance on two HOI benchmarks: the MPHOI-72 dataset for multi-person HOIs and the CAD-120 dataset for single-person HOIs.
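
The two stages described above (per-category fusion, then attention over a scenery graph) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes fusion by simple concatenation, a single attention head, a fully connected graph with self-loops, and toy feature sizes. The names `fuse` and `graph_attention`, the node labels, and all dimensions are hypothetical, and the temporal Bi-GRU stage is omitted.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fuse(geometric, visual):
    # Assumption: fusion is plain concatenation of the two feature vectors.
    return geometric + visual


def graph_attention(nodes, W, a):
    """One single-head graph attention pass over a fully connected graph.

    nodes: list of fused feature vectors, one per human/object category.
    W:     shared linear projection applied to every node.
    a:     attention vector scoring each concatenated pair [h_i, h_j].
    Returns the updated node features and the attention weight matrix.
    """
    h = [matvec(W, x) for x in nodes]
    n, d = len(h), len(h[0])
    updated, weights = [], []
    for i in range(n):
        # Unnormalised score for every neighbour j (self-loop included).
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, h[i] + h[j])))
            for j in range(n)
        ]
        # Numerically stable softmax over the neighbours of node i.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha = [e / z for e in exps]
        weights.append(alpha)
        # New feature of node i: attention-weighted sum of projected neighbours.
        updated.append(
            [sum(alpha[j] * h[j][k] for j in range(n)) for k in range(d)]
        )
    return updated, weights


# Toy scene for one frame: two humans and one object (hypothetical values).
geometric = {"human_1": [0.1, 0.4], "human_2": [0.3, 0.2], "object": [0.9, 0.5]}
visual = {"human_1": [0.7, 0.0], "human_2": [0.6, 0.1], "object": [0.2, 0.8]}

names = list(geometric)
nodes = [fuse(geometric[k], visual[k]) for k in names]

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 4 -> 3 dims
a = [rng.uniform(-1, 1) for _ in range(6)]  # scores a pair of 3-dim vectors

updated, weights = graph_attention(nodes, W, a)
for name, row in zip(names, weights):
    print(name, [round(w, 3) for w in row])
```

Each printed row shows how strongly one category attends to every category in the scene, itself included. In a full model, a pass like this would run per frame, and the resulting node features would feed a temporal module such as a Bi-GRU.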

Paper Structure (24 sections, 5 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of our end-to-end framework CATS. We first learn geometric features via a graph for human and object categories, fusing them with corresponding visual features. Subsequently, a scenery interactive graph is constructed to deeply understand the interaction dynamics among multiple categories.
  • Figure 2: The process of learning and fusing geometric and visual features for human and object categories.
  • Figure 3: Visualization of segmentation on MPHOI-72 for Cheering activity. Red dashed boxes highlight major segmentation errors.
  • Figure 4: Visualization of segmentation on MPHOI-72 for Hair cutting activity. Red dashed boxes highlight major segmentation errors.
  • Figure 5: Visualization of segmentation on CAD-120 for Cleaning objects activity. Red dashed boxes highlight major segmentation errors.
  • ...and 1 more figure