Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Xiaoyang Liu, Boran Wen, Xinpeng Liu, Zizheng Zhou, Hongwei Fan, Cewu Lu, Lizhuang Ma, Yulong Chen, Yong-Lu Li

TL;DR

This work introduces Grounding Interacted Objects (GIO), a large open-world spatio-temporal HOI dataset built on AVA with 1,098 object classes and 290K object boxes, to study interacted-object grounding in third-view videos. It reframes object grounding as a 4D question-answering task (4D-QA) that uses SAM-generated proposals and a multi-modal 4D human-object layout to ground objects conditioned on a person’s interaction. The method demonstrates that fusing 2D context, 3D scene information, and HOI priors via a 4D-QA framework yields superior grounding performance over baselines, highlighting the importance of open-world object representations and temporal cues for HOI understanding. The dataset and approach collectively push toward deeper activity understanding and more robust open-world grounding for practical applications in surveillance, robotics, and human-centric AI.

Abstract

Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse: they usually provide only a limited, predefined set of object classes. Therefore, we introduce a new open-world benchmark, Grounding Interacted Objects (GIO), comprising 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, we propose an object grounding task that expects vision systems to discover interacted objects. Although today's detectors and grounding methods have been highly successful, they perform unsatisfactorily when localizing the diverse and rare objects in GIO. This reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. In extensive experiments, our method shows significant superiority over current baselines. Data and code will be publicly available at https://github.com/DirtyHarryLYL/HAKE-AVA.

Paper Structure (30 sections, 7 equations, 16 figures, 5 tables, 2 algorithms)

Figures (16)

  • Figure 1: In daily HOIs, we interact with diverse objects through a limited set of actions. To this end, we build GIO on AVA, annotating 1,000+ object classes with a long-tailed open-world object distribution to advance the study of HOI. Based on GIO, we propose an open-world interacted object grounding task, as shown in the right figure. Purple boxes indicate persons and green boxes indicate the grounded objects.
  • Figure 2: The overview of our 4D-QA. It utilizes a 4D question-answering paradigm to effectively locate the interacted objects.
  • Figure 3: SAM-based candidate generation.
  • Figure 4: Visualization of interacted object grounding. We also list the reconstructions.
  • Figure 5: Fine-grained performance analysis.
  • ...and 11 more figures