EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views

Yuhang Yang; Wei Zhai; Chengfeng Wang; Chengjun Yu; Yang Cao; Zheng-Jun Zha

TL;DR

Understanding egocentric HOI requires localizing interactions in 3D space, but egocentric views often provide only incomplete observations of the interacting parties. EgoChoir addresses this by harmonizing visual appearance, head motion, and 3D object geometry through parallel cross-attention with gradient modulation to infer 3D human contact and object affordance. The approach draws on object interaction concepts and subject intentions to estimate interaction regions across diverse egocentric scenarios, and it is supported by 3D contact and affordance annotations for videos collected from Ego-Exo4D and GIMO. Experiments on this data show that EgoChoir outperforms the compared methods, and ablation studies support the contribution of each component. The work targets 3D HOI understanding from egocentric perspectives for embodied AI and related applications.

Abstract

Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception, facilitating applications like AR/VR and embodied AI. For egocentric HOI, in addition to perceiving semantics, e.g., "what" interaction is occurring, capturing "where" the interaction specifically manifests in 3D space is also crucial, as it links perception and operation. Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view. However, incomplete observations of the interacting parties in the egocentric view introduce ambiguity between visual observations and interaction contents, impairing the efficacy of these methods. From the egocentric view, humans integrate the visual cortex, cerebellum, and brain to internalize their intentions and the interaction concepts of objects, allowing them to pre-formulate interactions and act even when interaction regions are out of sight. In light of this, we propose harmonizing the visual appearance, head motion, and 3D object to excavate the object interaction concept and subject intention, jointly inferring 3D human contact and object affordance from egocentric videos. To achieve this, we present EgoChoir, which links object structures with the interaction contexts inherent in appearance and head motion to reveal object affordance, and further utilizes the affordance to model human contact. Additionally, gradient modulation is employed to adopt appropriate clues for capturing interaction regions across various egocentric scenarios. Moreover, 3D contact and affordance are annotated for egocentric videos collected from Ego-Exo4D and GIMO to support the task. Extensive experiments on these datasets demonstrate the effectiveness and superiority of EgoChoir.

Paper Structure (27 sections, 8 equations, 11 figures, 7 tables)

Introduction
Related Work
Method
Preliminaries
Modality-wise feature extraction
Modeling object affordance and human contact
Gradient modulation
Experiment
Experimental setup
Experimental results
Ablation study
Performance analysis
Discussion and conclusion
Implementation Details
Method details
...and 12 more sections

Figures (11)

Figure 1: EgoChoir takes egocentric frames and head motion from head-mounted devices, along with the 3D object, to capture 3D interaction regions, including human contact and object affordance. The human motion is visualized only to make the contact easier to observe; it is not used by EgoChoir.
Figure 2: The subject intention, conveyed through synergistic visual appearances and head movements, and the object interaction concept, revealed by the object's structure and functionality, together pre-formulate an interaction body image, which enables interaction regions to be envisioned.
Figure 3: Method. EgoChoir first employs modality-wise encoders to extract features, in which the motion encoder is pre-trained by minimizing the distance between visual disparity and motion disparity. Then, it takes them to excavate the object interaction concept and subject intention, modeling the affordance and contact through parallel cross-attention with gradient modulation.
Figure 4: Dataset Distribution. (a) The distribution of different interaction categories and objects in video clips. (b) Category distribution of 3D object affordance annotation. (c) Distribution of contact annotations on human body parts.
Figure 5: Annotation of 3D human contact and object affordance. (a) Contact annotation for data in Ego-Exo4D. (b) Contact annotation for the GIMO dataset, including calculations and manual refinement. (c) 3D object affordance annotation, where the red region denotes a higher interaction probability and the blue region indicates the adjacent propagable region.
...and 6 more figures
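
The Figure 3 caption mentions parallel cross-attention over the modality-wise features. As a rough illustration only, the sketch below shows one way object-geometry tokens could attend to appearance tokens and head-motion tokens in two parallel branches. It is not the authors' implementation: the function names, the toy feature values, the omission of learned projections, and the fusion by summation are all assumptions made for this example, and gradient modulation is not modeled because its details are not given here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query token.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def parallel_cross_attention(geometry, appearance, motion):
    """Hypothetical fusion: geometry tokens query both modalities in parallel.

    Summing the two branch outputs is an assumption for illustration.
    """
    out_app, _ = cross_attention(geometry, appearance, appearance)
    out_mot, _ = cross_attention(geometry, motion, motion)
    return [[a + m for a, m in zip(ra, rm)] for ra, rm in zip(out_app, out_mot)]

# Toy inputs (made up): 2 geometry tokens, 3 appearance tokens, 1 motion token.
geometry = [[1.0, 0.0], [0.0, 1.0]]
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion = [[0.5, 0.5]]

fused = parallel_cross_attention(geometry, appearance, motion)
print(len(fused), len(fused[0]))
```

Running the two branches independently and then combining them keeps each modality's contribution separable, which is the property a per-branch modulation scheme would rely on.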
