PEAR: Phrase-Based Hand-Object Interaction Anticipation

Zichen Zhang; Hongchen Luo; Wei Zhai; Yang Cao; Yu Kang

TL;DR

This work tackles the challenge of anticipating a complete hand-object interaction, including pre-contact intention and post-contact manipulation. It introduces PEAR, a four-module framework that uses image-phrase fusion, intention excavation, DEQ-based manipulation extraction, and probabilistic decoding with C-VAE decoders to jointly predict motion trends, hotspots, trajectories, and hand poses. A key contribution is the cross-alignment of verbs, nouns, and images to reduce intention uncertainty, paired with dynamic bidirectional constraints between intention and manipulation to mitigate manipulation uncertainty. The authors validate on the new EGO-HOIP dataset, achieving state-of-the-art results across multiple metrics and demonstrating robust, coherent predictions suitable for embodied AI and human-robot collaboration.

Abstract

First-person hand-object interaction anticipation aims to predict the interaction process over a forthcoming period based on current scenes and prompts. This capability is crucial for embodied intelligence and human-robot collaboration. The complete interaction process involves both pre-contact interaction intention (i.e., hand motion trends and interaction hotspots) and post-contact interaction manipulation (i.e., manipulation trajectories and hand poses with contact). Existing research typically anticipates only interaction intention while neglecting manipulation, resulting in incomplete predictions and an increased likelihood of intention errors due to the lack of manipulation constraints. To address this, we propose a novel model, PEAR (Phrase-Based Hand-Object Interaction Anticipation), which jointly anticipates interaction intention and manipulation. To handle uncertainties in the interaction process, we employ a twofold approach. Firstly, we perform cross-alignment of verbs, nouns, and images to reduce the diversity of hand movement patterns and object functional attributes, thereby mitigating intention uncertainty. Secondly, we establish bidirectional constraints between intention and manipulation using dynamic integration and residual connections, ensuring consistency among elements and thus overcoming manipulation uncertainty. To rigorously evaluate the performance of the proposed model, we collect a new task-relevant dataset, EGO-HOIP, with comprehensive annotations. Extensive experimental results demonstrate the superiority of our method.

Paper Structure (27 sections, 18 equations, 10 figures, 4 tables)

This paper contains 27 sections, 18 equations, 10 figures, 4 tables.

Introduction
Related Work
Hand-Object Interaction Understanding
Hand-Object Interaction Prediction
Hand-Object Interaction Generation
Method
Image-Phrase Fusion Module
Interaction Intention Excavation Module
Interaction Manipulation Extraction Module
Probabilistic Modeling Prediction Module
Loss Functions
Datasets
Dataset Collection
Dataset Annotation
Statistic Analysis
...and 12 more sections

Figures (10)

Figure 1: Given an image of the pre-interaction scenario and a phrase, PEAR anticipates the hand-object interaction process over a period of time, including interaction intention (i.e., hand motion trends and interaction hotspots) and interaction manipulation (i.e., manipulation trajectories and hand poses with contact).
Figure 2: Motivation. We address both intention uncertainty and manipulation uncertainty with specific solutions. (a) We reduce the scope of intention elements by cross-aligning nouns, verbs, and images, thereby overcoming intention uncertainty. (b) We derive manipulation elements through the dynamic integration of intention elements, simultaneously refining the initial intention via manipulation, thereby mitigating manipulation uncertainty.
Figure 3: PEAR pipeline. The proposed model takes an image and a phrase as inputs to anticipate future interaction elements. It consists of four components: an image-phrase fusion module, an interaction intention excavation module, an interaction manipulation extraction module, and a probabilistic modeling prediction module.
Figure 4: Details of PEAR's structure. (a) DEQ Extraction model is a crucial part of the Interaction Manipulation Extraction Module. This model takes the hand motion feature and the interaction hotspots feature as inputs, producing a fused feature in the equilibrium state, which functions as the manipulation feature. (b) In the Probabilistic Modeling Prediction Module, both hand motion trends and manipulation trajectories utilize chain-structured C-VAEs. The hand pose decoder incorporates the parametric MANO model to enhance the robustness of predictions.
Figure 5: EGO-HOIP Dataset. (a) We detail the data and annotations in our dataset. Each sample comprises an interaction prompt phrase and a pre-interaction image representing the interaction scenario. We automatically generate annotations for hand trajectories both before and after contact, as well as for 3D hand poses. Additionally, we manually annotate interaction hotspots and the hand contact. (b) shows examples of verb-noun pairings in the phrases. (c) and (d) are word clouds for nouns and verbs, respectively, demonstrating the diversity of interaction prompts.
...and 5 more figures
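
The caption of Figure 4(a) describes a DEQ Extraction model that takes the hand motion feature and the interaction hotspots feature and returns a fused feature "in the equilibrium state". The sketch below only illustrates that general idea of a deep equilibrium (fixed-point) layer; it is not the authors' implementation. The update rule `z = tanh(W z + U_hand h + U_hot o)`, the feature dimension, the random weights, and every name in the code are assumptions made for illustration, and the recurrent weight is rescaled to spectral norm 0.5 purely so that plain fixed-point iteration is guaranteed to converge.

```python
import numpy as np

def deq_fuse(hand_feat, hotspot_feat, W, U_hand, U_hot, tol=1e-6, max_iter=500):
    """Iterate z <- tanh(W z + U_hand h + U_hot o) until z stops changing.

    Hypothetical update rule: the summary does not specify the layer
    used inside the paper's DEQ Extraction model.
    """
    # The two input features are injected at every iteration.
    injection = U_hand @ hand_feat + U_hot @ hotspot_feat
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + injection)
        if np.linalg.norm(z_next - z) < tol:
            return z_next  # approximate equilibrium reached
        z = z_next
    return z  # best estimate if the tolerance was not met

# Toy setup with assumed sizes and random weights.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5 makes the map a contraction
U_hand = rng.standard_normal((d, d))
U_hot = rng.standard_normal((d, d))
hand_feat = rng.standard_normal(d)     # stand-in for the hand motion feature
hotspot_feat = rng.standard_normal(d)  # stand-in for the interaction hotspots feature

z_star = deq_fuse(hand_feat, hotspot_feat, W, U_hand, U_hot)

# At equilibrium, applying the layer once more should leave z_star unchanged.
residual = np.linalg.norm(
    z_star - np.tanh(W @ z_star + U_hand @ hand_feat + U_hot @ hotspot_feat)
)
```

In this toy setting `z_star` plays the role of the manipulation feature: it depends on both inputs at once, and the small `residual` confirms that it is a fixed point of the assumed layer. Practical DEQ models typically use faster root-finding solvers and implicit differentiation for training, neither of which is shown here.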
