PEAR: Phrase-Based Hand-Object Interaction Anticipation
Zichen Zhang, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang
TL;DR
This work tackles the challenge of anticipating a complete hand-object interaction, including pre-contact intention and post-contact manipulation. It introduces PEAR, a four-module framework that uses image-phrase fusion, intention excavation, DEQ-based manipulation extraction, and probabilistic decoding with C-VAE decoders to jointly predict motion trends, hotspots, trajectories, and hand poses. A key contribution is the cross-alignment of verbs, nouns, and images to reduce intention uncertainty, paired with dynamic bidirectional constraints between intention and manipulation to mitigate manipulation uncertainty. The authors validate on the new EGO-HOIP dataset, achieving state-of-the-art results across multiple metrics and demonstrating robust, coherent predictions suitable for embodied AI and human-robot collaboration.
Abstract
First-person hand-object interaction anticipation aims to predict the interaction process over a forthcoming period based on current scenes and prompts. This capability is crucial for embodied intelligence and human-robot collaboration. The complete interaction process involves both pre-contact interaction intention (i.e., hand motion trends and interaction hotspots) and post-contact interaction manipulation (i.e., manipulation trajectories and hand poses with contact). Existing research typically anticipates only interaction intention while neglecting manipulation, resulting in incomplete predictions and an increased likelihood of intention errors due to the lack of manipulation constraints. To address this, we propose a novel model, PEAR (Phrase-Based Hand-Object Interaction Anticipation), which jointly anticipates interaction intention and manipulation. To handle uncertainties in the interaction process, we employ a twofold approach. Firstly, we perform cross-alignment of verbs, nouns, and images to reduce the diversity of hand movement patterns and object functional attributes, thereby mitigating intention uncertainty. Secondly, we establish bidirectional constraints between intention and manipulation using dynamic integration and residual connections, ensuring consistency among elements and thus overcoming manipulation uncertainty. To rigorously evaluate the performance of the proposed model, we collect a new task-relevant dataset, EGO-HOIP, with comprehensive annotations. Extensive experimental results demonstrate the superiority of our method.
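The TL;DR mentions DEQ-based manipulation extraction. As a rough illustration of the general idea behind a deep equilibrium (DEQ) layer, and not the authors' architecture, the sketch below finds a fixed point z* = tanh(W z* + x) by plain forward iteration. The weight matrix `W`, the input `x`, the tanh update, and the function name `fixed_point_layer` are all hypothetical choices made for this example; practical DEQ models typically use faster root solvers and implicit differentiation, which are omitted here.

```python
import math


def fixed_point_layer(W, x, tol=1e-8, max_iter=500):
    """Iterate z <- tanh(W z + x) until successive iterates differ by < tol.

    W is a square matrix (list of rows) and x is the input vector.
    Hypothetical toy update rule, assumed only for illustration.
    """
    z = [0.0] * len(x)
    steps = 0
    for steps in range(1, max_iter + 1):
        z_next = [
            math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + x_i)
            for row, x_i in zip(W, x)
        ]
        gap = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if gap < tol:
            break
    return z, steps


# Toy weights: every row's absolute sum is below 1, so the update is a
# contraction in the max norm and the iteration converges.
W = [
    [0.2, -0.1, 0.0],
    [0.1, 0.3, -0.2],
    [0.0, 0.1, 0.25],
]
x = [0.5, -0.3, 0.8]

z_star, steps = fixed_point_layer(W, x)

# Residual of the equilibrium condition z* = tanh(W z* + x).
residual = max(
    abs(z_i - math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z_star)) + x_i))
    for row, x_i, z_i in zip(W, x, z_star)
)
print("steps:", steps, "residual:", residual)
```

The point of the sketch is that the output is defined implicitly as an equilibrium of a single update rule rather than as the result of a fixed stack of layers; how PEAR couples this with intention features is not specified in the abstract and is not modeled here.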
