EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

Irving Fang, Yuzhong Chen, Yifan Wang, Jianghan Zhang, Qiushi Zhang, Jiali Xu, Xibo He, Weibo Gao, Hao Su, Yiming Li, Chen Feng

TL;DR

This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction, and substantially enhances the baseline algorithm by introducing a large pre-trained model and human prior knowledge.

Abstract

A robot's ability to anticipate the 3D action target location of a hand's movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.

Paper Structure (27 sections, 3 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Real-World Demonstration of EgoPAT3Dv2. A human wearing a helmet camera manipulates objects in a shared workspace with a UR10E cobot. The cobot tries to reach the anticipated 3D action target with the shortest Cartesian path.
  • Figure 2: Algorithm Workflow. Visual and hand features are extracted from RGB images and fused with an MLP. The fused feature is fed into an LSTM to produce an initial prediction, which is then adjusted by a post-processing step that encodes our prior knowledge about manipulation. Note that both the LSTM and the post-processing step rely on information from previous frames.
  • Figure 3: Distribution of clip lengths in the EgoPAT3D and EgoPAT3Dv2 datasets.
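
The data flow described in the Figure 2 caption (fuse per-frame features, run a recurrent step, then adjust using earlier frames) can be illustrated with a minimal, untrained sketch. This is not the authors' implementation: the function names (`fuse`, `lstm_step`, `predict_targets`), the layer sizes, the random weights, the omission of bias terms, and above all the post-processing rule are assumptions made for illustration. The caption does not say how the manipulation prior is applied, so the sketch substitutes simple exponential smoothing toward the previous frame's prediction as a stand-in.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fuse(W_fuse, visual_feat, hand_feat):
    """Single-layer MLP with ReLU over the concatenated features (biases omitted)."""
    return [max(0.0, v) for v in matvec(W_fuse, visual_feat + hand_feat)]


def lstm_step(W_lstm, x, h, c):
    """One standard LSTM cell update (biases omitted for brevity)."""
    n = len(h)
    z = matvec(W_lstm, x + h)
    i, f, g, o = (z[k * n:(k + 1) * n] for k in range(4))
    c_new = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
             for ik, fk, gk, ck in zip(i, f, g, c)]
    h_new = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def predict_targets(visual_feats, hand_feats, fused_dim=8, hidden_dim=16,
                    smooth=0.5, seed=0):
    """Return one (x, y, z) estimate per frame.

    visual_feats, hand_feats: per-frame feature vectors (lists of floats),
    standing in for whatever the feature extractors would produce.
    """
    rng = random.Random(seed)
    in_dim = len(visual_feats[0]) + len(hand_feats[0])
    W_fuse = rand_matrix(rng, fused_dim, in_dim)
    W_lstm = rand_matrix(rng, 4 * hidden_dim, fused_dim + hidden_dim)
    W_out = rand_matrix(rng, 3, hidden_dim)

    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    preds = []
    for visual, hand in zip(visual_feats, hand_feats):
        h, c = lstm_step(W_lstm, fuse(W_fuse, visual, hand), h, c)
        estimate = matvec(W_out, h)  # initial 3D prediction for this frame
        if preds:
            # Assumed stand-in for the prior-based post-processing:
            # blend with the previous frame's adjusted prediction.
            estimate = [smooth * p + (1.0 - smooth) * r
                        for p, r in zip(preds[-1], estimate)]
        preds.append(estimate)
    return preds
```

Calling `predict_targets` with five frames of toy features returns five 3D estimates. Because the weights are random, the values are meaningless; the sketch shows only how information from earlier frames enters through both the LSTM state and the post-processing step.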