EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

Irving Fang; Yuzhong Chen; Yifan Wang; Jianghan Zhang; Qiushi Zhang; Jiali Xu; Xibo He; Weibo Gao; Hao Su; Yiming Li; Chen Feng

TL;DR

This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction, and substantially enhances the baseline algorithm by introducing a large pre-trained model and human prior knowledge.

Abstract

A robot's ability to anticipate the 3D action target location of a hand's movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.

Paper Structure (27 sections, 3 equations, 3 figures, 2 tables)

Introduction
Related Work
Human-Robot Interaction with Egocentric Human Action Anticipation.
Egocentric Vision and Datasets.
Method
Overview
Problem Formulation and Notation
Improved Baseline Algorithm
RGB and Hand Feature Encoding
Online 3D Target Prediction
Post-Processing
Improved Loss Function
Hand Position Loss
Time Loss
Experiment
...and 12 more sections

Figures (3)

Figure 1: Real-World Demonstration of EgoPAT3Dv2. A human wearing a helmet camera manipulates objects in a shared workspace with a UR10E cobot. The cobot tries to reach the anticipated 3D action target with the shortest Cartesian path.
Figure 2: Algorithm Workflow. Visual and hand features are extracted from RGB images and fused with an MLP. The fused feature is fed into an LSTM to produce the initial prediction, which is then adjusted by post-processing that considers our prior knowledge about manipulation. Note that LSTM and Post-Processing both rely on previous frames' information.
Figure 3: Distribution of clip lengths in the EgoPAT3D and EgoPAT3Dv2 datasets.
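
The data flow described in the Figure 2 caption (per-frame RGB and hand features, MLP fusion, an LSTM carrying state across frames, then a post-processing step that also looks at earlier frames) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the weights are random and untrained, the feature vectors are placeholders rather than outputs of a pre-trained encoder, the names `TinyLSTMCell` and `TargetPredictor` and all dimensions are hypothetical, and the exponential smoothing at the end is only a stand-in for the paper's prior-knowledge post-processing, whose details are not given here.

```python
import math
import random

def linear(weights, bias, x):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_out, n_in):
    """Random, untrained parameters (illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class TinyLSTMCell:
    """Standard LSTM cell with one dense layer per gate over [x, h]."""
    def __init__(self, rng, n_in, n_hidden):
        self.gates = {name: init_layer(rng, n_hidden, n_in + n_hidden)
                      for name in ("i", "f", "o", "g")}

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        i = [sigmoid(z) for z in linear(*self.gates["i"], xh)]    # input gate
        f = [sigmoid(z) for z in linear(*self.gates["f"], xh)]    # forget gate
        o = [sigmoid(z) for z in linear(*self.gates["o"], xh)]    # output gate
        g = [math.tanh(z) for z in linear(*self.gates["g"], xh)]  # candidate state
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

class TargetPredictor:
    """Hypothetical pipeline: fuse features, run the LSTM, smooth the output."""
    def __init__(self, seed=0, rgb_dim=8, hand_dim=4, fused_dim=6, hidden_dim=5):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.fuse = init_layer(rng, fused_dim, rgb_dim + hand_dim)
        self.cell = TinyLSTMCell(rng, fused_dim, hidden_dim)
        self.head = init_layer(rng, 3, hidden_dim)  # maps hidden state to (x, y, z)

    def predict_clip(self, rgb_feats, hand_feats, smoothing=0.5):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        previous = None
        outputs = []
        for rgb, hand in zip(rgb_feats, hand_feats):
            # Single-layer MLP fusion with ReLU over the concatenated features.
            fused = [max(0.0, z) for z in linear(*self.fuse, rgb + hand)]
            h, c = self.cell.step(fused, h, c)
            raw = linear(*self.head, h)  # initial per-frame 3D prediction
            # Stand-in post-processing: blend with the previous frame's estimate.
            if previous is None:
                estimate = raw
            else:
                estimate = [smoothing * p + (1.0 - smoothing) * r
                            for p, r in zip(previous, raw)]
            outputs.append(estimate)
            previous = estimate
        return outputs

# Usage with placeholder features for a four-frame clip.
feature_rng = random.Random(1)
rgb_feats = [[feature_rng.random() for _ in range(8)] for _ in range(4)]
hand_feats = [[feature_rng.random() for _ in range(4)] for _ in range(4)]
predictions = TargetPredictor(seed=0).predict_clip(rgb_feats, hand_feats)
```

The loop emits one 3D estimate per frame, which mirrors the online setting named in the section list: each estimate depends only on the current and earlier frames, through the LSTM state and the previous smoothed output.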