Hierarchical Deep Learning for Intention Estimation of Teleoperation Manipulation in Assembly Tasks
Mingyu Cai, Karankumar Patel, Soshi Iba, Songpo Li
TL;DR
This work tackles the problem of estimating human intention in teleoperation for assembly tasks by predicting a high-level task $T_t$ and a low-level action $A_t$ from online observations. It introduces a hierarchical deep model with a root backbone, task and action encoders, and a conditional action head, trained with a weighted loss $\mathrm{Loss}(\theta) = \alpha\,\mathrm{ELoss} + \beta\,\mathrm{DLoss}$ to enforce cross-level consistency. A multi-window masking strategy tailors the temporal horizon to each prediction level, and a vision-based extension uses a Slow-Fast backbone to handle egocentric video inputs. Experiments in a two-handed VR setup across six tasks with 202 demonstrations show that hierarchical designs improve accuracy over independent baselines for both motion and egocentric-vision inputs. The multi-window approach yields further gains, and online inference runs at roughly 2 Hz.
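The weighted loss in the summary above can be illustrated with a minimal sketch. The summary does not define ELoss or DLoss, so the choices here are assumptions made purely for illustration: ELoss is taken to be the sum of per-level cross-entropies, and DLoss is a 0/1 penalty applied when the predicted action does not belong to the predicted task. The names `hierarchical_loss` and `action_to_task` are hypothetical and do not come from the paper.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under a probability vector."""
    return -math.log(max(probs[label], 1e-12))


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def hierarchical_loss(task_probs, action_probs, task_label, action_label,
                      action_to_task, alpha=1.0, beta=1.0):
    """Weighted sum alpha * ELoss + beta * DLoss (both terms are assumed forms)."""
    # ELoss (assumption): cross-entropy at the task level plus the action level.
    e_loss = (cross_entropy(task_probs, task_label)
              + cross_entropy(action_probs, action_label))
    # DLoss (assumption): penalize a predicted action whose parent task
    # differs from the predicted task, i.e. a cross-level inconsistency.
    pred_task = argmax(task_probs)
    pred_action = argmax(action_probs)
    d_loss = 0.0 if action_to_task[pred_action] == pred_task else 1.0
    return alpha * e_loss + beta * d_loss


# Hypothetical hierarchy: actions 0 and 1 belong to task 0, action 2 to task 1.
action_to_task = [0, 0, 1]
consistent = hierarchical_loss([0.9, 0.1], [0.8, 0.1, 0.1], 0, 0, action_to_task)
inconsistent = hierarchical_loss([0.9, 0.1], [0.1, 0.1, 0.8], 0, 2, action_to_task)
```

In this toy setup the second call pays the extra $\beta$ penalty because the most likely action (2) belongs to task 1 while the most likely task is 0. A differentiable penalty would be needed for actual gradient-based training; the hard 0/1 form is used only to make the cross-level consistency idea concrete.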
Abstract
In human-robot collaboration, shared control presents an opportunity to teleoperate robotic manipulation and thereby improve the efficiency of manufacturing and assembly processes. Robots are expected to assist in executing the user's intentions. To this end, robust and prompt intention estimation based on behavioral observations is needed. This work presents an intention estimation technique at two hierarchical levels, i.e., low-level actions and high-level tasks, by incorporating multi-scale hierarchical information in neural networks. Technically, we employ a hierarchical dependency loss to boost overall accuracy. Furthermore, we propose a multi-window method that assigns an appropriate prediction window of input data to each hierarchical level. An analysis of the predictive power with various inputs demonstrates the superiority of the deep hierarchical model in terms of prediction accuracy and early intention identification. We implement the algorithm on a virtual reality (VR) setup to teleoperate robotic hands in simulation across various assembly tasks and show the effectiveness of online estimation.
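The multi-window idea can be sketched as follows, under stated assumptions: each prediction level sees only the most recent frames within its own window, and older frames are zeroed out. The zero-masking scheme, the function names, and the window lengths are hypothetical choices for illustration, since the abstract does not specify how the windows are realized or which level receives the longer one.

```python
def window_mask(seq_len, window):
    """Return a 0/1 mask that keeps only the most recent `window` time steps."""
    keep = max(0, min(window, seq_len))
    return [0] * (seq_len - keep) + [1] * keep


def apply_mask(frames, mask):
    """Zero out the feature vectors of masked time steps."""
    return [[x * m for x in frame] for frame, m in zip(frames, mask)]


def multi_window_inputs(frames, task_window, action_window):
    """Build one masked copy of the observation sequence per prediction level."""
    n = len(frames)
    task_input = apply_mask(frames, window_mask(n, task_window))
    action_input = apply_mask(frames, window_mask(n, action_window))
    return task_input, action_input


# Four time steps of a 2-D feature; the window lengths are arbitrary examples.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
task_in, action_in = multi_window_inputs(frames, task_window=4, action_window=2)
```

Here `task_in` keeps all four steps while `action_in` retains only the last two, so a shared model could condition each head on a different temporal horizon of the same observation stream.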
