Hierarchical Deep Learning for Intention Estimation of Teleoperation Manipulation in Assembly Tasks

Mingyu Cai; Karankumar Patel; Soshi Iba; Songpo Li

TL;DR

This work addresses human intention estimation in teleoperated assembly by predicting a high-level task $T_t$ and a low-level action $A_t$ from online observations. It introduces a hierarchical deep model with a root backbone, task and action encoders, and a conditional action head, trained with a weighted loss $\mathrm{Loss}(\theta) = \alpha\,\mathrm{ELoss} + \beta\,\mathrm{DLoss}$ to enforce cross-level consistency. A multi-window masking strategy tailors the temporal horizon to each prediction level, and a vision-based extension uses a Slow-Fast backbone to handle egocentric video inputs. In a two-hand VR setup with six tasks and 202 demonstrations, the hierarchical designs are more accurate than independent baselines for both motion and egocentric-vision inputs; the multi-window approach yields further gains, and online inference runs at around 2 Hz.
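The weighted loss can be illustrated with a minimal sketch. This page does not define ELoss or DLoss, so both forms below are assumptions: ELoss is taken as the sum of the task and action cross-entropies, and DLoss as the probability mass placed on task-action pairs that violate the hierarchy. The function names and the `allowed` set of valid (task, action) pairs are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under a probability vector."""
    return -math.log(max(probs[label], 1e-12))


def hierarchical_loss(task_probs, action_probs, task_label, action_label,
                      allowed, alpha=1.0, beta=1.0):
    """Weighted loss alpha * ELoss + beta * DLoss (both forms are assumptions).

    `allowed` is a hypothetical set of (task, action) index pairs that are
    consistent with the task-action hierarchy.
    """
    # ELoss (assumed): per-level classification error, summed over both levels.
    e_loss = (cross_entropy(task_probs, task_label)
              + cross_entropy(action_probs, action_label))

    # DLoss (assumed): probability mass on hierarchy-violating pairs,
    # treating the two predicted distributions as independent.
    d_loss = 0.0
    for t, p_t in enumerate(task_probs):
        for a, p_a in enumerate(action_probs):
            if (t, a) not in allowed:
                d_loss += p_t * p_a

    return alpha * e_loss + beta * d_loss


if __name__ == "__main__":
    allowed = {(0, 0), (0, 1), (1, 2)}  # hypothetical hierarchy
    loss = hierarchical_loss([0.7, 0.3], [0.6, 0.3, 0.1], 0, 0, allowed)
    print(f"loss = {loss:.4f}")
```

Under these assumptions, raising `beta` penalizes predictions whose action is unlikely under the predicted task, which is one way to read "cross-level consistency".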

Abstract

In human-robot collaboration, shared control presents an opportunity to teleoperate robotic manipulation to improve the efficiency of manufacturing and assembly processes. Robots are expected to assist in executing the user's intentions. To this end, robust and prompt intention estimation is needed, relying on behavioral observations. This work presents an intention estimation technique at two hierarchical levels, i.e., low-level actions and high-level tasks, by incorporating multi-scale hierarchical information in neural networks. Technically, we employ a hierarchical dependency loss to boost overall accuracy. Furthermore, we propose a multi-window method that assigns an appropriate prediction window of input data to each hierarchical level. An analysis of the predictive power with various inputs demonstrates the advantage of the deep hierarchical model in terms of prediction accuracy and early intention identification. We implement the algorithm on a virtual reality (VR) setup to teleoperate robotic hands in a simulation with various assembly tasks to show the effectiveness of online estimation.
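The multi-window idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each level sees only the most recent steps of the same observation sequence, earlier steps are zeroed out, and the window lengths are free parameters. The helper names and the example window sizes are hypothetical.

```python
def window_mask(seq_len, window):
    """Return a 0/1 mask that keeps only the most recent `window` steps."""
    keep = max(0, min(window, seq_len))
    return [0] * (seq_len - keep) + [1] * keep


def apply_mask(sequence, mask):
    """Zero out masked steps while keeping the sequence length unchanged."""
    return [[x * m for x in frame] for frame, m in zip(sequence, mask)]


def multi_window_inputs(sequence, task_window, action_window):
    """Build one masked copy of the input per prediction level."""
    n = len(sequence)
    task_input = apply_mask(sequence, window_mask(n, task_window))
    action_input = apply_mask(sequence, window_mask(n, action_window))
    return task_input, action_input


if __name__ == "__main__":
    # Four time steps with one feature each; window sizes are illustrative.
    observations = [[1.0], [2.0], [3.0], [4.0]]
    task_in, action_in = multi_window_inputs(observations, task_window=3,
                                             action_window=1)
    print(task_in)
    print(action_in)
```

Keeping the sequence length fixed and masking, rather than slicing, lets both levels share one backbone input shape; whether the paper does exactly this is not stated on this page.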

Paper Structure (10 sections, 9 equations, 6 figures, 3 tables)

Introduction
Problem Formulation
Method
Deep Hierarchical Model
Multi-window Strategy
Vision-based Deep Hierarchical Model
Manipulation Assistive Control
Experimental Results
Discussion and Conclusion
Acknowledgment

Figures (6)

Figure 1: Experimental setup for data collection and model testing. The movements of the human operator's head, hands, and eye gaze are tracked via an HTC Vive virtual reality system. The top-left corner of the screen shows the scene from the operator's point of view, and the background scene shows the global view of a teleoperation process. Action and task estimation results are shown in the middle and top-right of the screen, respectively.
Figure 2: The task-action hierarchical deep learning model, including dependent loss functions and leaf layers conditioned on the embeddings from the root layer.
Figure 3: The hierarchical Slow-Fast model accepts only visual inputs, without feature extraction.
Figure 4: Toy assembly tasks with one of the instructions.
Figure 5: The x axis shows the predictions and the y axis the ground truth. Numbers represent the frequency with which samples of a certain class (row) were classified with the label on the corresponding
...and 1 more figure
