Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Azizul Zahid, Jie Fan, Farong Wang, Ashton Dy, Sai Swaminathan, Fei Liu

TL;DR

This work tackles the problem of aligning human and robot actions in manipulation by pairing RGB videos of human demonstrations with voxelized RGB-D representations of robot scenes. It introduces a two-branch architecture: a ResNet+LSTM-based human intention model and a voxel-based Perceiver Transformer for robot action prediction, both trained on the RH20T pick-and-place dataset. The two branches reach comparable overall accuracy of about $71$–$72\%$, with the robot model generally outperforming the human model on a per-class basis, particularly for mid- and late-stage actions. The analysis attributes the weaker results on early-stage actions to temporal ambiguity and data imbalance. The paper proposes a semantic alignment score $S(H,R)$ to quantify cross-modal correspondence and outlines future work: jointly optimizing alignment, incorporating temporal voxel sequences, and integrating motion embeddings into a unified multimodal imitation-learning framework.

Abstract

Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
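To make "voxelized RGB-D space" concrete, the following is a minimal sketch of one common way to bin a colored point cloud (as back-projected from an RGB-D frame) into a dense grid with mean color and an occupancy flag per voxel. It is an illustration only, not the authors' pipeline: the function name `voxelize`, the workspace bounds, the grid resolution, and the four-channel feature layout are all assumptions, since this summary does not specify them.

```python
import numpy as np


def voxelize(points, colors, bounds_min, bounds_max, grid_size=4):
    """Bin an RGB point cloud into a dense voxel grid (hypothetical helper).

    points: (N, 3) xyz coordinates; colors: (N, 3) RGB values in [0, 1].
    Returns an array of shape (G, G, G, 4): mean RGB plus an occupancy flag.
    """
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)

    # Discard points that fall outside the assumed workspace bounds.
    inside = np.all((points >= bounds_min) & (points < bounds_max), axis=1)
    points, colors = points[inside], colors[inside]

    # Map each remaining point to an integer voxel index along every axis.
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    ix, iy, iz = idx[:, 0], idx[:, 1], idx[:, 2]

    # Accumulate color sums and point counts per voxel (handles repeats).
    shape = (grid_size, grid_size, grid_size)
    color_sum = np.zeros(shape + (3,))
    counts = np.zeros(shape)
    np.add.at(color_sum, (ix, iy, iz), colors)
    np.add.at(counts, (ix, iy, iz), 1.0)

    # Occupied voxels get their mean color and occupancy 1; the rest stay 0.
    grid = np.zeros(shape + (4,))
    occupied = counts > 0
    grid[occupied, :3] = color_sum[occupied] / counts[occupied][:, None]
    grid[occupied, 3] = 1.0
    return grid


# Toy usage: four points, one of which lies outside the unit-cube workspace.
demo_points = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.9, 0.9, 0.9], [1.5, 0.0, 0.0]]
demo_colors = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1]]
demo_grid = voxelize(demo_points, demo_colors, [0, 0, 0], [1, 1, 1], grid_size=2)
```

A grid like this can be flattened into a sequence of voxel tokens for a transformer-style encoder; how the paper's Perceiver Transformer actually consumes its voxel input is not described in this summary.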

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Human and Robot demonstration framework.
  • Figure 2: Training loss and validation accuracy trend for three different learning rates: 0.001 (blue), 0.0001 (orange), 0.00001 (green).
  • Figure 3: Confusion matrices for 8 intention-action classes of human/robot data.
  • Figure 4: Training and validation loss curves on voxelized robot action data.
  • Figure 5: Predicted softmax probability for each robot action class.