Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning
Azizul Zahid, Jie Fan, Farong Wang, Ashton Dy, Sai Swaminathan, Fei Liu
TL;DR
This work tackles the problem of aligning human and robot actions in manipulation by pairing RGB videos of human demonstrations with voxelized RGB-D representations of robot scenes. It introduces a two-branch architecture, trained on the RH20T pick-and-place dataset: a ResNet+LSTM human intention model and a voxel-based Perceiver Transformer for robot action prediction. Both branches reach comparable overall accuracy of roughly $71$–$72\%$, with the robot model generally outperforming the human model on per-class results, particularly for mid- and late-stage actions. The analysis attributes weaker performance on early-stage actions to temporal ambiguity and data imbalance. The paper proposes a semantic alignment score $S(H,R)$ to quantify cross-modal correspondence and outlines future work: jointly optimizing alignment, incorporating temporal voxel sequences, and integrating motion embeddings into a unified multimodal imitation-learning framework.
Abstract
Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video alongside robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling with a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
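To make "voxelized RGB-D space" concrete, the following is a minimal sketch of one common way to turn colored 3D points into a voxel grid. It is not the authors' pipeline: the function name `voxelize`, the workspace bounds, the grid size, and the choice to average colors per voxel are all assumptions made for illustration, and the depth image is assumed to have already been back-projected into `(x, y, z, r, g, b)` points.

```python
from collections import defaultdict


def voxelize(points, bounds_min, bounds_max, grid_size):
    """Bin colored 3D points into a cubic grid (hypothetical helper).

    points: iterable of (x, y, z, r, g, b) tuples.
    bounds_min, bounds_max: (x, y, z) corners of the assumed workspace box.
    grid_size: number of voxels along each axis.
    Returns {(i, j, k): (mean_r, mean_g, mean_b, count)} for occupied voxels.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for x, y, z, r, g, b in points:
        index = []
        for value, lo, hi in zip((x, y, z), bounds_min, bounds_max):
            if not (lo <= value <= hi):
                break  # point lies outside the workspace; skip it
            # Scale the coordinate to a cell index, clamping the upper edge.
            cell = int((value - lo) / (hi - lo) * grid_size)
            index.append(min(cell, grid_size - 1))
        else:
            # Reached only when all three coordinates were inside the bounds.
            acc = sums[tuple(index)]
            acc[0] += r
            acc[1] += g
            acc[2] += b
            acc[3] += 1
    return {
        key: (s[0] / s[3], s[1] / s[3], s[2] / s[3], s[3])
        for key, s in sums.items()
    }


# Toy example: three points inside a unit cube and one outside it.
demo_points = [
    (0.1, 0.1, 0.1, 1.0, 0.0, 0.0),
    (0.2, 0.2, 0.2, 0.0, 1.0, 0.0),
    (0.9, 0.9, 0.9, 0.0, 0.0, 1.0),
    (2.0, 0.0, 0.0, 1.0, 1.0, 1.0),  # outside the bounds, dropped
]
demo_grid = voxelize(demo_points, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 2)
```

A sparse dictionary keeps the sketch dependency-free; a transformer-based model such as the one described would more typically consume a dense tensor of voxel features, which could be filled from this mapping.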
