Imitation Learning with Precisely Labeled Human Demonstrations
Yilong Song
TL;DR
This paper tackles data scarcity in generalist robot imitation learning with a lightweight labeling pipeline that converts unlabeled human demonstrations into action-labeled sequences, using a color-coded end-effector and a RANSAC+ICP pose-estimation workflow. The approach relies on a consistent end-effector appearance and an embodiment-invariant wrist view to enable zero-shot cross-embodiment grounding. In simulation, policies trained solely on precisely labeled human demonstrations reach about 88% of the performance of policies trained on robot demonstrations, and adding labeled human data to robot data improves performance across tasks, which suggests the approach is practical for frontier vision-language-action models. The work emphasizes a simple hardware setup and a minimal labeling strategy that can integrate with large-scale imitation learning pipelines, while acknowledging the need for real-world validation and broader ablations in future work.
Abstract
Within the imitation learning paradigm, training generalist robots requires large-scale datasets obtainable only through diverse curation. Because they are relatively easy to collect, human demonstrations constitute a valuable addition when incorporated appropriately. However, existing methods that use human demonstrations face challenges in inferring precise actions, ameliorating embodiment gaps, and fusing with frontier generalist robot training pipelines. In this work, building on prior studies that demonstrate the viability of hand-held grippers for efficient data collection, we leverage the user's control over the gripper's appearance, specifically by assigning it a unique, easily segmentable color, to enable simple and reliable application of RANSAC and ICP registration for precise end-effector pose estimation. We show in simulation that precisely labeled human demonstrations on their own allow policies to reach on average 88.1% of the performance obtained with robot demonstrations, and that they boost policy performance when combined with robot demonstrations, despite the inherent embodiment gap.
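To make the labeling idea concrete, the following is a minimal NumPy sketch of color-based segmentation followed by RANSAC and ICP registration on a synthetic point cloud. It is an illustration under stated assumptions, not the paper's implementation: the gripper color (magenta), the thresholds, the synthetic scene, and all function names are hypothetical, and the putative correspondences fed to RANSAC are simulated here, whereas a real pipeline would obtain them from a feature-matching step that the abstract does not specify.

```python
import numpy as np

def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula for a rotation matrix."""
    a = np.asarray(axis, float) / np.linalg.norm(axis)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def segment_by_color(points, colors, target_rgb, tol=0.15):
    """Keep points whose RGB (in [0, 1]) lies within `tol` of the target color."""
    mask = np.linalg.norm(colors - np.asarray(target_rgb), axis=1) < tol
    return points[mask]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping paired src points onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - src_c).T @ (dst - dst_c))
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def ransac_pose(src, dst, rng, iters=200, thresh=0.01):
    """Coarse pose from putative pairs src[i] <-> dst[i], some of them wrong."""
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    return kabsch(src[best], dst[best])  # refit on the largest consensus set

def icp(src, dst, R, t, iters=20):
    """Refine (R, t) by alternating nearest-neighbor matching and Kabsch."""
    for _ in range(iters):
        moved = src @ R.T + t
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        R, t = kabsch(src, dst[d2.argmin(axis=1)])
    return R, t

# Synthetic scene (all values are assumptions for illustration only).
rng = np.random.default_rng(0)
model = rng.uniform([-0.05, -0.025, -0.015], [0.05, 0.025, 0.015], size=(150, 3))
R_true = rotation_from_axis_angle([1.0, 2.0, 3.0], 0.8)
t_true = np.array([0.30, -0.10, 0.50])
gripper_rgb_target = np.array([1.0, 0.0, 1.0])  # hypothetical gripper color
gripper_pts = model @ R_true.T + t_true + rng.normal(0, 0.001, model.shape)
gripper_rgb = np.clip(gripper_rgb_target + rng.normal(0, 0.02, model.shape), 0, 1)
clutter_pts = rng.uniform(-1, 1, size=(300, 3))
clutter_rgb = rng.uniform(0, 0.6, size=(300, 3))  # never close to magenta
scene_pts = np.vstack([gripper_pts, clutter_pts])
scene_rgb = np.vstack([gripper_rgb, clutter_rgb])

# Step 1: isolate the gripper by its color.
seg = segment_by_color(scene_pts, scene_rgb, gripper_rgb_target)

# Step 2: RANSAC on simulated putative matches, 30% of which are scrambled.
match = np.arange(len(model))
bad = rng.choice(len(model), size=45, replace=False)
match[bad] = rng.permutation(match[bad])
R0, t0 = ransac_pose(model, seg[match], rng)

# Step 3: ICP refinement against the segmented cloud, without given matches.
R_est, t_est = icp(model, seg, R0, t0)
print("rotation error:", np.abs(R_est - R_true).max())
print("translation error:", np.abs(t_est - t_true).max())
```

The sketch shows why a distinctive color helps: once clutter is removed by a simple threshold, registration only has to align the gripper model against gripper points, so a coarse RANSAC estimate followed by ICP refinement is enough to recover the pose. The brute-force nearest-neighbor search in `icp` is chosen for brevity and would be replaced by a spatial index for realistic point counts.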
