Imitation Learning with Precisely Labeled Human Demonstrations

Yilong Song

TL;DR

This paper tackles data scarcity in generalist robot imitation learning with a lightweight labeling pipeline that converts unlabeled human demonstrations into action-labeled sequences using a color-coded end-effector and a RANSAC+ICP pose-estimation workflow. The approach relies on a consistent end-effector appearance and an embodiment-invariant wrist view to enable zero-shot cross-embodiment grounding. In simulation, policies trained only on precisely labeled human demonstrations reach on average 88.1% of the performance of policies trained on robot demonstrations, and adding labeled human data to robot data improves performance across tasks despite the embodiment gap. The work emphasizes a simple hardware setup and a minimal labeling strategy that can integrate with large-scale imitation learning pipelines for frontier vision-language-action models, while acknowledging the need for real-world validation and broader ablations in future work.

Abstract

Within the imitation learning paradigm, training generalist robots requires large-scale datasets obtainable only through diverse curation. Because they are relatively easy to collect, human demonstrations constitute a valuable addition when incorporated appropriately. However, existing methods that use human demonstrations face challenges in inferring precise actions, ameliorating embodiment gaps, and fusing with frontier generalist robot training pipelines. In this work, building on prior studies that demonstrate the viability of using hand-held grippers for efficient data collection, we leverage the user's control over the gripper's appearance (specifically, by assigning it a unique, easily segmentable color) to enable simple and reliable application of RANSAC and ICP registration for precise end-effector pose estimation. We show in simulation that precisely labeled human demonstrations on their own allow policies to reach on average 88.1% of the performance obtained with robot demonstrations, and that they boost policy performance when combined with robot demonstrations, despite the inherent embodiment gap.
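
The color-coding idea in the abstract can be illustrated with a minimal sketch: if the gripper has a color that appears nowhere else in the scene, its points can be isolated from a colored point cloud with a simple threshold before any registration is attempted. Everything specific below is an assumption made for illustration and is not taken from the paper: the magenta target color, the `tol` threshold, the function name `segment_by_color`, and the toy scene.

```python
import numpy as np

def segment_by_color(points, colors, target_rgb, tol=0.15):
    """Keep the points whose RGB color is within `tol` of `target_rgb`.

    points: (N, 3) xyz coordinates; colors: (N, 3) RGB values in [0, 1].
    The distance is plain Euclidean distance in RGB space (an assumption;
    a real pipeline might threshold in HSV or another color space).
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=float)
    dist = np.linalg.norm(colors - np.asarray(target_rgb, dtype=float), axis=1)
    mask = dist < tol
    return points[mask], mask

# Toy scene: three magenta-ish "gripper" points and two background points.
points = np.array([
    [0.00, 0.00, 0.50],
    [0.01, 0.00, 0.50],
    [0.30, 0.20, 0.90],
    [0.00, 0.01, 0.51],
    [0.40, 0.10, 1.00],
])
colors = np.array([
    [1.00, 0.00, 1.00],
    [0.95, 0.05, 0.98],
    [0.50, 0.50, 0.50],
    [0.95, 0.08, 0.93],
    [0.20, 0.60, 0.30],
])

# Hypothetical gripper color: pure magenta.
gripper_pts, mask = segment_by_color(points, colors, target_rgb=(1.0, 0.0, 1.0))
print(mask)
```

The segmented points would then serve as the target cloud for registration against a point cloud sampled from the gripper mesh.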

Paper Structure

This paper contains 9 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: From left to right: 1) the triangle mesh of the recolored Panda gripper used in simulation; 2) the corresponding point cloud sampled uniformly from the mesh, which we use for pose estimation; 3) a point cloud of the robot (cropped from the scene for clarity) in a simulated demonstration; 4) Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP) are applied to estimate the rigid-body transformation (i.e., the pose) that aligns the end-effector point cloud obtained from the mesh with the robot point cloud, resulting in precise and reliable end-effector pose estimation without the need to train a deep learning model.
  • Figure 2: Downstream Policy Performance on Task Square $D_0$ During Training on Different Data Mixtures. We visualize the task success rate of a visual diffusion policy averaged across 50 different environment initial conditions on different mixtures of teleoperated demonstrations (TD) and simulated human demonstrations (HD) during training.
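
The registration step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of point-to-point ICP only, written with NumPy, and is not the authors' implementation. The RANSAC global-alignment stage is omitted: the sketch assumes the two clouds already start roughly aligned, which is what a RANSAC initialization would normally provide. The function names `kabsch` and `icp`, the grid-shaped stand-in for the gripper cloud, and the ground-truth pose are all hypothetical.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping paired src points onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(model, scene, iters=30):
    """Point-to-point ICP aligning model (M, 3) to scene (N, 3).

    Starts from the identity transform, so it only works when the clouds
    are already close (the role RANSAC would play in a full pipeline).
    """
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = model.copy()
    for _ in range(iters):
        # Brute-force nearest scene point for every (moved) model point.
        d2 = ((moved[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        nearest = scene[d2.argmin(axis=1)]
        R, t = kabsch(moved, nearest)
        moved = moved @ R.T + t
        # Compose the incremental transform with the running estimate.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Stand-in for the cloud sampled from the gripper mesh: a 4x3x2 grid, 2 cm spacing.
model = np.array(
    [[x, y, z] for x in range(4) for y in range(3) for z in range(2)],
    dtype=float,
) * 0.02

# Hypothetical ground-truth pose: 2 degrees about z plus a few millimetres.
theta = np.deg2rad(2.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.003, -0.002, 0.001])
scene = model @ R_true.T + t_true              # the "observed" gripper cloud

R_est, t_est = icp(model, scene)
print(np.round(R_est, 4), np.round(t_est, 4))
```

The returned pair `(R_est, t_est)` plays the role of the end-effector pose in the caption. In practice a library routine with a spatial index for nearest-neighbor search would replace the brute-force distance matrix, which scales quadratically with cloud size.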