Scaling Single Human Demonstrations for Imitation Learning using Generative Foundational Models

Nick Heppert, Minh Quang Nguyen, Abhinav Valada

TL;DR

This work proposes Real2Gen, a method that trains a manipulation policy from a single human demonstration. Real2Gen is evaluated on human demonstrations from three real-world tasks and compared against a recent baseline.

Abstract

Imitation learning is a popular paradigm to teach robots new tasks, but collecting robot demonstrations through teleoperation or kinesthetic teaching is tedious and time-consuming. In contrast, directly demonstrating a task using our human embodiment is much easier and data is available in abundance, yet transfer to the robot can be non-trivial. In this work, we propose Real2Gen to train a manipulation policy from a single human demonstration. Real2Gen extracts required information from the demonstration and transfers it to a simulation environment, where a programmable expert agent can demonstrate the task arbitrarily many times, generating an unlimited amount of data to train a flow matching policy. We evaluate Real2Gen on human demonstrations from three different real-world tasks and compare it to a recent baseline. Real2Gen shows an average increase in the success rate of 26.6% and better generalization of the trained policy due to the abundance and diversity of training data. We further deploy our purely simulation-trained policy zero-shot in the real world. We make the data, code, and trained models publicly available at real2gen.cs.uni-freiburg.de.
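
To make the "flow matching policy" mentioned above more concrete, the following is a minimal sketch of the common straight-line (rectified) variant of conditional flow matching. It is an illustration under stated assumptions, not the authors' implementation: the function names (`interpolate`, `target_velocity`, `cfm_loss`, `euler_sample`), the linear path, the plain-list action representation, and the `obs` conditioning argument are all hypothetical. The idea is that a velocity model is regressed onto the direction from a noise sample to an expert action, and actions are later sampled by integrating that velocity field from t = 0 to t = 1.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to expert action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(velocity_fn, x0, x1, t, obs):
    """Mean squared error between predicted and target velocity at time t.

    velocity_fn(x_t, t, obs) is a hypothetical stand-in for the learned,
    observation-conditioned policy network.
    """
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, obs)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, obs, steps=10):
    """Generate an action by Euler-integrating dx/dt = velocity_fn from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, obs)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, `x0` would be drawn from a noise distribution, `x1` from the simulated expert dataset, and `t` uniformly from [0, 1]; with a perfect velocity model, `euler_sample` transports the noise sample onto the expert action.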

Paper Structure (18 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of Real2Gen. Real2Gen takes a single human demonstration as input and produces simulatable meshes using 3D generative foundational models, which can be used in a generative simulation setup.
  • Figure 2: Technical approach of Real2Gen. Real2Gen uses a single human demonstration as input, consisting of a sequence of RGB-D images. We pre-process these images using DITTO [heppert2024ditto] to retrieve a primary and, if applicable, a secondary object mask as well as an object-centric trajectory of the object. In the second step, asset generation, we pass object images to Point-E [nichol2022point] to generate 3D meshes in a canonical space. We then use Zero-Shot-Pose (ZSP) [goodwin2022zero] to scale and align the meshes to the human demonstration. We then use the generated meshes combined with object-centric trajectories to set up a simulation. Using grasp and motion planning, we use the simulation to generate an expert dataset of policy rollouts. In the last step, policy learning, we use the collected dataset to train a conditional flow matching policy [chisari2024learning].
  • Figure 3: Results of Ablation Study. We show the average success rate $[\%]$ ($\uparrow$) across all tasks. We either vary the number of demonstrations while using five meshes, or we vary the number of meshes while using 800 demonstrations.
  • Figure 4: Precision Curve for Scaling Factor. We plot the percentage of meshes whose relative size error falls below a given threshold.
  • Figure 5: Real-World Robot Experiment. Failure cases include imperfect grasping and premature closing of the gripper.
  • ...and 1 more figures
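
The precision curve described for Figure 4 can be sketched as follows. This is a hedged illustration only: the definition of relative size error as |predicted − reference| / reference, the inclusive threshold comparison, and the function names `relative_size_error` and `precision_curve` are assumptions, since the listing above does not spell them out.

```python
def relative_size_error(predicted, reference):
    """Assumed definition: absolute size deviation relative to the reference size."""
    return abs(predicted - reference) / reference


def precision_curve(predicted_sizes, reference_sizes, thresholds):
    """For each threshold, the percentage of meshes whose relative size error
    is at or below that threshold (inclusive comparison is an assumption)."""
    errors = [
        relative_size_error(p, r)
        for p, r in zip(predicted_sizes, reference_sizes)
    ]
    return [
        100.0 * sum(e <= th for e in errors) / len(errors)
        for th in thresholds
    ]
```

For example, calling `precision_curve([1.0, 1.2, 0.5, 2.0], [1.0, 1.0, 1.0, 1.0], [0.1, 0.25, 0.5, 1.0])` yields one percentage per threshold; a curve that rises quickly indicates that most generated meshes are scaled close to their reference size.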