Scaling Single Human Demonstrations for Imitation Learning using Generative Foundational Models

Nick Heppert; Minh Quang Nguyen; Abhinav Valada

TL;DR

This work proposes Real2Gen, a method for training a manipulation policy from a single human demonstration, evaluates it on human demonstrations from three real-world tasks, and compares it to a recent baseline.

Abstract

Imitation learning is a popular paradigm to teach robots new tasks, but collecting robot demonstrations through teleoperation or kinesthetic teaching is tedious and time-consuming. In contrast, directly demonstrating a task using our human embodiment is much easier and data is available in abundance, yet transfer to the robot can be non-trivial. In this work, we propose Real2Gen to train a manipulation policy from a single human demonstration. Real2Gen extracts required information from the demonstration and transfers it to a simulation environment, where a programmable expert agent can demonstrate the task arbitrarily many times, generating an unlimited amount of data to train a flow matching policy. We evaluate Real2Gen on human demonstrations from three different real-world tasks and compare it to a recent baseline. Real2Gen shows an average increase in the success rate of 26.6% and better generalization of the trained policy due to the abundance and diversity of training data. We further deploy our purely simulation-trained policy zero-shot in the real world. We make the data, code, and trained models publicly available at real2gen.cs.uni-freiburg.de.
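The abstract states that the simulated expert data is used to train a flow matching policy. As a hedged illustration only, the sketch below shows the generic conditional flow matching recipe with a linear (straight-line) probability path between Gaussian noise and expert actions. It is not the authors' implementation: the helper names make_cfm_batch, cfm_loss, and sample_action are hypothetical, the velocity network is abstracted as a plain callable velocity_fn(x, t, obs), and the form of the observation conditioning is an assumption.

```python
import numpy as np


def make_cfm_batch(expert_actions, rng):
    """Hypothetical helper: build conditional flow matching training targets.

    expert_actions: (B, D) actions from the simulated expert dataset.
    Uses the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I).
    Returns the noisy inputs x_t, times t of shape (B, 1), and target velocities u_t.
    """
    x1 = np.asarray(expert_actions, dtype=float)
    x0 = rng.standard_normal(x1.shape)          # noise sample per action
    t = rng.uniform(size=(x1.shape[0], 1))      # one time per sample, broadcast over D
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0                               # constant velocity along the straight path
    return x_t, t, u_t


def cfm_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))


def sample_action(velocity_fn, obs, dim, steps, rng):
    """Integrate a velocity field from noise (t = 0) to an action (t = 1).

    velocity_fn(x, t, obs) stands in for the trained, observation-conditioned
    policy network; integration uses forward Euler steps of size 1 / steps.
    """
    x = rng.standard_normal(dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, obs)
    return x
```

As a sanity check, plugging in the exact velocity toward a single known action, velocity_fn = lambda x, t, obs: (target - x) / (1 - t), makes the Euler integration land on target, because on the final step the step size equals the remaining time to t = 1.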

Paper Structure (18 sections, 6 figures, 4 tables)

This paper contains 18 sections, 6 figures, 4 tables.

Introduction
Related Work
Learning from Human Demonstrations
Learning through Procedural and Generative Simulation
Technical Approach
Pre-Processing Human Demonstrations
Asset Generation
Demonstration Generation
Policy Learning
Experimental Evaluation
Quantitative Full Pipeline Evaluation
Comparison of Mesh Generation
VLMs as Size and Pose Estimators
Real-World Robotic Experiments
Conclusion

Figures (6)

Figure 1: Overview of Real2Gen. Real2Gen takes a single human demonstration as input and produces simulatable meshes using 3D generative foundational models, which can be used in a generative simulation setup.
Figure 2: Technical approach of Real2Gen. Real2Gen takes as input a single human demonstration consisting of a sequence of RGB-D images. In the first step (Pre-Processing Human Demonstrations), we process these images with DITTO [heppert2024ditto] to retrieve a primary and, if applicable, a secondary object mask, as well as an object-centric trajectory. In the second step (Asset Generation), we pass object images to Point-E [nichol2022point] to generate 3D meshes in a canonical space, and then use Zero-Shot-Pose (ZSP) [goodwin2022zero] to scale and align the meshes to the human demonstration. In the third step (Demonstration Generation), we combine the generated meshes with the object-centric trajectories to set up a simulation, in which grasp and motion planning generate an expert dataset of policy rollouts. In the last step (Policy Learning), we use the collected dataset to train a conditional flow matching policy [chisari2024learning].
Figure 3: Results of Ablation Study. We show the average success rate [%] (↑) across all tasks. We vary either the number of demonstrations while keeping the number of meshes fixed at five, or the number of meshes while keeping the number of demonstrations fixed at 800.
Figure 4: Precision Curve for Scaling Factor. We plot the percentage of meshes whose relative size error falls below a given threshold, over a range of thresholds.
Figure 5: Real-World Robot Experiment. Failure cases include imperfect grasps and premature closing of the gripper.
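Figure 2 states that ZSP scales and aligns the canonical Point-E meshes to the human demonstration, and Figure 4 evaluates the estimated scaling factor with a precision curve over the relative size error. The sketch below is a minimal illustration under stated assumptions: the alignment is modeled as a similarity transform (scale s, rotation R, translation t) applied to mesh vertices, the relative size error is assumed to be |s_pred - s_gt| / s_gt, and the helper names align_mesh, relative_size_error, and precision_curve are hypothetical, not taken from the released code or from ZSP.

```python
import numpy as np


def align_mesh(vertices, scale, rotation, translation):
    """Hypothetical helper: map canonical mesh vertices (N, 3) into the
    demonstration frame with a similarity transform v' = s * R @ v + t."""
    v = np.asarray(vertices, dtype=float)
    R = np.asarray(rotation, dtype=float)
    # Row-vector convention: (s * v) @ R.T applies R to every vertex.
    return scale * v @ R.T + np.asarray(translation, dtype=float)


def relative_size_error(pred_scale, gt_scale):
    """Assumed metric: |s_pred - s_gt| / s_gt for each mesh."""
    pred = np.asarray(pred_scale, dtype=float)
    gt = np.asarray(gt_scale, dtype=float)
    return np.abs(pred - gt) / gt


def precision_curve(errors, thresholds):
    """Percentage of meshes whose relative size error is below each threshold."""
    errors = np.asarray(errors, dtype=float)
    return np.array([100.0 * np.mean(errors < th) for th in thresholds])
```

Sweeping thresholds over, for example, np.linspace(0, 1, 101) and plotting the result against the threshold gives a curve of the kind described in Figure 4; the threshold range actually used in the paper is not stated here.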