Robot Learning with Super-Linear Scaling

Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Abhishek Gupta

TL;DR

This work introduces CASHER, a real-to-sim-to-real pipeline that crowdsources digital twins of real scenes to train generalist robotic policies with sublinear human effort. By coupling real-world scene reconstruction with simulation-based data collection and model-generated demonstrations, CASHER shows zero-shot and few-shot scaling on three real-world tasks and can fine-tune a pre-trained policy to a deployment scene from a video scan alone, without additional demonstrations. The evaluation shows zero-shot success increasing with the number of training environments, the number of required human demonstrations decreasing as training proceeds, and effective fine-tuning to target scenes. The paper also discusses limitations, notably the compute burden and the sim-to-real gap, and positions the pipeline as a scalable path toward robotic foundation models.

Abstract

Scaling robot learning requires data collection pipelines that scale favorably with human effort. In this work, we propose Crowdsourcing and Amortizing Human Effort for Real-to-Sim-to-Real (CASHER), a pipeline for scaling up data collection and learning in simulation where performance scales superlinearly with human effort. The key idea is to crowdsource digital twins of real-world scenes using 3D reconstruction and to collect large-scale data in simulation rather than in the real world. Data collection in simulation is initially driven by RL, bootstrapped with human demonstrations. As the training of a generalist policy progresses across environments, its generalization capabilities can be used to replace human effort with model-generated demonstrations. This results in a pipeline where behavioral data is collected in simulation with continually decreasing human effort. We show that CASHER demonstrates zero-shot and few-shot scaling laws on three real-world tasks across diverse scenarios. We also show that CASHER enables fine-tuning of pre-trained policies to a target scenario using a video scan, without any additional human effort. See our project website: https://casher-robot-learning.github.io/CASHER/
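
The amortization idea in the abstract can be illustrated with a toy bookkeeping loop. This is only a sketch under stated assumptions, not the authors' implementation: the success threshold `min_success`, the per-environment demo count, and the `rollout_success` callback are all hypothetical stand-ins, and the RL and distillation steps are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class CollectionLog:
    """Counts where the bootstrapping demonstrations came from."""
    human_demos: int = 0
    model_demos: int = 0
    sources: List[str] = field(default_factory=list)


def collect_demos(
    env_ids: Iterable[int],
    rollout_success: Callable[[int, int], float],
    demos_per_env: int = 10,    # assumed value, for illustration only
    min_success: float = 0.5,   # assumed threshold, for illustration only
) -> CollectionLog:
    """Decide, per digital twin, whether humans or the generalist supply demos.

    `rollout_success(env_id, n_trained)` is a hypothetical stand-in for
    evaluating the current generalist policy in a newly scanned environment
    after it has been trained on `n_trained` earlier environments.
    """
    log = CollectionLog()
    for n_trained, env_id in enumerate(env_ids):
        if rollout_success(env_id, n_trained) >= min_success:
            # The generalist already solves the new scene often enough:
            # its rollouts replace human teleoperation.
            log.model_demos += demos_per_env
            log.sources.append("model")
        else:
            # Early in training, fall back to human demonstrations.
            log.human_demos += demos_per_env
            log.sources.append("human")
        # Omitted: RL in simulation bootstrapped from these demos, then
        # distillation of the result into the generalist policy.
    return log


def toy_success(env_id: int, n_trained: int) -> float:
    """Toy assumption: generalization improves with each trained environment."""
    return min(1.0, 0.2 * n_trained)


log = collect_demos(range(6), toy_success)
print(log.sources, log.human_demos, log.model_demos)
```

In this toy run the first environments are bootstrapped by humans and the later ones by the model, so the human demo count stops growing while the number of environments keeps increasing.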

Paper Structure

This paper contains 35 sections, 2 equations, 13 figures, 10 tables, 2 algorithms.

Figures (13)

  • Figure 1: Overview of CASHER: we propose a system for training generalist policies that leverages real-to-sim simulation on crowdsourced scans. The resulting policies have zero-shot transfer and scanned fine-tuning capabilities.
  • Figure 2: Overview of the proposed continual data collection system for amortizing human data collection.
  • Figure 3: a) CASHER's zero-shot scaling laws on the task of picking and placing bowls/cups/mugs in sinks; b) in the proposed real-to-sim-to-real setup, there is a linear relation between performance in sim and performance in real; c) evaluation on a broader set of environments confirms the robustness of the zero-shot policies.
  • Figure 4: a) CASHER with continual data collection becomes more efficient in the number of human demos and achieves higher performance than running CASHER solely from human demos; b) with continual data collection, the number of human demos required decreases throughout training; c) even though CASHER relies on compute, the amount of compute needed also tends to decrease as the process scales up.
  • Figure 5: left: results for few-shot fine-tuning on the task of picking and placing a box on a shelf; middle: results for opening a cabinet; right: multi-object evaluation results on the task of picking and placing mugs/bowls/cups in the sink.
  • ...and 8 more figures