ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Zifan Xu; Ran Gong; Maria Vittoria Minniti; Ahmet Salih Gundogdu; Eric Rosen; Kausik Sivakumar; Riedana Yan; Zixing Wang; Di Deng; Peter Stone; Xiaohan Zhang; Karl Schmeckpeper

Abstract

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. Because the pretrained diffusion policy stays frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.
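The noise-steering idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: `FrozenDenoiser` is a toy linear stand-in for the pretrained diffusion policy, `NoiseSteeringPolicy` is a hypothetical trainable head that outputs the initial noise, and the RL update that would train this head (FastTD3 in the paper) is omitted entirely. The sketch only shows that actions are changed by moving the initial noise, while the prior's weights are never modified.

```python
import numpy as np


class FrozenDenoiser:
    """Toy stand-in for a pretrained diffusion policy (hypothetical, for illustration)."""

    def __init__(self, state_dim, action_dim, num_steps=10, seed=0):
        rng = np.random.default_rng(seed)
        self.w_state = rng.normal(scale=0.5, size=(action_dim, state_dim))
        self.w_action = rng.normal(scale=0.5, size=(action_dim, action_dim))
        self.num_steps = num_steps
        # Mark the weights read-only to mimic a frozen prior.
        self.w_state.setflags(write=False)
        self.w_action.setflags(write=False)

    def predict_noise(self, state, noisy_action):
        # Toy noise prediction conditioned on the state and the current sample.
        return np.tanh(self.w_state @ state + self.w_action @ noisy_action)

    def denoise(self, state, initial_noise):
        # Deterministic iterative refinement: the action is a fixed function
        # of (state, initial_noise), so steering the noise steers the action.
        action = np.array(initial_noise, dtype=float)
        step = 1.0 / self.num_steps
        for _ in range(self.num_steps):
            action = action - step * self.predict_noise(state, action)
        return action


class NoiseSteeringPolicy:
    """Hypothetical trainable head mapping a state to the initial noise."""

    def __init__(self, state_dim, action_dim):
        self.weights = np.zeros((action_dim, state_dim))
        self.bias = np.zeros(action_dim)

    def initial_noise(self, state):
        return self.weights @ state + self.bias


def steered_action(prior, steering, state):
    # Only `steering` would receive RL updates; `prior` stays frozen.
    return prior.denoise(state, steering.initial_noise(state))


if __name__ == "__main__":
    prior = FrozenDenoiser(state_dim=4, action_dim=2)
    steering = NoiseSteeringPolicy(state_dim=4, action_dim=2)
    state = np.ones(4)
    before = steered_action(prior, steering, state)
    steering.bias += 1.0  # stand-in for one RL update of the steering head
    after = steered_action(prior, steering, state)
    print("action before steering update:", before)
    print("action after steering update: ", after)
```

In this toy setting, the RL algorithm would treat the initial noise as its action space, so exploration is filtered through the frozen prior rather than applied to raw robot actions.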

Paper Structure (58 sections, 21 equations, 21 figures, 15 tables)

Introduction
Related Work
Offline-to-Online Reinforcement Learning
Synthetic Robotics Data Generation
Preliminaries
Constrained Markov Decision Process (CMDP)
Diffusion Policy
ExpertGen
Generative Behavior Prior Modeling
Expert Policy Acquisition
Diffusion Steering Reinforcement Learning
Off-Policy RL with FastTD3
Visuomotor Policy Distillation through DAgger
Experiment Setup
Benchmarks
...and 43 more sections

Figures (21)

Figure 1: ExpertGen pipeline: (left) generative modeling of imperfect behavior priors; (middle) steering the prior model in massively parallel simulation using reinforcement learning; and (right) visual distillation in simulation with DAgger for zero-shot sim-to-real transfer.
Figure 2: ExpertGen training pipeline: generative modeling of imperfect behavior priors using a state-based diffusion policy (Phase 1); steering the diffusion policy in massively parallel simulation using FastTD3 (Phase 2); visual policy distillation using DAgger from expert teachers after steering (Phase 3).
Figure 3: The success rates (%) of the evaluated approaches on selected assets from the AutoMate benchmark. ExpertGen outperforms all baselines with an overall success rate of 90.5%. By introducing x-y noise, the BC policy achieves better state coverage and higher success rates than its counterpart without x-y noise.
Figure 4: Illustrations of the tasks in our experiments. (A) Real-world manipulation tasks. (B) Industrial assembly tasks from AutoMate. (C) Long-horizon tasks from AnyTask.
Figure 5: Ablation of DAgger vs. BC in simulation. We plot the student policy's evaluation success rate against the number of online training samples. Training with DAgger is important for efficiently achieving higher success rates.
...and 16 more figures
