ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Yu Fang, Yue Yang, Xinghao Zhu, Kaiyuan Zheng, Gedas Bertasius, Daniel Szafir, Mingyu Ding

TL;DR

This work addresses the data bottleneck in scaling vision-language-action (VLA) models for robot manipulation by introducing ReBot, a real-to-sim-to-real pipeline. ReBot replays real robot trajectories in simulation to diversify the manipulated objects (real-to-sim), then combines the simulated motions with inpainted real backgrounds to produce physically realistic, temporally consistent videos (sim-to-real), enabling automated adaptation of VLA models to target domains. Its three components (trajectory replay, background inpainting, and video synthesis) yield synthetic data that improves both the in-domain performance and the out-of-domain generalization of Octo and OpenVLA. In SimplerEnv with the WidowX robot, in-domain performance improves by 7.2% for Octo and 21.8% for OpenVLA, and out-of-domain generalization by 19.9% and 9.4%, respectively. In real-world evaluation with a Franka Panda robot, success rates increase by 17% for Octo and 20% for OpenVLA.

Abstract

Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/
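
The sim-to-real step described above, where simulated movements are integrated with an inpainted real-world background, can be pictured as per-frame mask compositing. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the simulator provides an RGB render and a foreground mask (robot plus object) for every frame, and that the inpainting module provides one background frame per time step. The function names `composite_frame` and `composite_video` and all array layouts are hypothetical.

```python
import numpy as np


def composite_frame(sim_rgb, sim_mask, background):
    """Blend one simulated frame onto one inpainted background frame.

    Assumed layouts (hypothetical, chosen for this sketch):
      sim_rgb:    (H, W, 3) uint8 render of the simulated robot and object
      sim_mask:   (H, W) float in [0, 1], 1 where the robot or object is visible
      background: (H, W, 3) uint8 real-world frame with robot and object removed
    """
    if sim_rgb.shape != background.shape or sim_mask.shape != sim_rgb.shape[:2]:
        raise ValueError("frame, mask, and background sizes must match")
    # Add a trailing channel axis so the mask broadcasts over RGB.
    alpha = sim_mask.astype(np.float32)[..., None]
    blended = alpha * sim_rgb.astype(np.float32) \
        + (1.0 - alpha) * background.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


def composite_video(sim_frames, sim_masks, backgrounds):
    """Composite a whole episode frame by frame; returns a (T, H, W, 3) array."""
    if not (len(sim_frames) == len(sim_masks) == len(backgrounds)):
        raise ValueError("all inputs must have the same number of frames")
    return np.stack([
        composite_frame(frame, mask, bg)
        for frame, mask, bg in zip(sim_frames, sim_masks, backgrounds)
    ])


if __name__ == "__main__":
    # Toy episode: 2 frames of 4x4 pixels, foreground in the central 2x2 patch.
    T, H, W = 2, 4, 4
    sim = np.full((T, H, W, 3), 200, dtype=np.uint8)
    bg = np.full((T, H, W, 3), 50, dtype=np.uint8)
    masks = np.zeros((T, H, W), dtype=np.float32)
    masks[:, 1:3, 1:3] = 1.0
    video = composite_video(sim, masks, bg)
    print(video.shape, video.dtype)
```

Because the foreground comes from a simulator, the mask is exact for every frame and every camera view, which is one plausible reading of why the synthesized videos stay temporally consistent. A soft mask (values between 0 and 1) is allowed here only to show how edges could be blended.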

Paper Structure

This paper contains 12 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An overview of ReBot. We propose ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets. ReBot replays real-world robot trajectories in a simulation environment to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to produce realistic synthetic videos (sim-to-real), effectively adapting VLA models to target domains.
  • Figure 2: An overview of our framework. ReBot includes three key components: A) Real-to-Sim Trajectory Replay: For each real-world episode, we automatically set up digital twins and replay the real-world trajectory to obtain simulated movements for manipulating new objects. Each trajectory can be reused for different objects. B) Real-world Background Inpainting: To obtain task-agnostic real-world background for video synthesis, we introduce an automated inpainting module to segment and remove the robot and object from the original real-world video. C) Sim-to-Real Video Synthesis: We eventually integrate simulated movements with task-agnostic real-world background to produce synthetic videos. ReBot is fully automated and requires no manual intervention.
  • Figure 3: Comparison of synthetic videos. We show examples from three datasets: DROID (left), BridgeData V2 (mid), and our dataset (right). ReBot generates realistic videos with physically plausible movements and excellent temporal consistency, significantly outperforming ROSIE.
  • Figure 4: Quantitative comparison of generated video quality. We report VBench scores as evaluation metrics. ReBot outperforms ROSIE and achieves video quality comparable to original real-world videos.
  • Figure 5: Comparisons of multi-view consistency. We present two examples from the DROID dataset, each captured from two different camera views. While ROSIE lacks multi-view consistency, ReBot naturally preserves this capability inherited from 3D simulation, ensuring the same object in different camera views, as in the real world.
  • ...and 2 more figures