RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation

Chengbo Yuan, Suraj Joshi, Shaoting Zhu, Hang Su, Hang Zhao, Yang Gao

TL;DR

RoboEngine tackles the fragility of visuomotor imitation learning due to visual disturbances by introducing a calibration-free, plug-and-play data augmentation pipeline. It combines RoboSeg-based fine-grained robot segmentation (Robo-SAM) with a task-aware background generator (BackGround-Diffusion) to produce physically feasible, diverse robot scenes from demonstrations in a single scene, enabling zero-shot generalization to six new scenes. The approach achieves substantial improvements over no-augmentation baselines and competitive gains against prior augmentation methods, validated through both segmentation metrics and real-robot policy evaluation. By releasing RoboSeg, Robo-SAM, and the end-to-end RoboEngine toolkit, the work provides a practical, scalable solution to enhance visual robustness in robotic imitation learning for broader adoption.

Abstract

Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate the ability to generalize robot manipulation tasks across six entirely new scenes, based solely on demonstrations collected from a single scene, achieving a more than 200% performance improvement compared to the no-augmentation baseline. All datasets, model weights, and the toolkit are released at https://roboengine.github.io/.
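
The augmentation idea summarized above (segment the robot and task-relevant objects, then swap in a new background) can be illustrated with a minimal NumPy sketch. This is not the RoboEngine API: the function names composite_background and augment_episode are hypothetical, the mask is assumed to come from some segmentation model and the backgrounds from some generator, and reusing one background for a whole episode is an illustrative choice that the source does not specify.

```python
import numpy as np


def composite_background(frame, keep_mask, new_background):
    """Keep masked pixels (robot + task objects) and replace everything else.

    frame, new_background: uint8 arrays of shape (H, W, 3).
    keep_mask: boolean array of shape (H, W), True where the original
               pixel must be preserved.
    Hypothetical helper for illustration, not part of any released toolkit.
    """
    if frame.shape != new_background.shape:
        raise ValueError("frame and background must have the same shape")
    if keep_mask.shape != frame.shape[:2]:
        raise ValueError("mask must match the frame's height and width")
    # Broadcast the (H, W) mask across the three colour channels.
    return np.where(keep_mask[..., None], frame, new_background)


def augment_episode(frames, masks, backgrounds, rng):
    """Composite every frame of one demonstration onto one random background.

    Assumption: a single background is shared by all frames of the episode
    so the scene stays consistent over time.
    """
    index = int(rng.integers(len(backgrounds)))
    return [
        composite_background(frame, mask, backgrounds[index])
        for frame, mask in zip(frames, masks)
    ]


if __name__ == "__main__":
    # Toy 2x2 "image": bright foreground, black replacement background.
    frame = np.full((2, 2, 3), 200, dtype=np.uint8)
    mask = np.array([[True, False], [False, True]])
    background = np.zeros((2, 2, 3), dtype=np.uint8)

    episode = augment_episode(
        [frame], [mask], [background], np.random.default_rng(0)
    )
    print(episode[0][:, :, 0])
```

In this toy case the masked pixels keep their original value and the rest take the background value. A real pipeline would additionally need the generated background to be physically plausible around the preserved objects, which is the part the paper's fine-tuned generation model is meant to address.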

Paper Structure

This paper contains 19 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 2: Our RoboSeg dataset provides high-quality, fine-grained semantic segmentation annotations, covering a wide diversity of robots and environments.
  • Figure 3: Comparison of Segmentation Results between our new Robo-SAM model and other baselines. Only Robo-SAM produces usable segmentation masks for downstream augmentation applications.
  • Figure 4: (a) Augmentation results using different methods. RoboEngine is the only method that simultaneously satisfies both physics constraints and high visual diversity. (b) Visualization of the real-robot evaluation environments. All scenes exhibit significant visual differences from the scene used for data collection. Video visualizations are available at https://roboengine.github.io/.
  • Figure 5: Performance scaling trend experiment results. We report the average behavior score (un-normalized) for the "Fold Towel (Finish)" task across 3 novel scenes. Raw results can be found in the appendix on the scaling experiment.