RoboAug: One Annotation to Hundreds of Scenes via Region-Contrastive Data Augmentation for Robotic Manipulation

Xinhua Wang; Kun Wu; Zhen Zhao; Hu Cao; Yinuo Zhao; Zhiyuan Xu; Meng Li; Shichao Fan; Di Wu; Yixue Zhang; Ning Liu; Zhengping Che; Jian Tang

TL;DR

In unseen scenes that combine diverse backgrounds, distractors, and lighting conditions, RoboAug significantly outperforms state-of-the-art data augmentation baselines and achieves substantial gains over the baseline trained without augmentation.

Abstract

Enhancing the generalization capability of robotic learning to enable robots to operate effectively in diverse, unseen scenes is a fundamental and challenging problem. Existing approaches often depend on pretraining with large-scale data collection, which is labor-intensive and time-consuming, or on semantic data augmentation techniques that rely on the impractical assumption of flawless upstream object detection in real-world scenarios. In this work, we propose RoboAug, a novel generative data augmentation framework that substantially reduces reliance on large-scale pretraining and on the perfect visual recognition assumption by requiring only the bounding box annotation of a single image during training. Leveraging this minimal information, RoboAug employs pre-trained generative models for precise semantic data augmentation and integrates a plug-and-play region-contrastive loss to help models focus on task-relevant regions, thereby improving generalization and boosting task success rates. We conduct extensive real-world experiments on three robots, namely UR-5e, AgileX, and Tien Kung 2.0, spanning over 35k rollouts. Empirical results demonstrate that RoboAug significantly outperforms state-of-the-art data augmentation baselines. Specifically, when evaluating generalization capabilities in unseen scenes featuring diverse combinations of backgrounds, distractors, and lighting conditions, our method achieves substantial gains over the baseline without augmentation. The success rates increase from 0.09 to 0.47 on UR-5e, from 0.16 to 0.60 on AgileX, and from 0.19 to 0.67 on Tien Kung 2.0. These results highlight the superior generalization and effectiveness of RoboAug in real-world manipulation tasks. The project page is available at https://x-roboaug.github.io/.
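
To make the two ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the generative model or the exact form of the region-contrastive loss, so two stand-ins are assumed here: plain background compositing around a bounding box replaces generative augmentation, and an InfoNCE-style objective replaces the region-contrastive loss. All function and parameter names are hypothetical.

```python
import numpy as np


def composite_with_new_background(image, bbox, background):
    """Keep the task-relevant box from `image`, take all other pixels from `background`.

    Assumption: a stand-in for generative augmentation; bbox is (x0, y0, x1, y1)
    in pixel coordinates, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = bbox
    out = background.copy()  # do not modify the caller's background
    out[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return out


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor feature vector.

    Assumption: `anchor` and `positive` are features of the same task-relevant
    region in the original and augmented views; `negatives` (shape [k, d]) are
    features of other regions. The loss is low when the anchor is closer to
    the positive than to any negative.
    """
    def normalize(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    # Cosine similarities: positive first, then the k negatives.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_positive
```

In a training loop, a term like `info_nce` would be added to the policy's imitation loss with some weight, which is one way to read the "plug-and-play" description in the abstract; the weighting and the feature extractor are left unspecified here because the abstract does not give them.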

Paper Structure (31 sections, 3 theorems, 9 equations, 19 figures, 8 tables)

Introduction
Related Work
Generalization in Visuomotor Policy Learning
Data Augmentation for Robotic Manipulation
Methodology
Overview
Task-Relevant Region Extraction
Semantic Data Augmentation
Region-Contrastive Policy Learning
RoboAug-D Dataset for Object Detection
Experiments
Object Detection on RoboAug-D Dataset
Real-World Generalizable Robotic Manipulation
Compositional Generalization Evaluation
Single-Factor Generalization Evaluation
...and 16 more sections

Key Result

Theorem 6.1

Let $\mathcal{H}$ be the policy hypothesis class. Assume the loss function $\ell$ is Lipschitz continuous with respect to its first argument with constant $L_\ell$ and is bounded by $c$. For any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a dataset $S$ of size $N$, the foll where $\hat{\mathcal{R}}_S(\pi)$ is the empirical risk on the dataset $S$, and $\mathfrak{R}_{N}(\m
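
The statement above is cut off in this excerpt. For orientation only, bounds built on these assumptions (a loss that is $L_\ell$-Lipschitz and bounded by $c$, and the Rademacher complexity of the hypothesis class) usually take the following textbook shape. This is the standard form for scalar-valued hypotheses, not necessarily the exact statement or constants of Theorem 6.1; the symbol $\mathcal{R}(\pi)$ for the expected risk is an assumption, and $\mathfrak{R}_{N}(\mathcal{H})$ is read as the Rademacher complexity of $\mathcal{H}$ on $N$ samples.

$$
\mathcal{R}(\pi) \;\le\; \hat{\mathcal{R}}_S(\pi) \;+\; 2 L_\ell\, \mathfrak{R}_{N}(\mathcal{H}) \;+\; c \sqrt{\frac{\ln(1/\delta)}{2N}} \qquad \text{for all } \pi \in \mathcal{H}.
$$

Read this way, the gap between expected and empirical risk shrinks when the dataset grows (larger $N$) or when the effective complexity of the hypothesis class falls, which matches the themes named in the titles of Theorem 6.2 and Corollary 6.2.1.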

Figures (19)

Figure 1: We introduce RoboAug, a region-contrastive data augmentation framework. RoboAug enables robust robotic generalization in diverse, unseen scenes.
Figure 2: Overview of RoboAug. RoboAug contains three stages: (1) task-relevant region extraction, (2) semantic data augmentation, and (3) region-contrastive policy learning.
Figure 3: Comparison of mAP@0.5 across RoboAug-D Dataset. We present the results of 5 representative objects.
Figure 4: Overview of the generalization evaluation settings, spanning single-factor variations and compositional dual- and triple-factor scenes involving background, distractors, and lighting.
Figure 5: Experimental Setup. We evaluate RoboAug across three robot embodiments.
...and 14 more figures

Theorems & Definitions (3)

Theorem 6.1: Generalization Bound for Loss Functions
Theorem 6.2: Generalization Bound with Semantic Augmentation
Corollary 6.2.1: Complexity Reduction via RCL
