R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

TL;DR

R2RGen introduces a simulator-free real-to-real 3D data generation framework for spatially generalized robotic manipulation. From a single human demonstration, it parses scene geometry and trajectories, applies group-wise augmentations that preserve multi-object relations, and employs camera-aware post-processing to align augmented data with real RGB-D sensor distributions. Real-world experiments across eight tasks show that policies trained with R2RGen-generated data achieve strong spatial generalization, often matching or exceeding performance obtained with many more human demonstrations, and extend to appearance generalization and mobile manipulation. The approach promises scalable, plug-and-play deployment of visuomotor policies in mobile robots, with limitations acknowledged and avenues for future work identified.

Abstract

Toward generalized robotic manipulation, spatial generalization is the most fundamental capability: the policy must work robustly under different spatial distributions of objects, the environment, and the agent itself. Achieving this typically requires collecting a substantial number of human demonstrations that cover different spatial configurations in order to train a generalized visuomotor policy via imitation learning. Prior work explores a promising direction: using data generation to obtain abundant, spatially diverse data from minimal source demonstrations. However, most approaches face a significant sim-to-real gap and are often limited to constrained settings, such as fixed-base scenarios and predefined camera viewpoints. In this paper, we propose R2RGen, a real-to-real 3D data generation framework that directly augments pointcloud observation-action pairs to generate real-world data. R2RGen is simulator- and rendering-free, which makes it efficient and plug-and-play. Specifically, given a single source demonstration, we introduce an annotation mechanism for fine-grained parsing of the scene and trajectory. We propose a group-wise augmentation strategy to handle complex multi-object compositions and diverse task constraints. We further present camera-aware processing to align the distribution of generated data with that of a real-world 3D sensor. Empirically, R2RGen substantially improves data efficiency in extensive experiments and shows strong potential for scaling and for application to mobile manipulation.
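To make the idea of augmenting pointcloud observation-action pairs concrete, here is a minimal sketch, not the authors' implementation. It assumes each object is already segmented into its own list of 3D points, that the augmentation is a planar rigid transform (yaw about the z-axis plus an xy translation), and that the action is a list of end-effector waypoints belonging to a skill that moves with the object group. The names `planar_transform` and `augment_group` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def planar_transform(p: Point, yaw: float, tx: float, ty: float) -> Point:
    """Rotate a point about the z-axis by `yaw` radians, then translate it in the xy-plane."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + tx, s * x + c * y + ty, z)


def augment_group(
    objects: Dict[str, List[Point]],
    group: Sequence[str],
    waypoints: List[Point],
    yaw: float,
    tx: float,
    ty: float,
) -> Tuple[Dict[str, List[Point]], List[Point]]:
    """Move every object in `group` with one shared transform (assumed setup).

    Objects outside the group (e.g. the environment) are copied unchanged.
    The end-effector waypoints receive the same transform, so the action
    stays consistent with the edited observation.
    """
    moved = {
        name: [planar_transform(p, yaw, tx, ty) for p in pts] if name in group else list(pts)
        for name, pts in objects.items()
    }
    new_waypoints = [planar_transform(p, yaw, tx, ty) for p in waypoints]
    return moved, new_waypoints
```

Because all members of the group share one rigid transform, the distances between them are unchanged, which is the property a group-wise scheme needs when a task depends on how several objects are arranged relative to one another.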

Paper Structure

This paper contains 25 sections, 9 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: R2RGen is a simulator-free data generation framework. Given one human-collected demonstration, R2RGen directly parses and edits both pointcloud observations and action trajectories in a shared 3D space. R2RGen achieves strong spatial generalization on diverse complex tasks.
  • Figure 2: Pre-processing results. The 3D scene is parsed into complete objects, the environment, and the robot arm. The trajectory is parsed into interleaved motion and skill segments.
  • Figure 3: The pipeline of R2RGen. Given the processed source demonstration, we backtrack through the skills and apply group-wise augmentation to maintain the spatial relationships among target objects; a fixed object set is maintained to judge whether an augmentation is applicable. Motion planning is then performed to generate trajectories that connect adjacent skills. After augmentation, we perform camera-aware processing so that the pointclouds follow the distribution of an RGB-D camera. Solid arrows indicate the processing flow, while dashed arrows indicate updates to the fixed object set.
  • Figure 4: Visualization of our real-world tasks. We show the start and end moments of each task.
  • Figure 5: Effects of the number of generated demonstrations and source demonstrations on the final performance of R2RGen.
  • ...and 8 more figures
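
The camera-aware processing step named in the Figure 3 caption can be illustrated with a small sketch. The source says only that this step makes generated pointclouds follow the distribution of an RGB-D camera; the specific operations below (a pinhole projection, cropping to the image bounds, and keeping the nearest point per pixel) are assumptions chosen for illustration, not the authors' implementation. The function name `camera_aware_filter` and the intrinsics parameters are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def camera_aware_filter(
    points: List[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Point]:
    """Keep only points that a single pinhole depth camera could observe.

    Points are given in the camera frame with z pointing forward.
    Points behind the camera or outside the image are dropped, and for
    each pixel only the point closest to the camera survives.
    """
    nearest: Dict[Tuple[int, int], Point] = {}
    for x, y, z in points:
        if z <= 0.0:
            continue  # behind the camera
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # outside the field of view
        kept = nearest.get((u, v))
        if kept is None or z < kept[2]:
            nearest[(u, v)] = (x, y, z)  # occludes anything farther on this pixel
    return list(nearest.values())
```

A filter of this kind matters after objects have been moved: a relocated object may leave the field of view or hide points that were visible in the source demonstration, and a real sensor would not return those points.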