ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

Abstract

Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: ROPA performs offline data augmentation for bimanual imitation learning. White arrows indicate pose differences between the original and augmented images. Red regions represent ROPA-generated images and states every $k$ timesteps at $t+k$ and $t+2k$, while blue regions show the original dataset. RGB and depth image pairs are captured at the same timesteps, with the top row displaying depth colormaps and the bottom row showing standard RGB images.
  • Figure 2: ROPA Overview. (1) The Skeleton Pose Generator takes camera extrinsics and intrinsics, target joint positions, and left and right robot base positions to generate a skeleton pose image $I_{t}^{p}$ representing the target joint configuration. (2) The source image $I_{t}^{s}$ and language goal $g$ are fed into Stable Diffusion (the bottom U-Net model), while the generated skeleton pose serves as control input to ControlNet (the top U-Net model), producing the target image $I_{t}^{d}$. The lock icons represent frozen parameters. (3) The original dataset is duplicated and generated target states replace the original states every $k$ timesteps (see Section \ref{ssec:action-labeling-and-dataset-construction} for more details), with updated corresponding action labels. This augmented dataset is combined with the original dataset to train a bimanual manipulation policy.
  • Figure 3: Skeleton Pose Ablations and Visualization. Comparison of different skeleton pose formats: (1) ROPA's skeleton pose, (2) an OpenPose-inspired \cite{Cao2019OpenPose} skeleton pose, and (3) an all-white skeleton pose (less visual contrast). (4) demonstrates precise alignment between ROPA's skeleton pose and the source image. (5) shows a source image input for multi-view generation, while (6) displays the skeleton pose for both robots overlaid on that same source image.
  • Figure 4: Depth Image Generation. Condensed variation of the pipeline in Fig. \ref{fig:pipeline} for depth image synthesis. (1) Source depth colormap image input to Stable Diffusion. (2) RGB target image and skeleton pose provide conditioning inputs to ControlNet. (3) Generated target depth image.
  • Figure 5: Synthesized images in simulation. We present synthesized images from the Coordinated Lift Ball (CLB) task across two timesteps. The blue-bordered images show the original RGB and RGB-D images, while the red-bordered images show the generated target RGB and RGB-D images, conditioned on the corresponding skeleton pose shown below.
  • ...and 6 more figures
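
The Figure 2 caption states that the Skeleton Pose Generator uses camera intrinsics and extrinsics together with target joint positions to draw a skeleton pose image. One plausible way to obtain the pixel locations of the joints is a standard pinhole projection. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `project_joints`, the matrix names `K` and `T_cam_world`, and the numeric values are all hypothetical, and the joint positions are assumed to be given in the world frame and to lie in front of the camera.

```python
import numpy as np


def project_joints(joints_world, K, T_cam_world):
    """Project Nx3 world-frame points to Nx2 pixel coordinates (pinhole model).

    Hypothetical helper for illustration only. Assumes every point has
    positive depth in the camera frame.
    """
    joints_world = np.asarray(joints_world, dtype=float)
    # Append a column of ones to form homogeneous coordinates (N x 4).
    ones = np.ones((joints_world.shape[0], 1))
    joints_h = np.hstack([joints_world, ones])
    # Extrinsics: world frame -> camera frame, keep x, y, z.
    joints_cam = (T_cam_world @ joints_h.T).T[:, :3]
    # Intrinsics: camera frame -> image plane, then divide by depth.
    uvw = (K @ joints_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]


# Illustrative values (assumed, not from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)  # camera placed at the world origin
joints = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]]

pixels = project_joints(joints, K, T_cam_world)
print(pixels)
```

Consecutive projected joints could then be connected with line segments to rasterize a skeleton image; how ROPA colors and draws those segments is not specified in the captions above.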