Semantically Controllable Augmentations for Generalizable Robot Learning

Zoey Chen, Zhao Mandi, Homanga Bharadhwaj, Mohit Sharma, Shuran Song, Abhishek Gupta, Vikash Kumar

TL;DR

A generative augmentation framework that provides semantically controllable augmentations and rapidly multiplies robot datasets while inducing rich variations that enable real-world generalization, demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications.

Abstract

Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to generalize despite these challenges, it is essential to leverage sources of data or priors beyond the robot's direct experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. These generative models encompass a broad range of real-world scenarios beyond a robot's direct experience and can synthesize novel experiences that expose robotic agents to additional world priors, aiding real-world generalization at no extra cost. In particular, our approach leverages pre-trained generative models as an effective tool for data augmentation. We propose a generative augmentation framework that provides semantically controllable augmentations and rapidly multiplies robot datasets while inducing rich variations that enable real-world generalization. Based on diverse augmentations of robot data, we show how scalable robot manipulation policies can be trained and deployed both in simulation and in unseen real-world environments such as kitchens and tabletops. By demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications, our generative augmentation framework provides a scalable and efficient path for boosting generalization in robot learning at no extra human cost.

Paper Structure

This paper contains 41 sections, 1 equation, 29 figures, and 4 tables.

Figures (29)

  • Figure 1: Our framework takes a small offline dataset containing expert demonstrations and leverages text-to-image generative models to semantically bootstrap the initial dataset into a much larger and more diverse augmented dataset, which can be used to train a robot policy that generalizes to unseen environments and tasks.
  • Figure 2: Our framework provides the ability to augment the scene by changing the object texture (first row), changing the background (second row), adding distractors (third row), and changing object categories (fourth row).
  • Figure 3: In the low-data regime, our semantic augmentation is controllable and allows more physically plausible augmentations. We augment both the RGB and depth information of the scene while preserving visual coherence between the two modalities. This makes the augmentation pipeline more versatile and suitable for a wide range of methods that use RGB-D data.
  • Figure 4: We leverage 3D object assets and simulation, and use text-to-image diffusion models to generate a visually realistic appearance while updating the original depth map, resulting in geometrically consistent augmentation.
  • Figure 5: Scalable augmentation in unstructured environments from video trajectories. (a) We use the location of the end-effector to prompt SAM [kirillov2023segment] and obtain the interaction object mask for inpainting, and keep it consistent across video frames using TrackAnything [trackanything]. Images inside the black box show the original frame with the object mask predicted by SAM. (b) We track the masks for the robot and interaction objects, and randomly inpaint regions in the background returned by SAM, resulting in diverse background augmentation across frames.
  • ...and 24 more figures
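
The mask-protected background augmentation described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the robot and interaction-object masks have already been computed for every frame (the caption attributes this step to SAM and TrackAnything, neither of which is called here). All names below are hypothetical, images are small grayscale grids for brevity, and `random_background` is only a stand-in for a text-to-image inpainting model, not the authors' implementation.

```python
import random
from typing import List

Image = List[List[int]]   # row-major H x W grid of grayscale intensities
Mask = List[List[bool]]   # True marks robot / interaction-object pixels to keep


def composite(frame: Image, keep: Mask, generated: Image) -> Image:
    """Keep masked pixels from the original frame; take the rest from `generated`."""
    return [
        [orig if k else gen for orig, k, gen in zip(frame_row, keep_row, gen_row)]
        for frame_row, keep_row, gen_row in zip(frame, keep, generated)
    ]


def random_background(height: int, width: int, rng: random.Random) -> Image:
    """Hypothetical stand-in for an inpainting model: one random gray level."""
    level = rng.randrange(256)
    return [[level] * width for _ in range(height)]


def augment_video(frames: List[Image], masks: List[Mask], seed: int = 0) -> List[Image]:
    """Replace the unmasked background of each frame with newly sampled content."""
    rng = random.Random(seed)
    augmented = []
    for frame, keep in zip(frames, masks):
        background = random_background(len(frame), len(frame[0]), rng)
        augmented.append(composite(frame, keep, background))
    return augmented


# Usage: a 2 x 2 frame whose diagonal pixels belong to the robot or object.
frame = [[1, 2], [3, 4]]
keep = [[True, False], [False, True]]
print(composite(frame, keep, [[9, 9], [9, 9]]))  # [[1, 9], [9, 4]]
```

Because the protected pixels are copied unchanged, the robot and the object it interacts with look the same in every augmented frame, while everything else in the scene varies.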