Domain Randomization and Generative Models for Robotic Grasping

Joshua Tobin, Lukas Biewald, Rocky Duan, Marcin Andrychowicz, Ankur Handa, Vikash Kumar, Bob McGrew, Jonas Schneider, Peter Welinder, Wojciech Zaremba, Pieter Abbeel

TL;DR

This work tackles generalization in robotic grasping by applying domain randomization to object synthesis: the model is trained on procedurally generated objects and learns an autoregressive grasp distribution that maps scene observations to high-likelihood grasps. The approach enables efficient test-time sampling and sim-to-real transfer, achieving over 90% success on unseen realistic objects in simulation and about 80% in real-world robot trials without any real-object training data. Key contributions include a novel data-generation pipeline, an architecture that pairs an autoregressive grasp sampler, fed by a depth-image encoder, with a separate grasp evaluator, and a training regimen that accommodates non-differentiable evaluation. The results indicate that a model trained only on random objects can generalize to realistic ones, which suggests the approach may extend to other data-driven robotic tasks.

Abstract

Deep learning-based robotic grasping has made significant progress thanks to algorithmic improvements and increased data availability. However, state-of-the-art models are often trained on as few as hundreds or thousands of unique object instances, and as a result generalization can be a challenge. In this work, we explore a novel data generation pipeline for training a deep neural network to perform grasp planning that applies the idea of domain randomization to object synthesis. We generate millions of unique, unrealistic procedurally generated objects, and train a deep neural network to perform grasp planning on these objects. Since the distribution of successful grasps for a given object can be highly multimodal, we propose an autoregressive grasp planning model that maps sensor inputs of a scene to a probability distribution over possible grasps. This model allows us to sample grasps efficiently at test time (or avoid sampling entirely). We evaluate our model architecture and data generation pipeline in simulation and the real world. We find we can achieve a $>$90% success rate on previously unseen realistic objects at test time in simulation despite having only been trained on random objects. We also demonstrate an 80% success rate on real-world grasp attempts despite having only been trained on random simulated objects.
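
The autoregressive model mentioned in the abstract can be written as a chain-rule factorization. As a sketch, assume the grasp is an $n$-dimensional vector $\boldsymbol{g} = (\boldsymbol{g}_1, \ldots, \boldsymbol{g}_n)$ and $s$ is the embedding of the sensor inputs. The symbols $\boldsymbol{g}$, $\boldsymbol{g}_i$, and $s$ follow the Figure 3 caption; the dimension count $n$ is introduced here for illustration only:

$$p(\boldsymbol{g} \mid s) = \prod_{i=1}^{n} p(\boldsymbol{g}_i \mid \boldsymbol{g}_1, \ldots, \boldsymbol{g}_{i-1}, s)$$

Each factor is a distribution over a single grasp dimension, so the product can represent a multimodal joint distribution, and high-likelihood grasps can be found by extending partial grasps one dimension at a time.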

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures.

Figures (10)

  • Figure 1: An overview of our approach. Since creating large numbers of realistic object models is challenging, we train our deep autoregressive model architecture on millions of unrealistic procedurally generated objects (indicated in blue above) and billions of unique grasp attempts. At test time, our model generalizes to realistic objects from the YCB dataset (Calli et al., 2015; indicated in green above) with a 92% success rate.
  • Figure 2: Examples of objects used in our experiments. Left: procedurally generated random objects. Middle: objects from the ShapeNet object dataset. Right: objects from the YCB object dataset.
  • Figure 3: An overview of sampling from our model architecture. Solid lines represent neural networks, and dotted lines represent sampling operations. The model takes as input one or more observations of the target object in the form of depth images. The images are passed to an image representation module $\alpha$, which maps the images to an embedding $s$. The embedding $s$ is the input for the autoregressive module $\beta$, which outputs a distribution over possible grasps $\boldsymbol{g}$ for the object by modeling each dimension $\boldsymbol{g}_i$ of the grasp conditioned on the previous dimensions. We sample $k$ high-likelihood grasps $\boldsymbol{g}^1, \ldots, \boldsymbol{g}^k$ from the model using a beam search. For each of those grasps, a second observation is captured that corresponds to an aligned image in the plane of the potential grasp. A grasp scoring model $f$ maps each aligned image to a score. The grasp with the highest score is selected for execution on the robot.
  • Figure 4: Performance of the algorithm on different synthetic test sets. The full algorithm is able to achieve at least 90% success on previously unseen objects from the YCB dataset when trained on any of the three training sets.
  • Figure 5: Performance of the algorithm compared to baseline approaches. The Full Algorithm and Autoregressive-Only numbers reported use models trained on random data. The Autoregressive-Only baseline uses the model $\gamma$ to sample a single high-likelihood grasp and executes that grasp directly, without evaluating it with the model $f$. The Random baseline samples a random grasp. The Centroid baseline deterministically attempts to grasp the center of mass of the object, with the approach angle sampled randomly.
  • ...and 5 more figures
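
The sampling procedure in the Figure 3 caption (beam search over grasp dimensions, then selection by a scoring model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each grasp dimension is assumed to be discretized into bins, `toy_cond_log_probs` is a hypothetical stand-in for the autoregressive module, and `score_fn` is a hypothetical stand-in for the scoring model $f$ that takes the grasp itself rather than an aligned image.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumption: a grasp is a tuple of discretized bin indices, one per dimension.
Grasp = Tuple[int, ...]
# Given a partial grasp, return log-probabilities over the next dimension's bins.
CondFn = Callable[[Grasp], Sequence[float]]


def beam_search(cond_log_probs: CondFn, num_dims: int, k: int) -> List[Tuple[Grasp, float]]:
    """Return up to k high-likelihood grasps with their total log-probabilities."""
    beam: List[Tuple[Grasp, float]] = [((), 0.0)]
    for _ in range(num_dims):
        candidates: List[Tuple[Grasp, float]] = []
        for prefix, logp in beam:
            # Extend every partial grasp by every bin of the next dimension.
            for bin_idx, lp in enumerate(cond_log_probs(prefix)):
                candidates.append((prefix + (bin_idx,), logp + lp))
        # Keep only the k most likely partial grasps.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam


def select_grasp(candidates: List[Tuple[Grasp, float]],
                 score_fn: Callable[[Grasp], float]) -> Grasp:
    """Pick the candidate that the (hypothetical) scoring model rates highest."""
    return max(candidates, key=lambda c: score_fn(c[0]))[0]


def toy_cond_log_probs(prefix: Grasp) -> List[float]:
    """Toy conditional: 3 bins per dimension; the favored bin depends on the previous one."""
    probs = [0.1, 0.1, 0.1]
    preferred = (prefix[-1] + 1) % 3 if prefix else 0
    probs[preferred] = 0.8
    return [math.log(p) for p in probs]


if __name__ == "__main__":
    beam = beam_search(toy_cond_log_probs, num_dims=3, k=2)
    best = select_grasp(beam, score_fn=lambda g: float(sum(g)))
    print("candidates:", beam)
    print("selected:", best)
```

Separating the two steps mirrors the split described in the caption: the sampler proposes a small set of likely grasps, and the scorer can override the likelihood ranking when choosing which one to execute.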