Domain Randomization and Generative Models for Robotic Grasping
Joshua Tobin, Lukas Biewald, Rocky Duan, Marcin Andrychowicz, Ankur Handa, Vikash Kumar, Bob McGrew, Jonas Schneider, Peter Welinder, Wojciech Zaremba, Pieter Abbeel
TL;DR
This work addresses generalization in robotic grasping by applying domain randomization to object synthesis: the model is trained on procedurally generated objects, and an autoregressive model maps scene observations to a probability distribution over grasps. The approach allows efficient sampling at test time and transfers from simulation to the real world, achieving over 90% success on unseen realistic objects in simulation and about 80% in real-world robot trials without any real-object training data. Key contributions include a novel data-generation pipeline, a two-network architecture that combines a depth encoder and autoregressive grasp sampler with a separate grasp evaluator, and a training regimen that accommodates non-differentiable evaluation. The results suggest that training on procedurally generated objects is a practical route to scalable, data-driven grasping, and that the approach may extend to other robotic tasks.
Abstract
Deep learning-based robotic grasping has made significant progress thanks to algorithmic improvements and increased data availability. However, state-of-the-art models are often trained on as few as hundreds or thousands of unique object instances, and as a result generalization can be a challenge. In this work, we explore a novel data generation pipeline for training a deep neural network to perform grasp planning that applies the idea of domain randomization to object synthesis. We generate millions of unique, unrealistic procedurally generated objects, and train a deep neural network to perform grasp planning on these objects. Since the distribution of successful grasps for a given object can be highly multimodal, we propose an autoregressive grasp planning model that maps sensor inputs of a scene to a probability distribution over possible grasps. This model allows us to sample grasps efficiently at test time (or avoid sampling entirely). We evaluate our model architecture and data generation pipeline in simulation and the real world. We find we can achieve a >90% success rate on previously unseen realistic objects at test time in simulation despite having only been trained on random objects. We also demonstrate an 80% success rate on real-world grasp attempts despite having only been trained on random simulated objects.
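To make the idea of an autoregressive grasp distribution concrete, the following is a minimal sketch and not the architecture used in this work. It assumes a grasp is discretized into a few coordinates, each taking one of a fixed number of bins, and it replaces the learned network with a hypothetical toy model with random weights. All names here (`make_toy_model`, `conditional_log_probs`, `grasp_log_prob`, `decode_grasp`) are illustrative assumptions. The sketch shows the chain-rule factorization p(g | o) = p(g_1 | o) p(g_2 | g_1, o) ..., which lets the model represent multimodal distributions while sampling one coordinate at a time. It also shows greedy decoding as one possible way to avoid sampling.

```python
import math
import random


def make_toy_model(n_dims, n_bins, feat_dim, seed=0):
    """Hypothetical stand-in for a trained network: random linear weights.

    Step i holds weights on the observation features and on each of the
    i previously chosen grasp coordinates.
    """
    rng = random.Random(seed)
    model = []
    for step in range(n_dims):
        w_obs = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(n_bins)]
        w_prev = [
            [[rng.gauss(0, 1) for _ in range(n_bins)] for _ in range(n_bins)]
            for _ in range(step)
        ]
        model.append((w_obs, w_prev))
    return model


def conditional_log_probs(model, obs_feat, prefix):
    """Log p(g_i = k | g_1..g_{i-1}, o) for every bin k, where i = len(prefix)."""
    w_obs, w_prev = model[len(prefix)]
    logits = []
    for k in range(len(w_obs)):
        z = sum(w * x for w, x in zip(w_obs[k], obs_feat))
        z += sum(w_prev[j][k][g] for j, g in enumerate(prefix))
        logits.append(z)
    # Numerically stable log-softmax.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def grasp_log_prob(model, obs_feat, grasp):
    """Chain rule: the log-probability of a grasp is the sum of its conditionals."""
    return sum(
        conditional_log_probs(model, obs_feat, grasp[:i])[g]
        for i, g in enumerate(grasp)
    )


def decode_grasp(model, obs_feat, rng=None):
    """Build a grasp one coordinate at a time.

    With an rng, each coordinate is sampled from its conditional distribution.
    Without one, the most likely bin is taken at each step (greedy decoding,
    which is not guaranteed to find the globally most likely grasp).
    """
    grasp = []
    for _ in range(len(model)):
        logp = conditional_log_probs(model, obs_feat, grasp)
        if rng is None:
            g = max(range(len(logp)), key=logp.__getitem__)
        else:
            g = rng.choices(range(len(logp)), weights=[math.exp(v) for v in logp])[0]
        grasp.append(g)
    return grasp


if __name__ == "__main__":
    model = make_toy_model(n_dims=3, n_bins=4, feat_dim=5)
    obs = [0.1, -0.3, 0.7, 0.0, 1.2]  # placeholder for encoded sensor input
    sampled = decode_grasp(model, obs, random.Random(1))
    greedy = decode_grasp(model, obs)
    print("sampled:", sampled, "log-prob:", grasp_log_prob(model, obs, sampled))
    print("greedy: ", greedy, "log-prob:", grasp_log_prob(model, obs, greedy))
```

Because every conditional is a normalized distribution over bins, the product of conditionals is a valid distribution over whole grasps, so sampled candidates come with exact likelihoods that can be used to rank them.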
