Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects
Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, Stan Birchfield
TL;DR
This work tackles six-degree-of-freedom (6-DoF) pose estimation of known rigid objects from a single RGB image in clutter, with the goal of enabling real-time semantic grasping. It introduces DOPE, a two-step pipeline: a multi-stage convolutional network outputs 2D keypoint belief maps and vector fields, and a Perspective-n-Point (PnP) step then recovers each object’s pose from the detected keypoints. The network is trained entirely on synthetic data that combines domain-randomized images with photorealistic FAT scenes, generated in Unreal Engine 4 via the NDDS plugin. Results show that synthetic-only training performs competitively with a state-of-the-art network trained on a combination of real and synthetic data, generalizes better to novel environments including extreme lighting, and supports real-time manipulation on a Baxter robot. The approach reduces reliance on labeled real data, with potential impact on household object manipulation and service robotics.
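The belief-map-then-PnP idea summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The keypoint set (eight cuboid corners plus the centroid), the camera intrinsics, the image size, and every function name here are assumptions chosen for illustration. The sketch builds synthetic belief maps, reads off their peaks as 2D keypoints, and computes the reprojection error that a PnP solver minimizes over candidate poses. It does not solve PnP itself; in practice that step could be handed to an existing solver such as OpenCV's `cv2.solvePnP`.

```python
import math


def project_points(points_3d, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of object-frame 3D points to pixel coordinates."""
    pixels = []
    for p in points_3d:
        # Object frame -> camera frame: X_cam = R * p + t
        xc = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
              for i in range(3)]
        pixels.append((fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy))
    return pixels


def gaussian_belief_map(width, height, center, sigma=2.0):
    """Synthetic stand-in for one network output channel (assumed shape)."""
    cu, cv = center
    return [[math.exp(-((u - cu) ** 2 + (v - cv) ** 2) / (2 * sigma ** 2))
             for u in range(width)] for v in range(height)]


def belief_peak(belief_map, threshold=0.1):
    """Return (u, v) of the strongest response, or None if below threshold."""
    best, best_uv = threshold, None
    for v, row in enumerate(belief_map):
        for u, value in enumerate(row):
            if value > best:
                best, best_uv = value, (u, v)
    return best_uv


def mean_reprojection_error(detected, projected):
    """Mean pixel distance between detected and projected keypoints.

    This is the quantity a PnP solver minimizes over candidate poses.
    """
    dists = [math.hypot(d[0] - p[0], d[1] - p[1])
             for d, p in zip(detected, projected)]
    return sum(dists) / len(dists)


def demo():
    # Hypothetical object: a 10 cm cube; keypoints = 8 corners + centroid.
    half = 0.05
    keypoints_3d = [(sx * half, sy * half, sz * half)
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    keypoints_3d.append((0.0, 0.0, 0.0))

    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    cam = dict(fx=100.0, fy=100.0, cx=32.0, cy=24.0)  # assumed intrinsics

    # "Ground truth" pose: object 0.5 m in front of the camera.
    true_px = project_points(keypoints_3d, identity, (0.0, 0.0, 0.5), **cam)

    # Stage 1 (simulated): one belief map per keypoint, then peak extraction.
    detected = [belief_peak(gaussian_belief_map(64, 48, c)) for c in true_px]

    # Stage 2 (objective only): compare two candidate poses.
    err_true = mean_reprojection_error(detected, true_px)
    wrong_px = project_points(keypoints_3d, identity, (0.1, 0.0, 0.5), **cam)
    err_wrong = mean_reprojection_error(detected, wrong_px)
    return err_true, err_wrong


if __name__ == "__main__":
    err_true, err_wrong = demo()
    print(f"error at true pose:  {err_true:.2f} px")
    print(f"error at wrong pose: {err_wrong:.2f} px")
```

The correct pose yields a sub-pixel error (only the rounding of peaks to integer pixels remains), while a pose shifted by 10 cm yields a much larger one. That gap is what lets a PnP solver recover the pose from 2D-3D keypoint correspondences.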
Abstract
Using synthetic data for training deep neural networks for robotic manipulation holds the promise of an almost unlimited amount of pre-labeled training data, generated safely out of harm's way. One of the key challenges of synthetic data, to date, has been to bridge the so-called reality gap, so that networks trained on synthetic data operate correctly when exposed to real-world data. We explore the reality gap in the context of 6-DoF pose estimation of known objects from a single RGB image. We show that for this problem the reality gap can be successfully spanned by a simple combination of domain randomized and photorealistic data. Using synthetic data generated in this manner, we introduce a one-shot deep neural network that is able to perform competitively against a state-of-the-art network trained on a combination of real and synthetic data. To our knowledge, this is the first deep network trained only on synthetic data that is able to achieve state-of-the-art performance on 6-DoF object pose estimation. Our network also generalizes better to novel environments including extreme lighting conditions, for which we show qualitative results. Using this network we demonstrate a real-time system estimating object poses with sufficient accuracy for real-world semantic grasping of known household objects in clutter by a real robot.
