Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects

Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, Stan Birchfield

TL;DR

This work tackles the problem of estimating the six-degree-of-freedom (6-DoF) pose of known rigid objects in clutter from a single RGB image, to enable real-time semantic grasping. It introduces DOPE, a two-stage pipeline in which a multi-stage convnet outputs 2D keypoint belief maps and vector fields, and a PnP step then recovers each object’s pose. The network is trained entirely on synthetic data generated by combining domain randomization with photorealistic FAT scenes in Unreal Engine 4 via the NDDS plugin. Results show that synthetic-only training performs competitively with a state-of-the-art network trained on a mix of real and synthetic data, generalizes to extreme lighting, and supports real-time robotic manipulation with a Baxter robot. The approach reduces reliance on labeled real data and offers robust pose estimation across varied environments, with potential impact on household object manipulation and service robotics.
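The step between the belief maps and PnP can be illustrated with a small sketch: before a pose can be solved for, candidate 2D keypoints have to be read off each belief map. The code below is a minimal, hypothetical version of that step, assuming a belief map is a plain 2D grid of scores and that a keypoint is any cell above a threshold that is strictly larger than its eight neighbours. The function name `find_peaks`, the threshold value, and the toy grid are illustrative assumptions, not taken from the DOPE implementation.

```python
from typing import List, Tuple

Peak = Tuple[int, int, float]  # (row, col, score)


def find_peaks(belief: List[List[float]], threshold: float = 0.5) -> List[Peak]:
    """Return cells that exceed `threshold` and are strict local maxima.

    Assumption: a "peak" is a cell strictly greater than all of its
    8-connected neighbours, so flat plateaus yield no peak at all.
    """
    rows, cols = len(belief), len(belief[0])
    peaks: List[Peak] = []
    for r in range(rows):
        for c in range(cols):
            value = belief[r][c]
            if value < threshold:
                continue
            # Collect the neighbours that actually exist (handles borders).
            neighbours = [
                belief[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c, value))
    return peaks


# Toy 4x4 belief map (made-up values) with two clear peaks.
toy_belief = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.2, 0.3],
]
toy_peaks = find_peaks(toy_belief)
```

In a full pipeline, peaks like these would be grouped per object instance and paired with the known 3D model points, after which a PnP solver estimates the 6-DoF pose. The grouping and the solver are deliberately left out of this sketch.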

Abstract

Using synthetic data for training deep neural networks for robotic manipulation holds the promise of an almost unlimited amount of pre-labeled training data, generated safely out of harm's way. One of the key challenges of synthetic data, to date, has been to bridge the so-called reality gap, so that networks trained on synthetic data operate correctly when exposed to real-world data. We explore the reality gap in the context of 6-DoF pose estimation of known objects from a single RGB image. We show that for this problem the reality gap can be successfully spanned by a simple combination of domain randomized and photorealistic data. Using synthetic data generated in this manner, we introduce a one-shot deep neural network that is able to perform competitively against a state-of-the-art network trained on a combination of real and synthetic data. To our knowledge, this is the first deep network trained only on synthetic data that is able to achieve state-of-the-art performance on 6-DoF object pose estimation. Our network also generalizes better to novel environments including extreme lighting conditions, for which we show qualitative results. Using this network we demonstrate a real-time system estimating object poses with sufficient accuracy for real-world semantic grasping of known household objects in clutter by a real robot.

Paper Structure

This paper contains 14 sections, 5 figures.

Figures (5)

  • Figure 1: Example images from our domain randomized (left) and photorealistic (right) datasets used for training.
  • Figure 2: Accuracy-threshold curves for our DOPE method compared with PoseCNN (Xiang et al., 2018) for 5 YCB objects on the YCB-Video dataset. Shown are versions of our method trained using domain-randomized data only (DR), synthetic photorealistic data only (photo), and both (DR+photo). The numbers in the legend display the area under the curve (AUC). The vertical dashed line indicates the threshold corresponding approximately to the level of accuracy necessary for grasping using our robotic manipulator (2 cm). Our method (blue curve) yields the best results for 4 out of 5 objects.
  • Figure 3: Pose estimation of YCB objects on data showing extreme lighting conditions. Top: PoseCNN (Xiang et al., 2018), which was trained on a mixture of synthetic data and real data from the YCB-Video dataset, struggles to generalize to this scenario, which was captured with a different camera and includes extreme poses, severe occlusion, and extreme lighting changes. Bottom: Our proposed DOPE method generalizes to these extreme real-world conditions even though it was trained only on synthetic data; all objects are detected except the severely occluded soup can (2nd column) and three dark cans (3rd column).
  • Figure 4: Accuracy-threshold curves with various numbers of stages, showing the benefit of additional stages to resolve ambiguity from earlier stages. The table shows the total execution time, including object extraction and PnP, and performance of the system for different numbers of stages.
  • Figure 5: Robotic pick-and-place of a potted meat can on a cracker box. Note that the can is initially resting on another object rather than on the table, and that the destination box is not required to be aligned with the table, since the system estimates the full 6-DoF pose of all objects. Note also that the can is aligned with the box (as desired) and within a couple of centimeters of the center of the box.
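
The accuracy-threshold curves and AUC values mentioned in the captions for Figures 2 and 4 can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: it assumes each test frame yields one scalar pose error in metres (for example an average distance between model points under the estimated and ground-truth poses), that accuracy at a threshold is the fraction of frames whose error does not exceed it, and that the AUC is the trapezoidal area normalised to the range [0, 1]. The 10 cm upper limit, the step count, and the function names are assumptions.

```python
from typing import List, Sequence


def accuracy_at(errors_m: Sequence[float], threshold_m: float) -> float:
    """Fraction of pose errors (metres) that fall within the threshold."""
    return sum(e <= threshold_m for e in errors_m) / len(errors_m)


def accuracy_threshold_curve(
    errors_m: Sequence[float], max_threshold_m: float = 0.10, steps: int = 100
) -> List[float]:
    """Accuracy sampled at steps + 1 evenly spaced thresholds in [0, max]."""
    return [
        accuracy_at(errors_m, max_threshold_m * i / steps)
        for i in range(steps + 1)
    ]


def normalised_auc(curve: Sequence[float]) -> float:
    """Trapezoidal area under the curve, scaled so a perfect curve gives 1.0."""
    steps = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps)) / steps


# Made-up per-frame errors in metres, for illustration only.
example_errors = [0.01, 0.03, 0.015, 0.05]
grasp_accuracy = accuracy_at(example_errors, 0.02)  # accuracy at the 2 cm line
example_auc = normalised_auc(accuracy_threshold_curve(example_errors))
```

Reading the curve at the 2 cm threshold, as `grasp_accuracy` does, corresponds to the vertical dashed line described in the Figure 2 caption, while the AUC summarises the whole curve in a single number.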