CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Sassan Mokhtar, Eugenio Chisari, Nick Heppert, Abhinav Valada

TL;DR

This paper addresses joint 3D shape reconstruction and 6-DoF grasp estimation for articulated objects from RGB-D images. It introduces CenterArt, a vision-based architecture with an image encoder and an SGDF-based decoder that uses shape and joint latent codes to predict geometry and 6-DoF grasps, then transforms the results into the camera frame. A two-pronged dataset generation strategy builds a large set of valid 6-DoF grasps from PartNet-Mobility objects and realistic Sapien kitchen scenes. Empirical results show CenterArt outperforming the RL-based baseline (UMPNet) by up to $52\%$ in SR on simple scenes and remaining robust to depth noise and scene complexity, with an overall SR improvement of about $28\%$ across the tested scenarios.

Abstract

Precisely grasping and reconstructing articulated objects is key to enabling general robotic manipulation. In this paper, we propose CenterArt, a novel approach for simultaneous 3D shape reconstruction and 6-DoF grasp estimation of articulated objects. CenterArt takes RGB-D images of the scene as input and first predicts the shape and joint codes through an encoder. The decoder then leverages these codes to reconstruct 3D shapes and estimate 6-DoF grasp poses of the objects. We further develop a mechanism for generating a dataset of 6-DoF grasp ground truth poses for articulated objects. CenterArt is trained on realistic scenes containing multiple articulated objects with randomized designs, textures, lighting conditions, and realistic depths. We perform extensive experiments demonstrating that CenterArt outperforms existing methods in accuracy and robustness.

Paper Structure (9 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of CenterArt. First, input RGB-D images are encoded to predict object heatmaps, poses, shape codes, and joint codes in a per-pixel manner. Next, the peaks of heatmaps are used to detect the objects. The SGDF decoder then utilizes the predicted shape code and joint code to output the shape and grasp of detected objects. Finally, the estimated poses are used to transform the predicted 3D shapes and 6-DoF grasps from the canonical frame to the camera frame.
  • Figure 2: Generated kitchen scenes
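
The detection and frame-transformation steps described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the 3x3 local-maximum rule, the `threshold` value, and the representation of poses and grasps as 4x4 homogeneous matrices are all assumptions made for the example.

```python
import numpy as np


def find_heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) pixels that are 3x3 local maxima above `threshold`.

    Hypothetical helper: the peak rule and threshold are assumptions.
    """
    h, w = heatmap.shape
    # Pad with -inf so border pixels can still count as local maxima.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views that make up each pixel's 3x3 neighborhood.
    neighborhood = np.stack(
        [padded[dr:dr + h, dc:dc + w] for dr in range(3) for dc in range(3)]
    )
    is_local_max = heatmap >= neighborhood.max(axis=0)
    peaks = np.argwhere(is_local_max & (heatmap > threshold))
    return [(int(r), int(c)) for r, c in peaks]


def to_camera_frame(T_cam_obj, points_canonical, grasps_canonical):
    """Map canonical-frame shape points and grasp poses into the camera frame.

    T_cam_obj:        (4, 4) assumed object pose (canonical -> camera).
    points_canonical: (N, 3) reconstructed surface points.
    grasps_canonical: (M, 4, 4) grasp poses as homogeneous transforms.
    """
    ones = np.ones((len(points_canonical), 1))
    points_h = np.concatenate([points_canonical, ones], axis=1)  # (N, 4)
    points_cam = (T_cam_obj @ points_h.T).T[:, :3]
    # Matrix multiplication broadcasts the pose over the M grasp matrices.
    grasps_cam = T_cam_obj @ grasps_canonical
    return points_cam, grasps_cam
```

In a full pipeline, each detected peak would index into the per-pixel pose, shape-code, and joint-code maps, and the decoder output for that object would then be passed through a transform like `to_camera_frame`.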