CenterArt: Joint Shape Reconstruction and 6-DoF Grasp Estimation of Articulated Objects

Sassan Mokhtar; Eugenio Chisari; Nick Heppert; Abhinav Valada

TL;DR

This paper addresses joint 3D shape reconstruction and 6-DoF grasp estimation for articulated objects from RGB-D images. It introduces CenterArt, a vision-based architecture with an image encoder and an SGDF-based decoder that uses shape and joint latent codes to predict geometry and 6-DoF grasps, then transforms the results into the camera frame. A two-pronged dataset generation strategy builds a large set of valid 6-DoF grasps from PartNet-Mobility objects and realistic Sapien kitchen scenes. Empirically, CenterArt outperforms the RL-based baseline (UMPNet) by up to $52\%$ in success rate (SR) on simple scenes and is robust to depth noise and scene complexity, with an overall SR improvement of about $28\%$ across the tested scenarios.

Abstract

Precisely grasping and reconstructing articulated objects is key to enabling general robotic manipulation. In this paper, we propose CenterArt, a novel approach for simultaneous 3D shape reconstruction and 6-DoF grasp estimation of articulated objects. CenterArt takes RGB-D images of the scene as input and first predicts the shape and joint codes through an encoder. The decoder then leverages these codes to reconstruct 3D shapes and estimate 6-DoF grasp poses of the objects. We further develop a mechanism for generating a dataset of 6-DoF grasp ground truth poses for articulated objects. CenterArt is trained on realistic scenes containing multiple articulated objects with randomized designs, textures, lighting conditions, and realistic depths. We perform extensive experiments demonstrating that CenterArt outperforms existing methods in accuracy and robustness.

Paper Structure (9 sections, 2 equations, 2 figures, 1 table)

Introduction
Related Work
Technical Approach
Image Encoder
Shape and Grasp Decoder
Full CenterArt Inference
Dataset Generation
Experimental Results
Conclusion

Figures (2)

Figure 1: Overview of CenterArt. First, input RGB-D images are encoded to predict object heatmaps, poses, shape codes, and joint codes in a per-pixel manner. Next, the peaks of heatmaps are used to detect the objects. The SGDF decoder then utilizes the predicted shape code and joint code to output the shape and grasp of detected objects. Finally, the estimated poses are used to transform the predicted 3D shapes and 6-DoF grasps from the canonical frame to the camera frame.
Figure 2: Generated kitchen scenes
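
The detection and frame-transformation steps described in the Figure 1 caption can be illustrated with a minimal sketch. It makes several assumptions that this summary does not confirm: a peak is taken to be a pixel above a threshold that is strictly larger than its 3x3 neighbours, object and grasp poses are written as 4x4 homogeneous matrices, and all names (`find_peaks`, `to_camera_frame`, `cam_T_obj`, `obj_T_grasp`) and numeric values are hypothetical. It is not the authors' implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def find_peaks(heatmap: Matrix, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) of pixels that reach `threshold` and are strictly
    larger than every pixel in their 3x3 neighbourhood (assumed peak rule)."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def to_camera_frame(cam_T_obj: Matrix, obj_T_grasp: Matrix) -> Matrix:
    """Express a canonical-frame grasp pose in the camera frame."""
    return matmul4(cam_T_obj, obj_T_grasp)


# Toy 4x4 heatmap with two object centres (illustrative values only).
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.3],
    [0.0, 0.0, 0.3, 0.8],
]
peaks = find_peaks(heatmap)

# Hypothetical object pose: 90-degree rotation about z plus a translation.
cam_T_obj = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical canonical-frame grasp: identity rotation, offset along x.
obj_T_grasp = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
cam_T_grasp = to_camera_frame(cam_T_obj, obj_T_grasp)
```

In this sketch each detected peak would index the per-pixel pose, shape code, and joint code maps; the decoder output for that object is then mapped to the camera frame by a single matrix product per grasp.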
