GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator
Xingyu Liu, Ruida Zhang, Chenyangguang Zhang, Gu Wang, Jiwen Tang, Zhigang Li, Xiangyang Ji
TL;DR
The paper tackles 6D pose estimation by replacing non-differentiable, traditional pipelines with a geometry-guided, fully learning-based approach. It introduces GDRNPP, which has two parts. A Geometry-Guided Direct Regression Network (GDRN) predicts the pose end-to-end from monocular RGB using image-like geometric maps. A depth-enabled refinement module then establishes robust 3D-3D correspondences through a differentiable 3D optical flow framework guided by coordinate maps. Key contributions include Patch-P$n$P regression from dense 2D-3D correspondences and surface region attention, a scale-invariant translation representation, and a geometry-guided refinement that uses depth to improve accuracy while handling symmetry. The method achieves state-of-the-art results on the BOP benchmarks for both RGB and RGB-D data, with accuracy and speed suitable for robotics and AR applications, and it highlights the value of integrating geometric priors with end-to-end learning.
Abstract
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Because direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy: they first establish 2D-3D or 3D-3D correspondences, then apply RANSAC-based PnP or the Kabsch algorithm, and further employ ICP for refinement. Although this improves performance, integrating traditional techniques makes the networks time-consuming and not end-to-end trainable. In contrast, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module that enhances pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) topped the leaderboard of the BOP Challenge for two consecutive years, becoming the first to surpass all prior methods that relied on traditional techniques in both accuracy and speed. The code and models are available at https://github.com/shanice-l/gdrnpp_bop2022.
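For readers unfamiliar with the traditional step that the abstract contrasts against, the following is a minimal NumPy sketch of the classical Kabsch algorithm, which recovers a rigid transform from 3D-3D correspondences in closed form. It is an illustration only and is not taken from the GDRNPP code base; the function name `kabsch` and its argument names are hypothetical, and it assumes the correspondences are already known and outlier-free (in practice such solvers are usually wrapped in RANSAC, as the abstract notes).

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical names).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Center both point sets on their centroids.
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)

    # 3x3 cross-covariance between the centered sets.
    H = (src - src_mean).T @ (dst - dst_mean)

    # The SVD of the cross-covariance yields the optimal rotation.
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    # rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation maps the rotated source centroid onto the target centroid.
    t = dst_mean - R @ src_mean
    return R, t
```

Calling `kabsch(model_points, observed_points)` on two `(N, 3)` arrays returns a rotation matrix and a translation vector. The SVD step has a fixed closed form, but when it is combined with RANSAC-style hypothesis sampling the overall pipeline is no longer end-to-end trainable, which is the limitation the abstract attributes to pipelines built on traditional techniques.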
