GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator

Xingyu Liu, Ruida Zhang, Chenyangguang Zhang, Gu Wang, Jiwen Tang, Zhigang Li, Xiangyang Ji

TL;DR

The paper tackles 6D pose estimation by replacing non-differentiable, traditional pipelines with a geometry-guided, fully learning-based approach. It introduces GDRNPP, which combines a Geometry-guided Direct Regression Network (GDRN) that predicts pose end-to-end from monocular RGB images using image-like geometric maps with a depth-enabled refinement module that establishes robust 3D-3D correspondences through a differentiable 3D optical flow framework guided by coordinate maps. Key contributions include Patch-P$n$P regression from dense 2D-3D correspondences and surface region attention, a scale-invariant translation representation, and a geometry-guided refinement that leverages depth to improve accuracy while handling symmetry. The method achieves state-of-the-art results on the BOP benchmarks for both RGB and RGB-D data, with accuracy and efficiency suitable for robotics and AR applications, highlighting the value of integrating geometric priors with end-to-end learning.

Abstract

6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Given that direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy by first establishing 2D-3D or 3D-3D correspondences followed by applying the RANSAC-based PnP or Kabsch algorithms, and further employing ICP for refinement. Despite the performance enhancement, the integration of traditional techniques makes the networks time-consuming and not end-to-end trainable. Orthogonal to them, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module, enhancing pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) conquered the leaderboard of the BOP Challenge for two consecutive years, becoming the first to surpass all prior methods that relied on traditional techniques in both accuracy and speed. The code and models are available at https://github.com/shanice-l/gdrnpp_bop2022.
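
To make the "traditional" baseline concrete, the following is a minimal sketch of the Kabsch algorithm mentioned in the abstract: given matched 3D-3D correspondences, it recovers the rigid transform in closed form via an SVD. This is an illustration of the classical technique only, not code from the GDRNPP repository; the function name `kabsch` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of matched 3D points (hypothetical inputs).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: transform random "model" points by a known pose, then recover it.
rng = np.random.default_rng(0)
model_pts = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.1, -0.2, 0.8])
observed_pts = model_pts @ R_true.T + t_true

R_est, t_est = kabsch(model_pts, observed_pts)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

In indirect pipelines this solver is typically wrapped in RANSAC to reject outlier correspondences, which adds runtime and is one of the non-learned steps the abstract contrasts with the fully learning-based refinement.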

Paper Structure

This paper contains 18 sections, 16 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of GDRNPP. First, we directly regress the 6D object pose from a single RGB image using a CNN and the learnable Patch-P$n$P, guided by intermediate geometric features including dense 2D-3D correspondences and surface region attention. When depth information is available, the network additionally predicts 3D optical flow to establish 3D-3D correspondences between the observed and rendered RGB-D images and refine the pose. The details are elaborated in Fig. 2 and Fig. 3.
  • Figure 2: Framework of GDRN. Given an RGB image $I$, our GDRN takes the zoomed-in RoI (Dynamic Zoom-In for training, off-the-shelf detections for testing) as input and predicts several intermediate geometric features. Then the Patch-P$n$P directly regresses the 6D object pose from Dense Correspondences ($\mathbf{M}_\text{2D-3D}$) and Surface Region Attention ($\mathbf{M}_\text{SRA}$).
  • Figure 3: Framework of the Refinement Module. Starting with an initial pose $P_0^{(0)}$, perturbations are applied to generate a set of object poses $\{P_i \, | \, i = 1, 2, \dots, n\}$. Correspondences between the observed image $I_0$ and the rendered images $\{I_i\}$ are established in two parallel ways: (1) using a coordinate-guided 3D optical flow estimator to obtain $\mathbf{x}_{0 \rightarrow i}^{(\text{flow}, t)}$ and $\mathbf{x}_{i \rightarrow 0}^{(\text{flow}, t)}$, and (2) using the predicted pose to derive $\mathbf{x}_{0 \rightarrow i}^{(\text{pose}, t)}$ and $\mathbf{x}_{i \rightarrow 0}^{(\text{pose}, t)}$. By aligning these correspondences, the pose $P_0$ is iteratively refined, updating $P_0^{(t)}$ to $P_0^{(t+1)}$. This optimization is repeated for $T=10$ iterations (inner loop), after which a new set of poses $\{P_i\}$ is generated, and the corresponding images are rendered. The entire process is repeated $N_\text{out}=4$ times (outer loop) to achieve the final result.
  • Figure 4: Overview of the 3D optical flow estimator. We first use the correspondences inferred from the previous pose prediction to sample the rendered coordinate map $\mathbf{C}_i$ and get $\mathbf{C}'_0$. Then we concatenate the predicted coordinate map $\mathbf{C}_0$ and $\mathbf{C}_0'$ and mask the visible region. The coordinate feature $\mathbf{c}_{0 \rightarrow i}$ is extracted by a convolutional network $\Lambda$ and weighted dynamically by $\mathbf{\omega}_c$ according to the quality of the coordinate map. The weighted coordinate feature, context feature, depth feature, the correlation feature $\mathbf{s}_{0 \rightarrow i}$, along with the hidden state $\mathbf{h}_{0 \rightarrow i}$ are fed into the GRU-based update module, which outputs the correspondences $\mathbf{x}_{0 \rightarrow i}^{(\text{flow}, t)}$ and a new hidden state $\mathbf{h}_{0 \rightarrow i}^{(t)}$. The correspondences $\mathbf{x}_{i \rightarrow 0}^{(\text{flow}, t)}$ are calculated in a symmetric manner.
  • Figure 5: Results of P$n$P variants on Synthetic Sphere. (a, b): We compare our Patch-P$n$P module with the traditional RANSAC-based EP$n$P (Lepetit et al., 2009) and another learning-based P$n$P (Hu et al., 2020). The pose error is reported as the ADD error relative to the sphere's diameter (y-axis in log scale). (c): Zoomed-in ($64\times64$) synthetic examples for Patch-P$n$P.
  • ...and 3 more figures
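
The Figure 5 caption reports pose error as ADD relative to the sphere's diameter. As a minimal sketch, assuming the usual definition of ADD (the mean distance between model points transformed by the ground-truth pose and by the estimated pose), the metric could be computed as below. The function names and the synthetic unit sphere are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """Mean distance between model points under the ground-truth and predicted poses."""
    pts_gt = points @ R_gt.T + t_gt
    pts_pred = points @ R_pred.T + t_pred
    return float(np.linalg.norm(pts_gt - pts_pred, axis=1).mean())


def relative_add_error(points, diameter, R_gt, t_gt, R_pred, t_pred):
    """ADD error expressed as a fraction of the object's diameter."""
    return add_error(points, R_gt, t_gt, R_pred, t_pred) / diameter


# Hypothetical example: points on a unit sphere (diameter 2.0), with the
# predicted pose shifted by 0.02 along the camera z-axis.
rng = np.random.default_rng(0)
sphere_pts = rng.normal(size=(1000, 3))
sphere_pts /= np.linalg.norm(sphere_pts, axis=1, keepdims=True)

R_gt = np.eye(3)
t_gt = np.array([0.0, 0.0, 1.0])
R_pred = np.eye(3)
t_pred = np.array([0.0, 0.0, 1.02])

rel_err = relative_add_error(sphere_pts, 2.0, R_gt, t_gt, R_pred, t_pred)
print(rel_err)
```

With a pure translation offset every point moves by the same 0.02, so the relative error here is about 0.01, i.e. 1% of the diameter. Normalizing by the diameter makes errors comparable across objects of different sizes.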