HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

Yulin Wang, Mengting Hu, Hongli Li, Chen Luo

TL;DR

This work tackles seen-object pose estimation by predicting coordinates for both the object’s front and back surfaces and densely sampling points between them to create ultra-dense 2D-3D correspondences used by a RANSAC-PnP solver. A key contribution is Hierarchical Continuous Coordinate Encoding (HCCE), which represents surface coordinates as multi-level continuous codes and uses a histogram-based scheme to adapt learning weights across levels, improving stability and accuracy. Empirically, the method achieves competitive BOP scores on seven core datasets and outperforms state-of-the-art RGB-based methods by up to 2.4% in BOP score, with further gains when RGB-D data are involved, and also improves 2D segmentation accuracy. The approach emphasizes dual-surface information and dense sampling to strengthen pose estimation, offering practical improvements for industrial and robotics applications, though the model remains object-specific rather than universally generalizable.
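To make the idea of multi-level continuous codes concrete, here is a toy sketch for a single normalized coordinate component. It assumes each level is obtained by mirror-folding the previous one, which keeps every level continuous, and that decoding thresholds the codes into bits first. The folding rule, the 0.5 threshold, and the function names `encode_folded` and `decode_folded` are illustrative assumptions, not the paper's exact HCCE formulation.

```python
import numpy as np

def encode_folded(x, levels=4):
    """Encode normalized coordinates x in [0, 1] as `levels` continuous codes.

    Assumption: level k+1 is a mirror fold of level k, so no level
    contains the abrupt dark/light jumps of a binary stripe pattern.
    """
    x = np.asarray(x, dtype=np.float64)
    codes = [x]
    for _ in range(levels - 1):
        # Tent-map fold: rises from 0 to 1, then falls back to 0.
        codes.append(1.0 - np.abs(2.0 * codes[-1] - 1.0))
    return np.stack(codes, axis=-1)  # shape (..., levels)

def decode_folded(codes):
    """Threshold continuous codes to bits, then decode to bin centers."""
    # Thresholding folded codes yields a reflected Gray code.
    gray = (np.asarray(codes) >= 0.5).astype(np.int64)
    # Gray -> plain binary: each bit is the XOR of all Gray bits so far.
    binary = np.bitwise_xor.accumulate(gray, axis=-1)
    levels = gray.shape[-1]
    weights = 0.5 ** np.arange(1, levels + 1)
    # Add half a bin width so the result is the center of the decoded bin.
    return binary @ weights + 0.5 ** (levels + 1)

xs = np.linspace(0.0, 1.0, 101)
recovered = decode_folded(encode_folded(xs, levels=4))
max_error = float(np.max(np.abs(recovered - xs)))
```

With four levels the coordinate is recovered to within half of a 1/16-wide bin; adding levels halves that bound each time.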

Abstract

In pose estimation for seen objects, a prevalent pipeline uses neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, which improve the accuracy of pose estimation based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that the proposed approach outperforms existing state-of-the-art (SOTA) methods listed on the BOP website across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
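The sampling step described in the abstract can be sketched as follows. A camera ray through a pixel is a straight line in the object frame, so every point on the segment between that pixel's front and back surface coordinates projects to the same pixel. The sketch assumes uniform linear interpolation along that segment; the function name `build_ultra_dense_correspondences` and the parameter `n_mid` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def build_ultra_dense_correspondences(pixels, q_front, q_back, n_mid=8):
    """Stack front, interior, and back 3D points with their shared pixels.

    pixels  : (N, 2) pixel coordinates inside the predicted mask
    q_front : (N, 3) predicted front-surface coordinates, one per pixel
    q_back  : (N, 3) predicted back-surface coordinates, one per pixel
    n_mid   : interior samples per pixel (hypothetical parameter)
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    q_front = np.asarray(q_front, dtype=np.float64)
    q_back = np.asarray(q_back, dtype=np.float64)

    # Interpolation weights: 0 is the front surface, 1 is the back surface.
    t = np.linspace(0.0, 1.0, n_mid + 2)[:, None, None]      # (S, 1, 1)
    pts3d = (1.0 - t) * q_front[None] + t * q_back[None]     # (S, N, 3)

    # Every sample along a ray shares the pixel it was predicted at.
    pts2d = np.broadcast_to(pixels[None], (t.shape[0],) + pixels.shape)
    return pts2d.reshape(-1, 2).copy(), pts3d.reshape(-1, 3)

pixels = np.array([[10.0, 20.0], [30.0, 40.0]])
q_front = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
q_back = np.array([[0.0, 0.0, 2.0], [1.0, 3.0, 1.0]])
pts2d, pts3d = build_ultra_dense_correspondences(pixels, q_front, q_back, n_mid=3)
```

The two returned arrays have the layout expected by a RANSAC-PnP routine such as OpenCV's `cv2.solvePnPRansac` (object points, image points, camera matrix, distortion coefficients, optionally with `flags=cv2.SOLVEPNP_EPNP`), provided the pixel coordinates are mapped back from the crop to the full image.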

Paper Structure

This paper contains 23 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Pose estimation based on ultra-dense 2D-3D correspondences. For the input image, the network predicts the object mask and the coordinates of the object’s front and back surfaces. Based on the predicted coordinates, we densely sample 3D coordinates between the front and back surfaces to construct ultra-dense 2D-3D correspondences, and use the RANSAC-PnP solver [epnp] to compute the object pose. Here, $\tilde{Q}_f$, $\tilde{Q}_b$, and $\tilde{Q}_m$ denote the sets of 3D coordinates for the object's front surface, back surface, and interior, respectively. Their corresponding sets of 2D coordinates are denoted $\tilde{P}_f$, $\tilde{P}_b$, and $\tilde{P}_m$.
  • Figure 2: The inference pipeline. Our method begins by cropping a raw image based on the results of 2D object detection. The cropped image is then fed to neural networks, which predict both the object mask and the coordinates of the front and back surfaces. To represent surface coordinates efficiently and accurately, we propose HCCE, which encodes surface coordinates as multi-level continuous codes that are then learned by the neural networks. During inference, the predicted multi-level continuous codes are first converted into binary codes, which are subsequently used to decode the surface coordinates (illustrated for the front surface coordinates in the figure). Using the predicted front and back surface coordinates, we densely sample 3D coordinates between the two surfaces to construct ultra-dense 2D-3D correspondences. From these correspondences, the method applies RANSAC-PnP [epnp] to compute the object’s pose.
  • Figure 3: The hierarchical coordinate encoding method. In Hierarchical Binary Coordinate Encoding (HBCE), we observed that neural networks struggle to learn codes near the edges between dark and light stripes. To eliminate these stripes, we propose Hierarchical Continuous Coordinate Encoding (HCCE), which encodes coordinate components as multi-level continuous codes. Here, $C_1$ to $C_4$ represent the hierarchical continuous encoding of the object from the first to the fourth level, respectively. In addition, Continuous Coordinate Encoding (CCE) is an encoding method widely used by approaches such as CDPN [bib3], GDR-Net [bib43], and Pix2Pose [bib2].
  • Figure 4: Accuracy of surface coordinates. Percentage of correctly predicted coordinates under thresholds of 2%, 5%, and 10% of the object diameter.
  • Figure 5: Weight adjustments across different training epochs.
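
The accuracy measure named in the Figure 4 caption can be sketched as below. It assumes the per-pixel error is the Euclidean distance between predicted and ground-truth 3D coordinates and that a prediction counts as correct when this error is strictly below the given fraction of the object diameter. The function name `coordinate_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def coordinate_accuracy(pred, gt, diameter, thresholds=(0.02, 0.05, 0.10)):
    """Percentage of coordinates whose error is below each threshold.

    pred, gt   : (N, 3) predicted and ground-truth 3D coordinates
    diameter   : object diameter, in the same unit as the coordinates
    thresholds : fractions of the diameter (2%, 5%, 10% by default)
    """
    # Assumed error measure: per-point Euclidean distance.
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return {t: 100.0 * float(np.mean(err < t * diameter)) for t in thresholds}

# Toy data: four points with errors of 1%, 3%, 7%, and 20% of the diameter.
gt = np.zeros((4, 3))
pred = np.array([[0.01, 0.0, 0.0],
                 [0.03, 0.0, 0.0],
                 [0.07, 0.0, 0.0],
                 [0.20, 0.0, 0.0]])
acc = coordinate_accuracy(pred, gt, diameter=1.0)
```

Here `acc` maps each threshold fraction to a percentage, so `acc[0.05]` is the share of points whose error is under 5% of the diameter.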