GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, Vincent Lepetit

TL;DR

GigaPose addresses the need for fast, robust CAD-based coarse 6D pose estimation of novel objects from RGB images by decoupling the pose into out-of-plane rotation captured with 162 templates and the remaining four DoFs recovered from patch correspondences. A ViT-based ${\bf F}_{\text{ae}}$ learns local features via local-contrastive training to match templates to input segments, while ${\bf F}_{\text{ist}}$ and two lightweight MLPs predict 2D scale $s$, in-plane rotation $\alpha$, and 2D translation from 2D–2D matches; ${\bf M}_{t \rightarrow q}$ is refined with a RANSAC loop. On seven core BOP datasets, GigaPose achieves state-of-the-art accuracy and is significantly faster—about a $35\times$ speedup over MegaPose for coarse estimation—while exhibiting enhanced robustness to segmentation errors. The method remains compatible with refinement networks and can leverage 3D models predicted from a single image (Wonder3D), reducing the CAD-model burden and making real-time 6D pose estimation of novel objects more practical for industrial deployment.
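
As a rough illustration of how one 2D–2D match can fix the four remaining degrees of freedom, the sketch below builds a 2×3 similarity transform playing the role of ${\bf M}_{t \rightarrow q}$ from a single correspondence, a relative scale $s$, and an in-plane rotation $\alpha$. The function names and the rotate-scale-then-translate parameterization are assumptions made for this example, not taken from the released code.

```python
import math

def affine_from_correspondence(p_t, p_q, scale, alpha):
    """Build a 2x3 similarity transform that maps template point p_t onto query point p_q.

    scale and alpha stand in for the predicted 2D scale and in-plane rotation;
    the translation is chosen so that the correspondence is satisfied exactly.
    """
    c = scale * math.cos(alpha)
    s = scale * math.sin(alpha)
    tx = p_q[0] - (c * p_t[0] - s * p_t[1])
    ty = p_q[1] - (s * p_t[0] + c * p_t[1])
    return [[c, -s, tx], [s, c, ty]]

def apply_affine(M, p):
    """Apply a 2x3 affine transform to a 2D point."""
    x, y = p
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

# Toy values (hypothetical): a 90-degree in-plane rotation and a 2x scale change.
M = affine_from_correspondence((10.0, 20.0), (50.0, 80.0), 2.0, math.pi / 2)
mapped_anchor = apply_affine(M, (10.0, 20.0))     # the matched template point
mapped_neighbor = apply_affine(M, (11.0, 20.0))   # one pixel to its right
```

The matched template point lands on its query location by construction, and every other template pixel follows from the same scale and rotation, which is why a single correspondence suffices in this parameterization.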

Abstract

We present GigaPose, a fast, robust, and accurate method for CAD-based novel object pose estimation in RGB images. GigaPose first leverages discriminative "templates", rendered images of the CAD models, to recover the out-of-plane rotation and then uses patch correspondences to estimate the four remaining parameters. Our approach samples templates in only a two-degrees-of-freedom space instead of the usual three and matches the input image to the templates using fast nearest-neighbor search in feature space, resulting in a speedup factor of 35x compared to the state of the art. Moreover, GigaPose is significantly more robust to segmentation errors. Our extensive evaluation on the seven core datasets of the BOP challenge demonstrates that it achieves state-of-the-art accuracy and can be seamlessly integrated with existing refinement methods. Additionally, we show the potential of GigaPose with 3D models predicted by recent work on 3D reconstruction from a single image, relaxing the need for CAD models and making 6D object pose estimation much more convenient. Our source code and trained models are publicly available at https://github.com/nv-nguyen/gigaPose
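
The template-matching step described above can be pictured with a minimal sketch of nearest-neighbor retrieval in feature space. The scoring rule used here (mean, over query patches, of the best cosine similarity to any template patch) is an assumption chosen for illustration; the paper's actual similarity metric is defined in its own section and is not reproduced here. All function names and the toy descriptors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two descriptors given as equal-length sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def template_score(query_patches, template_patches):
    """Assumed score: average best-match cosine similarity over the query patches."""
    best = [max(cosine(q, t) for t in template_patches) for q in query_patches]
    return sum(best) / len(best)

def nearest_template(query_patches, templates):
    """Return (index, score) of the template that best matches the query."""
    scores = [template_score(query_patches, t) for t in templates]
    idx = max(range(len(templates)), key=scores.__getitem__)
    return idx, scores[idx]

# Toy data: two templates with two 2-D patch descriptors each, one query patch.
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
query = [[2.0, 2.0]]
best_idx, best_score = nearest_template(query, templates)
```

Because the templates cover only the out-of-plane rotation, a search of this kind runs over a small, fixed set per object (162 templates in GigaPose), and the template descriptors can be computed once at onboarding time.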

Paper Structure (32 sections, 18 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Comparison of our method GigaPose with MegaPose \cite{megapose}. GigaPose is (i) more robust to noisy segmentation, which is often caused by occlusions, (ii) more accurate, with a 3.5% average precision improvement on the BOP benchmark \cite{sundermeyer2023bop}, and (iii) significantly faster, with a speedup factor of 35$\times$ per detection for the coarse object pose estimation stage (0.048 s vs 1.68 s). The left example compares the results using accurate 3D models, while the right example shows the results with 3D models predicted from a single image by Wonder3D \cite{long2023wonder3d}. The bottom row shows the input segmentation and the depth error heatmap of each detected object with respect to the ground-truth pose, i.e., the distance between each 3D point in the ground-truth depth map and its position under the predicted pose (legend: 0 cm to 10 cm).
  • Figure 2: Overview. We first onboard each novel object by rendering 162 templates spanning the spectrum of out-of-plane rotations, and we extract dense features from each template using ${\bf F}_\text{ae}$. At runtime, given a query image segmented with CNOS \cite{nguyen2023cnos}, we process it (masking the background, cropping to the segment, adding padding, then resizing) and extract features with ${\bf F}_\text{ae}$. We retrieve the nearest template to the segment using the similarity metric detailed in Section \ref{sec:azimuthElevation}. The 2D scale and in-plane rotation are then computed from a single 2D-2D correspondence using ${\bf F}_\text{ist}$ and two lightweight MLPs. The 2D position of the correspondence also gives us the 2D translation, which is combined with the 2D scale and in-plane rotation to create the affine transformation ${\bf M}_{t \rightarrow q}$ mapping the nearest template to the query image. This enables us to recover the complete 6D object pose from a single correspondence. Finally, we use RANSAC to robustly find the best pose candidate. Onboarding takes 11.5 seconds per object and inference takes 48 milliseconds per detection on average.
  • Figure 3: Contrastive training of ${\bf F}_\text{ae}$. We use pairs made of a query image and a template to train a network using local contrastive learning, as detailed in Section \ref{sec:azimuthElevation}. Middle: Training samples provided by \cite{megapose}, and the 2D-2D correspondences created from ground-truth 3D information, which are used to generate positive and negative pairs. Right: We seek local features that vary with the out-of-plane rotation but are invariant to in-plane rotation and scaling. Thus, positive pairs are made of corresponding patches under scaling and in-plane rotation changes, and negative pairs are made of corresponding patches under different out-of-plane rotations, patches that do not correspond, or patches that come from different objects.
  • Figure 4: Qualitative results on LM-O \cite{brachmann-eccv14-learning6dobjectposeestimation}. The first column shows the ground-truth and CNOS \cite{nguyen2023cnos} segmentations. The second and third columns show the results without refinement for MegaPose \cite{megapose} and our method, with depth error heatmaps at the bottom. The last two columns compare the results of MegaPose \cite{megapose} and our method when both use the same refinement \cite{megapose}. In the error heatmaps, darker red indicates higher error with respect to the ground-truth pose (legend: 0 cm to 10 cm). As this figure shows, our method estimates a more accurate coarse pose and avoids local minima during refinement, for example with the white "watering_can" object from LM-O.
  • Figure 5: 3D reconstruction by Wonder3D \cite{long2023wonder3d}. The first row displays the input reference image, and the second shows the predicted normal maps from the view opposite to the reference image. More visualizations are provided in the supplementary material.
  • ...and 8 more figures
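
The final RANSAC step mentioned in the Figure 2 caption can be sketched as a hypothesize-and-verify loop in which every hypothesis comes from one correspondence. This is a minimal sketch under stated assumptions: because each hypothesis needs only one match, the loop below tries every match in turn instead of sampling at random, and it scores hypotheses by counting matches whose transfer error falls under a pixel threshold. The function names, the `inlier_px` threshold, and the toy matches are hypothetical and not taken from the released code.

```python
import math

def similarity_from_match(p_t, p_q, scale, alpha):
    """2x3 similarity transform mapping template point p_t onto query point p_q."""
    c = scale * math.cos(alpha)
    s = scale * math.sin(alpha)
    return [[c, -s, p_q[0] - (c * p_t[0] - s * p_t[1])],
            [s, c, p_q[1] - (s * p_t[0] + c * p_t[1])]]

def transfer_error(M, p_t, p_q):
    """Pixel distance between M applied to p_t and the observed query point p_q."""
    x = M[0][0] * p_t[0] + M[0][1] * p_t[1] + M[0][2]
    y = M[1][0] * p_t[0] + M[1][1] * p_t[1] + M[1][2]
    return math.hypot(x - p_q[0], y - p_q[1])

def best_hypothesis(matches, inlier_px=2.0):
    """Try the transform proposed by each match and keep the one with most inliers.

    Each match is (p_t, p_q, scale, alpha): a template point, its query point,
    and the scale / in-plane rotation predicted for that pair.
    """
    best_M, best_inliers = None, -1
    for p_t, p_q, scale, alpha in matches:
        M = similarity_from_match(p_t, p_q, scale, alpha)
        inliers = sum(transfer_error(M, mt, mq) <= inlier_px
                      for mt, mq, _, _ in matches)
        if inliers > best_inliers:
            best_M, best_inliers = M, inliers
    return best_M, best_inliers

# Toy matches: three agree with q = 2 * t + (5, 5); the last one is an outlier.
matches = [
    ((0.0, 0.0), (5.0, 5.0), 2.0, 0.0),
    ((10.0, 0.0), (25.0, 5.0), 2.0, 0.0),
    ((0.0, 10.0), (5.0, 25.0), 2.0, 0.0),
    ((3.0, 3.0), (100.0, 100.0), 1.0, 0.5),
]
M_best, n_inliers = best_hypothesis(matches)
```

A random-sampling variant would draw a subset of matches rather than enumerating all of them; the verification step stays the same.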