FoundPose: Unseen Object Pose Estimation with Foundation Features

Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, Tomas Hodan

TL;DR

This paper tackles RGB-only 6D pose estimation of unseen objects with minimal onboarding by leveraging foundation-model features. It introduces FoundPose, a training-free, model-based pipeline that renders RGB-D templates, extracts DINOv2 patch descriptors, and matches those descriptors between the query image and the templates to establish 2D-3D correspondences. The pose is then estimated from the correspondences with PnP-RANSAC and refined by featuremetric alignment. Key contributions include an efficient bag-of-words template retrieval, a lightweight template-based representation with a 25x lower memory footprint, and the demonstrated importance of intermediate-layer DINOv2 descriptors for symmetric and textureless objects. On seven core BOP datasets, the method achieves RGB-only state-of-the-art results and can be combined with an additional render-and-compare refinement to further improve accuracy. Offline onboarding takes under 5 minutes. These results suggest that foundation-model features can revive efficient classical CV pipelines for scalable unseen-object pose estimation.

Abstract

We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contrast, existing methods typically pre-train on large-scale, task-specific datasets in order to generalize to new objects and to bridge the image-to-model domain gap. We demonstrate that such generalization capabilities can be observed in a recent vision foundation model trained in a self-supervised manner. Specifically, our method estimates the object pose from image-to-model 2D-3D correspondences, which are established by matching patch descriptors from the recent DINOv2 model between the image and pre-rendered object templates. We find that reliable correspondences can be established by kNN matching of patch descriptors from an intermediate DINOv2 layer. Such descriptors carry stronger positional information than descriptors from the last layer, and we show their importance when semantic information is ambiguous due to object symmetries or a lack of texture. To avoid establishing correspondences against all object templates, we develop an efficient template retrieval approach that integrates the patch descriptors into the bag-of-words representation and can promptly propose a handful of similar-looking templates. Additionally, we apply featuremetric alignment to compensate for discrepancies in the 2D-3D correspondences caused by coarse patch sampling. The resulting method noticeably outperforms existing RGB methods for refinement-free pose estimation on the standard BOP benchmark with seven diverse datasets and can be seamlessly combined with an existing render-and-compare refinement method to achieve RGB-only state-of-the-art results. Project page: evinpinar.github.io/foundpose.
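
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `match_patches`, the use of cosine similarity, the choice of k = 1, and the random toy data are all assumptions made for illustration. Real inputs would be DINOv2 patch descriptors of the query crop and of a template, together with the 3D point registered for each template patch.

```python
import numpy as np

def match_patches(query_desc, template_desc, template_xyz, query_uv):
    """Nearest-neighbor (k = 1) matching of query patches to template patches.

    Hypothetical helper, assuming cosine similarity as the matching score.
    query_desc:    (Nq, D) descriptors of patches in the query crop
    template_desc: (Nt, D) descriptors of patches in one template
    template_xyz:  (Nt, 3) 3D point registered for each template patch
    query_uv:      (Nq, 2) 2D patch centers in the query crop
    Returns (points_2d, points_3d, similarity) with one row per query patch.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    q = query_desc / np.maximum(np.linalg.norm(query_desc, axis=1, keepdims=True), 1e-12)
    t = template_desc / np.maximum(np.linalg.norm(template_desc, axis=1, keepdims=True), 1e-12)
    sim = q @ t.T                    # (Nq, Nt) similarity matrix
    nn = sim.argmax(axis=1)          # index of the best template patch per query patch
    # Each query patch center is paired with the 3D point of its matched template patch.
    return query_uv, template_xyz[nn], sim[np.arange(len(q)), nn]

# Toy usage: random data stands in for real descriptors and 3D points.
rng = np.random.default_rng(0)
template_desc = rng.normal(size=(8, 16))
template_xyz = rng.normal(size=(8, 3))
query_desc = template_desc[::-1] + 0.01 * rng.normal(size=(8, 16))
query_uv = rng.uniform(0, 224, size=(8, 2))
points_2d, points_3d, sim = match_patches(query_desc, template_desc, template_xyz, query_uv)
print(points_2d.shape, points_3d.shape)
```

The resulting 2D-3D pairs are the kind of input a PnP-RANSAC solver expects, for example OpenCV's `cv2.solvePnPRansac` given the camera intrinsics. Which solver and thresholds FoundPose uses is not stated in this summary.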

Paper Structure (13 sections, 1 equation, 4 figures, 2 tables)

This paper contains 13 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Bridging synthetic-to-real gap. Patch descriptors from an intermediate layer of DINOv2 [oquab2023dinov2] (top), a recent vision foundation model, are the key enabler of FoundPose. Thanks to the generalization capability of these descriptors, it is possible to establish reliable correspondences between a real query image (left) and a synthetic template (right) by a simple nearest-neighbor matching. The patch descriptors are colored by the top three components of a PCA space computed from descriptors of all object templates. Note that colors of the same object parts are consistent, despite the real-to-synthetic domain gap.
  • Figure 2: FoundPose overview. During a short onboarding stage, we render RGB-D templates showing the object in different orientations, extract DINOv2 patch descriptors [oquab2023dinov2, darcet2023vision] from the RGB channels and register the descriptors in 3D using the depth channel. At inference time, we crop the RGB query image around the object mask predicted by CNOS [nguyen2023cnos] and retrieve a small set of most similar templates using a bag-of-words approach (with words defined by k-means clusters of patch descriptors from all templates). For each retrieved template, a pose hypothesis is generated by PnP-RANSAC [fischler1981random, lepetit2009epnp] from 2D-3D correspondences established by matching patch descriptors of the image crop and the template. Finally, the pose hypothesis with the highest number of inlier correspondences is refined by featuremetric alignment.
  • Figure 3: Visualization of DINOv2 patch descriptors. Shown are the top three PCA components of patch descriptors from different layers of DINOv2 ViT-L [darcet2023vision], for a textured object from YCB-V [xiang2018posecnn] (top) and a symmetric and texture-less object from T-LESS [hodan2017tless] (bottom). As observed in [amir2021deep] and also clearly visible in these visuals, the patch descriptors contain gradually less positional and more semantic information when going from shallower to deeper layers: the different coloring of object sides (red left vs. yellow right) in Layer 13 gradually blends to a solid color (orange) in Layer 23. FoundPose performs best with descriptors from Layer 18, which presumably provides the right information mix. We observed that these descriptors produce geometrically consistent correspondences even on symmetric and texture-less objects: when the semantic information is ambiguous (due to symmetries or a lack of texture), the positional information prioritizes matching patches from the same object side.
  • Figure 4: Example FoundPose results on HB, LM-O, IC-BIN, TUD-L, ITODD and T-LESS datasets, showing that our method can handle a broad range of objects, including textured, texture-less and symmetric ones. Each example shows the query image crop with the CNOS mask in white (top left), retrieved templates (middle row), matched patch descriptors of the crop and the template that led to the top-quality pose estimate (bottom row), and the contour of the ground-truth pose in red, the coarse pose in blue, and the refined pose in green (top right).
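
The bag-of-words retrieval mentioned in the Figure 2 caption (words defined by k-means clusters of template patch descriptors) can be sketched as follows. This is an illustrative toy version under stated assumptions, not the authors' code: the helper names (`kmeans`, `assign_words`, `bow_vector`, `retrieve_templates`), the plain L2-normalized word histograms, the cosine scoring, and the synthetic data are all hypothetical. Any term weighting the paper may apply is not described in this summary and is left out.

```python
import numpy as np

def assign_words(x, centers):
    """Index of the nearest cluster center (visual word) for each descriptor."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (k, D) centers used as visual words."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign_words(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:   # keep the previous center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_vector(desc, centers):
    """L2-normalized histogram of visual-word counts for one image or template."""
    hist = np.bincount(assign_words(desc, centers), minlength=len(centers)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def retrieve_templates(query_desc, template_descs, centers, top_k=5):
    """Rank templates by cosine similarity of their bag-of-words vectors to the query."""
    q = bow_vector(query_desc, centers)
    t = np.stack([bow_vector(d, centers) for d in template_descs])
    scores = t @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy usage: four synthetic "templates", each with 30 descriptors of dimension 8.
rng = np.random.default_rng(0)
word_means = 5.0 * rng.normal(size=(4, 8))
templates = [word_means[i] + 0.1 * rng.normal(size=(30, 8)) for i in range(4)]
centers = kmeans(np.concatenate(templates), k=4)
query = templates[2] + 0.01 * rng.normal(size=(30, 8))
ids, scores = retrieve_templates(query, templates, centers, top_k=2)
print(ids, scores)
```

Comparing one short vector per template is much cheaper than matching every query patch against every patch of every template, which is the motivation the abstract gives for retrieving only a handful of templates before establishing correspondences.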