Diffusion Features for Zero-Shot 6DoF Object Pose Estimation

Bernd Von Gimborn, Philipp Ausserlechner, Markus Vincze, Stefan Thalhammer

TL;DR

A template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented and the efficacy of the proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation.

Abstract

Zero-shot object pose estimation retrieves object poses from images without object-specific training. Recent approaches rely on vision foundation models (VFMs), pre-trained models that serve as general-purpose feature extractors. The characteristics of these VFMs vary with the training data, network architecture, and training paradigm. The prevailing choice in this field is the self-supervised Vision Transformer (ViT). This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation. To compare the two model families on common ground, we adopt and modify a recent approach, resulting in a template-based multi-staged method for zero-shot pose estimation using LDMs. The proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation. The experiments demonstrate an Average Recall improvement of up to 27% over the ViT baseline. The source code is available at: https://github.com/BvG1993/DZOP.

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: DZOP overview. Pose estimation expects a scene-level RGB image and a mesh of the object of interest. The features of the query and template images are extracted using Stable Diffusion [Rombach_2022_CVPR]. Templates are matched using the cosine similarity on the feature maps of the second decoder layer of the U-Net [ronneberger2015u]. Semantic correspondences are estimated from clustered hyperfeatures. Ultimately, geometric correspondences are derived and poses are estimated using Perspective-n-Point (PnP) [hartley2003multiple].
  • Figure 2: Semantic correspondence estimation. First, the query and template features are co-projected to a lower-dimensional space. Subsequently, corresponding clusters are created using cosine similarity. Ultimately, features are matched within corresponding clusters and refined to sub-pixel accuracy.
  • Figure 3: Per-object AR. Reported are the per-object improvements of DZOP in comparison to ZS6D.
  • Figure 4: AR over error tolerance threshold. Presented are the AR values for different upper bounds of $\theta$.
  • Figure 5: Qualitative results. Example pose estimates from DZOP and ZS6D. Blue, green, and red boxes indicate ground truth, correct, and incorrect pose estimates, respectively.
  • ...and 1 more figure
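
The template-matching step named in the Figure 1 caption (cosine similarity between query and template feature maps) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_template`, the array shapes, and the choice to flatten each feature map into a single vector before comparison are assumptions made for illustration, and random arrays stand in for the Stable Diffusion decoder features.

```python
import numpy as np


def match_template(query_feat, template_feats, eps=1e-12):
    """Pick the template whose feature map is most similar to the query.

    query_feat:     array of shape (C, H, W), hypothetical query feature map.
    template_feats: array of shape (N, C, H, W), one feature map per template.
    Returns (best_index, scores), where scores[i] is the cosine similarity
    between the flattened query and the flattened i-th template.
    """
    # Assumption: each feature map is flattened into one descriptor vector.
    q = np.asarray(query_feat, dtype=np.float64).reshape(-1)
    t = np.asarray(template_feats, dtype=np.float64)
    t = t.reshape(t.shape[0], -1)

    # L2-normalise so that the dot product equals the cosine similarity.
    q = q / (np.linalg.norm(q) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)

    scores = t @ q
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    templates = rng.standard_normal((5, 8, 4, 4))  # stand-in template features
    query = 2.0 * templates[3]                     # scaled copy of template 3
    best, scores = match_template(query, templates)
    print(best, np.round(scores, 3))
```

Because cosine similarity ignores vector magnitude, the scaled copy of a template still matches that template with a score of 1. In a pipeline like the one in Figure 1, the retrieved template would then supply the candidate view from which correspondences are estimated.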