Object Pose Estimation via the Aggregation of Diffusion Features

Tianfu Wang, Guosheng Hu, Hongguang Wang

TL;DR

This work proposes three distinct architectures that capture and aggregate diffusion features of different granularity, greatly improving the generalizability of object pose estimation and outperforming state-of-the-art methods by a considerable margin on three popular benchmark datasets.

Abstract

Estimating the pose of objects from images is a crucial task of 3D scene understanding, and recent approaches have shown promising results on very large benchmarks. However, these methods experience a significant performance drop when dealing with unseen objects. We believe this results from the limited generalizability of image features. To address this problem, we conduct an in-depth analysis of the features of diffusion models, e.g. Stable Diffusion, which hold substantial potential for modeling unseen objects. Based on this analysis, we introduce these diffusion features to object pose estimation. To achieve this, we propose three distinct architectures that can effectively capture and aggregate diffusion features of different granularity, greatly improving the generalizability of object pose estimation. Our approach outperforms the state-of-the-art methods by a considerable margin on three popular benchmark datasets: LM, O-LM, and T-LESS. In particular, our method achieves higher accuracy than the previous best methods on unseen objects: 97.9% vs. 93.5% on Unseen LM and 85.9% vs. 76.3% on Unseen O-LM, showing the strong generalizability of our method. Our code is released at https://github.com/Tianfu18/diff-feats-pose.

Paper Structure (32 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Unseen object pose estimation with one state-of-the-art method, template-pose nguyen2022templates, and with our method. (a) We render an image using the ground truth pose of an unseen object. (b) Template-pose nguyen2022templates learns image features by fine-tuning a self-supervised learning he2020momentum pre-trained model. (c) Our method aggregates features of different granularity from a diffusion model to achieve better pose estimation than template-pose nguyen2022templates on the unseen object. The snowflake and flame symbols represent 'parameters frozen' and 'fine-tune', respectively.
  • Figure 2: Feature visualization on LINEMOD. For each query and template image, we visualize three features: one from template-pose nguyen2022templates, and one each from Layer 5 and Layer 12 of a diffusion model. The features are projected into a PCA space, and the values of the top 3 principal components are assigned to the R, G, and B channels for visualization. The more similar the colors of the two feature images in one column, the more similar the features are in feature space. The two features in one green box are very similar.
  • Figure 3: Diffusion aggregation methods. Arch. (a) is a vanilla aggregation, Arch. (b) is nonlinear aggregation, and Arch. (c) is context-aware weight aggregation.
  • Figure 4: Ablation on timestep $t$. Accuracy is measured on the LM and O-LM datasets for models that extract features from Stable Diffusion at different timesteps.
  • Figure 5: Qualitative results on unseen objects of O-LM (left) and T-LESS (right).
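
The PCA-to-RGB visualization described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `pca_rgb_pair`, the (H, W, C) feature layout, the joint PCA fit over query and template, and the per-channel min-max rescaling are all assumptions made here so that colors are comparable between the two images.

```python
import numpy as np


def pca_rgb_pair(query: np.ndarray, template: np.ndarray):
    """Map two (H, W, C) feature maps to RGB images via a shared PCA basis.

    Hypothetical helper: fitting PCA jointly on both maps is an assumption,
    chosen so that similar features receive similar colors in both images.
    """
    hq, wq, c = query.shape
    ht, wt, _ = template.shape

    # Stack all spatial locations of both maps into one (N, C) matrix.
    stacked = np.concatenate(
        [query.reshape(-1, c), template.reshape(-1, c)], axis=0
    ).astype(np.float64)
    stacked -= stacked.mean(axis=0, keepdims=True)

    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    proj = stacked @ vt[:3].T  # top 3 components, shape (N, 3)

    # Rescale each component to [0, 1] so it can be used as a color channel.
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)

    n_query = hq * wq
    return rgb[:n_query].reshape(hq, wq, 3), rgb[n_query:].reshape(ht, wt, 3)


if __name__ == "__main__":
    # Random arrays stand in for real feature maps in this demo.
    rng = np.random.default_rng(0)
    q_rgb, t_rgb = pca_rgb_pair(
        rng.normal(size=(8, 8, 16)), rng.normal(size=(8, 8, 16))
    )
    print(q_rgb.shape, t_rgb.shape)
```

The returned arrays can be passed to any image viewer that accepts float RGB values in [0, 1]. Fitting the basis separately per image would also produce valid pictures, but the colors would then no longer be directly comparable across the query and the template.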