Learning a Category-level Object Pose Estimator without Pose Annotations

Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang

TL;DR

This work tackles category-level 3D object pose estimation without pose annotations by leveraging diffusion-based view generation to create pose-controlled training data from unannotated images. It introduces an image encoder to filter artifacts from diffusion outputs and trains per-instance neural meshes with a canonicalization-based merging strategy to form a category-level representation. A differentiable render-and-compare pipeline is used for evaluation, enabling pose optimization on new images. The approach achieves state-of-the-art results in few-shot settings and strong performance without pose annotations on Pascal3D+ and KITTI, highlighting a scalable path toward pose learning across large object categories.

Abstract

3D object pose estimation is a challenging task. Previous works typically require thousands of object images with annotated poses to learn the 3D pose correspondence, and labeling them is laborious and time-consuming. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under controlled pose differences and learn our object pose estimator from those images. Directly using the original diffusion model leads to images with noisy poses and artifacts. To tackle this issue, we first exploit an image encoder, learned with a specially designed contrastive pose learning scheme, to filter out unreasonable details and extract image feature maps. Additionally, we propose a novel learning strategy that allows the model to learn object poses from those generated image sets without knowing the alignment of their canonical poses. Experimental results show that our method is capable of category-level object pose estimation in a single-shot setting (with the single shot used as the pose definition), while significantly outperforming other state-of-the-art methods on few-shot category-level object pose estimation benchmarks.
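
The phrase "controlled pose differences" can be made concrete with a small sketch. The snippet below is illustrative only. It assumes each pose is an (azimuth, elevation) pair in degrees, that azimuth rotates about the vertical y axis and elevation about the x axis, and that the difference between two poses is measured as the geodesic angle between their rotation matrices. The function names, the axis conventions, and the sampling grid are assumptions made for this example, not details taken from the paper.

```python
import math

def matmul(p, q):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_from_az_el(az_deg, el_deg):
    """Rotation for a pose given as (azimuth, elevation) in degrees.

    Assumed convention (hypothetical): azimuth about the y axis first,
    then elevation about the x axis.
    """
    a, e = math.radians(az_deg), math.radians(el_deg)
    r_az = [[math.cos(a), 0.0, math.sin(a)],
            [0.0, 1.0, 0.0],
            [-math.sin(a), 0.0, math.cos(a)]]
    r_el = [[1.0, 0.0, 0.0],
            [0.0, math.cos(e), -math.sin(e)],
            [0.0, math.sin(e), math.cos(e)]]
    return matmul(r_el, r_az)

def angular_error_deg(r_pred, r_true):
    """Geodesic angle between two rotations, in degrees."""
    # trace(R_pred^T R_true) equals the sum of elementwise products.
    tr = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against tiny floating-point overshoots before acos.
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

# Hypothetical grid of target views relative to the input image, whose own
# pose is treated as (0, 0). The actual sampling used in the paper is not
# stated in this excerpt.
target_poses = [(az, el) for az in range(0, 360, 30) for el in (-15, 0, 15)]

zero_pose = rotation_from_az_el(0, 0)
target = rotation_from_az_el(30, 0)
print(round(angular_error_deg(zero_pose, target), 3))
```

Each entry of `target_poses` would be handed to the view generator together with the input image, and `angular_error_deg` is one common way to score a predicted pose against a reference pose.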

Paper Structure (27 sections, 12 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: We propose to learn a category-level object pose estimator from multiple views of the objects in a category. As shown in the figure, by leveraging a generative diffusion model, we generate novel views of an object with controlled poses $(\omega_{az}, \omega_{el})$ and learn the pose estimator from these views and their controlled poses.
  • Figure 2: Posed image generation. Given an image containing an object and the pose of a target view $(\omega_{az}, \omega_{el})$, we condition the diffusion model on the view pose to generate the target view of the object.
  • Figure 3: The training pipeline of our model. Given a set of object images, we define the object poses on these images as zero-poses (i.e., $(\omega_{az}, \omega_{el})=(0,0)$). Then given the target view poses, we exploit the generative diffusion model to generate the target views of the objects. We introduce an image encoder to extract the image feature maps of these view images. We exploit the image feature maps with the corresponding target poses to optimize the neural mesh for each object.
  • Figure 4: Neural mesh merging strategy and evaluation pipeline. After training, we estimate the relative pose between two meshes and merge the two neural meshes using the estimated pose. In the evaluation, we treat the pose of a novel image as a learnable matrix. We render the feature map from the merged neural mesh with the learnable pose and optimize the pose by comparing the rendered feature map with the extracted feature map of the novel image.
  • Figure 5: Qualitative results of training the pose estimator without pose annotations. We present four cases on the car, motorbike, aeroplane and bus categories, respectively. We use the CAD model of each category for better visualization and present the pose estimation errors at the top of the prediction images. Without requiring any pose annotations, our model still predicts object poses with low pose estimation errors.
  • ...and 1 more figure
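
The render-and-compare evaluation described in the Figure 4 caption can be illustrated with a toy example. This is a minimal sketch under strong simplifications, not the authors' implementation: the "neural mesh" is six hand-picked vertices with one scalar feature each, the renderer is a soft orthographic splat rather than a differentiable mesh renderer, only the azimuth is estimated, and a grid search stands in for optimizing a learnable pose matrix. All names and constants (`VERTICES`, `FEATURES`, `render`, `compare`, `estimate_azimuth`, the map size and `sigma`) are hypothetical.

```python
import math

# Toy "neural mesh": 3D vertices inside the unit sphere, one scalar feature
# each. A real neural mesh would store a feature vector per vertex.
VERTICES = [(0.6, 0.1, 0.2), (-0.5, 0.4, 0.3), (0.1, -0.6, 0.5),
            (0.3, 0.5, -0.4), (-0.2, -0.3, -0.7), (0.7, -0.2, -0.1)]
FEATURES = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def render(az_deg, size=12, sigma=0.15):
    """Soft-splat vertex features into a size x size map for one azimuth.

    Assumed conventions: rotation about the vertical (y) axis and an
    orthographic camera looking along the z axis, with no occlusion handling.
    """
    a = math.radians(az_deg)
    fmap = [[0.0] * size for _ in range(size)]
    for (x, y, z), f in zip(VERTICES, FEATURES):
        xr = math.cos(a) * x + math.sin(a) * z  # rotated x; y is unchanged
        for r in range(size):
            py = -1.0 + 2.0 * r / (size - 1)
            for c in range(size):
                px = -1.0 + 2.0 * c / (size - 1)
                d2 = (px - xr) ** 2 + (py - y) ** 2
                fmap[r][c] += f * math.exp(-d2 / (2.0 * sigma ** 2))
    return fmap

def compare(map_a, map_b):
    """Sum of squared differences between two feature maps."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(map_a, map_b)
               for a, b in zip(row_a, row_b))

def estimate_azimuth(target_map, step_deg=5):
    """Pick the azimuth whose rendering best matches the target map.

    Grid search is used here for simplicity; it replaces the gradient-based
    pose optimization described in the caption.
    """
    return min(range(0, 360, step_deg),
               key=lambda az: compare(render(az), target_map))

# Stand-in for the feature map an image encoder would extract from a novel image.
observed = render(40)
print(estimate_azimuth(observed))
```

In this toy setup the search recovers the azimuth used to produce `observed`, because the comparison loss is exactly zero there and positive at every other grid angle. The caption describes treating the pose as a learnable matrix and optimizing it against the extracted feature map, which this sketch only approximates with an exhaustive search.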