Learning a Category-level Object Pose Estimator without Pose Annotations
Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang
TL;DR
This work tackles category-level 3D object pose estimation without pose annotations by leveraging diffusion-based view generation to create pose-controlled training data from unannotated images. It introduces an image encoder to filter artifacts from diffusion outputs and trains per-instance neural meshes with a canonicalization-based merging strategy to form a category-level representation. A differentiable render-and-compare pipeline is used for evaluation, enabling pose optimization on new images. The approach achieves state-of-the-art results in few-shot settings and strong performance without pose annotations on Pascal3D+ and KITTI, highlighting a scalable path toward pose learning across large object categories.
Abstract
3D object pose estimation is a challenging task. Previous works typically require thousands of object images with annotated poses to learn the 3D pose correspondence, and producing these labels is laborious and time-consuming. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g., Zero-1-to-3) to generate sets of images under controlled pose differences and learn our object pose estimator from those images. Directly using the original diffusion model leads to images with noisy poses and artifacts. To tackle this issue, we first exploit an image encoder, learned with a specially designed contrastive pose learning scheme, to filter out unreasonable details and extract image feature maps. In addition, we propose a novel learning strategy that allows the model to learn object poses from the generated image sets without knowing the alignment of their canonical poses. Experimental results show that our method is capable of category-level object pose estimation in a single-shot setting (with the single shot serving as the pose definition), while significantly outperforming other state-of-the-art methods on few-shot category-level object pose estimation benchmarks.
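To make the phrase "controlled pose differences" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a view generator that is conditioned on a relative viewpoint change (azimuth and elevation offsets from the input view). The function names, the grid resolution, and the elevation values are hypothetical choices for illustration. It also shows why a generated image set fixes poses only up to an unknown per-instance offset: the absolute azimuth of each view depends on the canonical offset of the input image, which is not known without annotation.

```python
from itertools import product


def relative_pose_grid(n_azimuth=12, elevations_deg=(-30.0, 0.0, 30.0)):
    """Return (delta_azimuth, delta_elevation) pairs in degrees.

    Each pair is a viewpoint change relative to the input image, not an
    absolute pose. The grid resolution here is an assumed, illustrative value.
    """
    step = 360.0 / n_azimuth
    return [(i * step, e) for e, i in product(elevations_deg, range(n_azimuth))]


def absolute_azimuth(canonical_offset_deg, delta_azimuth_deg):
    """Absolute azimuth of a generated view, wrapped to [0, 360).

    canonical_offset_deg is the (normally unknown) azimuth of the input image
    in the category's canonical frame. Without pose annotations, only
    delta_azimuth_deg is known for each generated view.
    """
    return (canonical_offset_deg + delta_azimuth_deg) % 360.0


if __name__ == "__main__":
    grid = relative_pose_grid()
    print(len(grid), "relative viewpoints per object instance")
    # Two instances with different (unknown) canonical offsets share the same
    # relative grid but end up at different absolute poses.
    for offset in (0.0, 350.0):
        print(offset, [absolute_azimuth(offset, d) for d, _ in grid[:3]])
```

Under these assumptions, every instance shares the same relative grid, so differences between instances reduce to a single unknown offset per image set, which is the alignment the learning strategy has to handle.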
