Universal Features Guided Zero-Shot Category-Level Object Pose Estimation

Wentian Qu, Chenyu Meng, Heng Li, Jian Cheng, Cuixia Ma, Hongan Wang, Xiao Zhou, Xiaoming Deng, Ping Tan

TL;DR

The paper tackles zero-shot category-level object pose estimation by using multi-modal universal features from RGB-D data to generalize to unseen categories without fine-tuning. It introduces a coarse-to-fine pipeline that first uses 2D universal features to establish sparse correspondences for a coarse 6-DOF pose, then applies iterative refinement and a dense 3D universal-feature alignment to resolve pose–shape ambiguities. Core contributions include integrating 2D and 3D universal features (DINOv2, Stable Diffusion, and DGCNN) with an iterative correspondence strategy and a universal alignment loss that jointly optimizes pose and shape. On the REAL275 and Wild6D benchmarks, the method outperforms previous methods on unseen categories without category-specific training.

Abstract

Object pose estimation, crucial in computer vision and robotics applications, is challenged by the diversity of unseen categories. We propose a zero-shot method for category-level 6-DOF object pose estimation that exploits both 2D and 3D universal features of the input RGB-D image to establish semantic similarity-based correspondences and can be extended to unseen categories without additional model fine-tuning. Our method begins by combining efficient 2D universal features to find sparse correspondences between intra-category objects and obtain an initial coarse pose. To handle the degradation of 2D universal feature correspondences when the pose deviates substantially from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities caused by shape differences between intra-category objects, the coarse pose is refined by optimizing with a dense alignment constraint on 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
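To make "semantic similarity-based correspondences" concrete, here is a minimal sketch of one common way to turn per-keypoint feature vectors into sparse matches: cosine similarity followed by a mutual nearest-neighbor check. The abstract does not state the matching rule the authors use, so the mutual-nearest-neighbor criterion, the `min_sim` threshold, and the function names are assumptions made for illustration only. In practice the feature vectors would come from the 2D universal feature extractors; here they are plain Python lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero descriptor as matching nothing
    return dot / (norm_a * norm_b)


def mutual_nearest_neighbors(ref_feats, tgt_feats, min_sim=0.0):
    """Return (ref_index, tgt_index, similarity) for mutually best matches.

    Hypothetical helper: a pair is kept only if each keypoint is the other's
    most similar keypoint and the similarity is at least `min_sim`.
    """
    if not ref_feats or not tgt_feats:
        return []
    sim = [[cosine_similarity(r, t) for t in tgt_feats] for r in ref_feats]
    n_ref, n_tgt = len(ref_feats), len(tgt_feats)
    # Best target for every reference keypoint, and vice versa.
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_ref)]
    best_ref = [max(range(n_ref), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    matches = []
    for i, j in enumerate(best_tgt):
        if best_ref[j] == i and sim[i][j] >= min_sim:
            matches.append((i, j, sim[i][j]))
    return matches


# Toy descriptors: reference keypoint 2 is ambiguous and gets filtered out.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.0, 2.0], [3.0, 0.0]]
matches = mutual_nearest_neighbors(reference, target)
```

The mutual check discards one-sided matches, which is one simple way to keep the correspondence set sparse but reliable before any pose is fitted.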

Paper Structure (26 sections, 4 equations, 10 figures, 6 tables)


Figures (10)

  • Figure 1: (a) We propose a zero-shot pose estimation method for unseen categories that uses universal features and obtains accurate results in multi-category scenes. Our method is more cost-efficient and generalizes better than traditional instance-level and category-level methods. (b) Correspondences from universal features degrade when the pose gap is large. (c) The shape gap between objects causes pose ambiguity during optimization. These challenges affect the accuracy of pose estimation.
  • Figure 2: Overview. Our framework includes a keypoint-level coarse pose estimation module and a pixel-level pose refinement module. In the first module, we establish correspondences between image pairs based on the 2D universal features and calculate the coarse pose using least squares in an iterative manner. In the second module, we use pixel-level optimization combined with 3D universal features to refine the pose and the shape of the reference model to obtain the fine pose.
  • Figure 3: Feature Performance Drop and Effect of Iterative Estimation. When there are large pose differences between objects, the similarity of 2D universal features degrades. After iterative optimization, as the objects are gradually aligned, the correspondences between the objects become smoother, which supports the calculation of an accurate pose.
  • Figure 4: (a) Pose Refinement. With the coarse pose as initialization, the reference model is warped to the target space to obtain the initial mask and extract 3D universal features. We then optimize the coarse pose and shape by minimizing the loss function. (b) After the pose refinement stage, the pose and shape of the reference model are more accurately aligned with the target object.
  • Figure 5: Qualitative results on REAL275 and Wild6D. The red box represents the ground truth, and the green box represents the estimation. Previous methods exhibit large errors when applied to unseen categories due to the significant texture and shape differences. Our method demonstrates strong generalization on unseen categories with accurate pose estimation.
  • ...and 5 more figures
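
The Figure 2 caption says the coarse pose is calculated from correspondences "using least squares". As a rough illustration of what such a step can look like, the sketch below fits a rigid transform to matched 3D points with the standard SVD-based (Kabsch) solution. This is an assumption about the general technique, not the authors' implementation: it presumes the 2D correspondences have already been lifted to 3D using depth, it ignores scale, and the function name `fit_rigid_transform` is hypothetical. It requires NumPy.

```python
import numpy as np


def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Synthetic check: recover a known 30-degree rotation about z plus a shift.
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.1, -0.2, 0.3])
src_pts = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
dst_pts = src_pts @ R_true.T + t_true
R_est, t_est = fit_rigid_transform(src_pts, dst_pts)
```

In an iterative scheme like the one the caption describes, a fit of this kind would be repeated after correspondences are re-established on the progressively aligned objects; how that re-matching is done is not specified in this excerpt and is left out of the sketch.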