SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

Yamei Chen, Yan Di, Guangyao Zhai, Fabian Manhardt, Chenyangguang Zhang, Ruida Zhang, Federico Tombari, Nassir Navab, Benjamin Busam

TL;DR

SecondPose is a novel approach that integrates object-specific geometric features with semantic category priors from DINOv2. It achieves a 12.4% leap forward over the state-of-the-art on NOCS-REAL275 and still surpasses other competitors by a large margin on the more complex HouseCat6D dataset.

Abstract

Category-level object pose estimation, aiming to predict the 6D pose and 3D size of objects from known categories, typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue, we present SecondPose, a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 in providing SE(3)-consistent semantic features, we hierarchically extract two types of SE(3)-invariant geometric features to further encapsulate local-to-global object-specific information. These geometric features are then point-aligned with DINOv2 features to establish a consistent object representation under SE(3) transformations, facilitating the mapping from camera space to the pre-defined canonical space, thus further enhancing pose estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose achieves a 12.4% leap forward over the state-of-the-art. Moreover, on a more complex dataset HouseCat6D which provides photometrically challenging objects, SecondPose still surpasses other competitors by a large margin.
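
To make the notion of an SE(3)-invariant geometric feature concrete, here is a minimal sketch using the classic point pair feature (one distance and three angles between two oriented points). It is an illustration only, not the hierarchical feature used by SecondPose; the helper names `angle`, `point_pair_feature`, and `random_rotation` are hypothetical. The check applies a random rotation and translation to two oriented points and confirms the feature does not change.

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two 3D vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    """Classic point pair feature: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

def random_rotation(rng):
    """Random proper rotation (det = +1) from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs; q stays orthogonal
    if np.linalg.det(q) < 0:      # flip one column to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)

# Two oriented points: positions p1, p2 and unit normals n1, n2 (random, for illustration).
p1, p2 = rng.normal(size=3), rng.normal(size=3)
n1, n2 = rng.normal(size=3), rng.normal(size=3)
n1 /= np.linalg.norm(n1)
n2 /= np.linalg.norm(n2)

# A rigid transform in SE(3): points are rotated and translated, normals are only rotated.
R = random_rotation(rng)
t = rng.normal(size=3)

before = point_pair_feature(p1, n1, p2, n2)
after = point_pair_feature(R @ p1 + t, R @ n1, R @ p2 + t, R @ n2)
features_match = np.allclose(before, after)
```

Because the feature depends only on relative distances and angles, it is identical in camera space and in any rigidly transformed frame, which is the property the abstract relies on when aligning geometric features with DINOv2 features.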

Paper Structure (32 sections, 11 equations, 10 figures, 5 tables)


Figures (10)

  • Figure 1: Categorical SE(3)-consistent features. We visualize our fused features by PCA. Colored points highlight the most closely corresponding parts: our proposed feature achieves consistent alignment across instances (left vs. middle) and remains consistent for the same instance in different poses (middle vs. right).
  • Figure 2: Illustration of SecondPose. Semantic features are extracted using the DINOv2 model (A), and the HP-PPF feature is computed on the point cloud (B). These features, combined with RGB values, are fused into our SECOND feature $F_f$ (C) using stream-specific modules $L_s$, $L_g$, $L_c$, and a shared module $L_f$ for concatenated features. The resulting fused features, in conjunction with the point cloud, are utilized for pose estimation (D).
  • Figure 3: Hierarchical panel-based geometric features. The inner panel contains points that are close to the point of interest, and outer panels contain points far from the point of interest.
  • Figure 4: Qualitative comparison on REAL275 [wang2019normalized]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net [lin2023vinet]. Our approach achieves significantly higher precision in rotation estimation.
  • Figure 5: Qualitative comparison on HouseCat6D [jung2023housecat6d]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net [lin2023vinet].
  • ...and 5 more figures
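
The dual-stream fusion described in the Figure 2 caption can be sketched roughly as follows. Only the module names $L_s$, $L_g$, $L_c$, $L_f$ and the fused feature $F_f$ come from the caption. Everything else is an assumption made for illustration: each module is modeled as a single linear layer with ReLU, all feature dimensions are arbitrary, and the inputs are random placeholders rather than real DINOv2 or HP-PPF features.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_relu(in_dim, out_dim):
    """Hypothetical stand-in for a learned module: one linear layer followed by ReLU."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: np.maximum(x @ W + b, 0.0)

# Assumed sizes (not taken from the paper).
N = 128                          # number of points
D_SEM, D_GEO, D_RGB = 32, 16, 3  # per-point input widths of the three streams
D_STREAM, D_FUSED = 24, 64       # per-stream and fused output widths

# Stream-specific modules and the shared module, named as in the caption.
L_s = linear_relu(D_SEM, D_STREAM)         # semantic stream
L_g = linear_relu(D_GEO, D_STREAM)         # geometric stream
L_c = linear_relu(D_RGB, D_STREAM)         # RGB stream
L_f = linear_relu(3 * D_STREAM, D_FUSED)   # shared module on concatenated features

# Placeholder per-point inputs; row i of every array refers to the same 3D point.
semantic = rng.normal(size=(N, D_SEM))
geometric = rng.normal(size=(N, D_GEO))
rgb = rng.uniform(size=(N, D_RGB))

# Each stream is processed separately, then concatenated per point and fused.
concatenated = np.concatenate([L_s(semantic), L_g(geometric), L_c(rgb)], axis=1)
F_f = L_f(concatenated)
```

The point-wise alignment matters here: concatenation along the feature axis only makes sense if every stream provides one feature vector per point, in the same point order.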