GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation

Ziqin Huang, Gu Wang, Chenyangguang Zhang, Ruida Zhang, Xiu Li, Xiangyang Ji

TL;DR

GIVEPose tackles RGB-based category-level object pose estimation by addressing the intra-class variation that arises when pose is regressed from the NOCS map alone. It introduces the Intra-class Variation-Free Consensus (IVFC) map, derived from a category-consensus model, and a Deformable Convolutional Auto-Encoder (DCAE) that gradually eliminates instance-specific information from the NOCS map to produce the IVFC map. Pose is then regressed from the IVFC map combined with 2D ROI information, while object size is inferred from backbone features, enabling end-to-end RGB-only category-level pose estimation. Evaluations on CAMERA25, REAL275, and Wild6D show substantial improvements over prior RGB-based methods, and the approach handles intra-class variation and truncation robustly in real-world scenarios. Code is released to support reproducibility.

Abstract

Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.
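The many-to-one correspondence mentioned in the abstract can be illustrated with a toy example. The sketch below is not taken from the GIVEPose code; it assumes the usual NOCS convention (center the model at its tight bounding-box center, scale by the bounding-box diagonal, shift into the unit cube), and the helper names `box_corners` and `nocs_normalize` and the two box-shaped "instances" are hypothetical. It shows that two instances of one category, held in the same canonical pose, assign different NOCS coordinates to the same semantic point.

```python
import math

def box_corners(extent):
    """Eight corners of an axis-aligned box centred at the origin (a toy 'instance')."""
    hx, hy, hz = (e / 2 for e in extent)
    return [(sx * hx, sy * hy, sz * hz)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def nocs_normalize(points):
    """Assumed NOCS convention: centre at the tight bounding-box centre,
    divide by the bounding-box diagonal, then shift into the unit cube."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    diag = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    return [tuple((p[i] - center[i]) / diag + 0.5 for i in range(3)) for p in points]

# Two hypothetical instances of one category: a tall one and a wide one.
nocs_tall = nocs_normalize(box_corners((1.0, 2.0, 1.0)))
nocs_wide = nocs_normalize(box_corners((2.0, 1.0, 1.0)))

# The same semantic point (the +x, +y, +z corner) under the same canonical pose.
print("tall instance:", [round(c, 3) for c in nocs_tall[-1]])
print("wide instance:", [round(c, 3) for c in nocs_wide[-1]])
```

Because the two printed coordinates differ although the pose is identical, a regressor fed NOCS maps sees many distinct inputs for one category-level pose. A map rendered from a single category-consensus model would not carry this instance-specific variation.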

Paper Structure

This paper contains 24 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of different coordinate representations as intermediate supervision. (a) Category-level pose prediction from the NOCS map alone suffers from intra-class variation and redundancy effects. (b) The prediction of intermediate coordinate maps becomes intractable when utilizing the IVFC map exclusively, due to its lack of pixel-wise alignment with input images. (c) Our proposed gradual intra-class variation elimination strategy leverages the complementary advantages of both maps, enabling more precise category-level pose estimation.
  • Figure 2: Overview of our proposed GIVEPose. The core of our framework lies in the DCAE-based module, which bridges the NOCS map and our proposed IVFC map. Leveraging the NOCS map estimated from the backbone features, we employ a deformable convolutional encoder to selectively distill the information, thereby enabling the reconstruction of the IVFC map. This process gradually eliminates intra-class variations, ultimately yielding a more robust representation. The estimated IVFC map is fused with the 2D positional information of the Region of Interest (ROI) and passed to a lightweight rotation and translation ($R,t$) predictor. Concurrently, the object size is directly inferred from the backbone features.
  • Figure 3: Illustration of the relationship between the NOCS and IVFC maps. Both coordinate maps are generated from color-coded NOCS models via perspective projection. The key difference lies in their origins: the NOCS map is derived from instance-specific models, whereas the IVFC map is derived from a category-consensus model. The IVFC map represents a category-level shared coordinate system corresponding to category-level pose. Transforming NOCS maps to the IVFC map under the same pose eliminates intra-class variations.
  • Figure 4: Comparison of mean average precision under various thresholds for scale-agnostic rotation and translation in direct pose estimation from different ground-truth coordinate maps: NOCS map (red) vs. IVFC map (blue).
  • Figure 5: Qualitative comparison with LaPose [zhang2024lapose] and DMSR [wei2023rgb] on REAL275. Red and green boxes denote the GT and predicted results, respectively. For the axis projections, darker shades indicate the ground truth, while lighter shades correspond to the predicted results.
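Figure 4 reports precision under a range of rotation and translation thresholds. As a rough illustration of the rotation half of such a curve, the sketch below computes the geodesic angle between a predicted and a ground-truth rotation matrix and then the fraction of samples whose error falls under each threshold. This is a simplified stand-in, not the paper's evaluation code: it does not implement full mean average precision, it ignores the special treatment that category-level benchmarks typically give symmetric objects, and the function names, sample angles, and thresholds are all hypothetical.

```python
import math

def matmul_t(a, b):
    """Return a^T @ b for 3x3 matrices given as nested lists."""
    return [[sum(a[k][i] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees: arccos((trace(R_pred^T R_gt) - 1) / 2)."""
    rel = matmul_t(r_pred, r_gt)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (used only to build toy data)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def accuracy_under_thresholds(errors, thresholds):
    """Fraction of samples whose error is at or below each threshold."""
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]

# Toy data: ground truth is the identity; predictions are off by 2, 8 and 25 degrees.
gt = rot_z(0.0)
preds = [rot_z(a) for a in (2.0, 8.0, 25.0)]
errors = [rotation_error_deg(p, gt) for p in preds]
curve = accuracy_under_thresholds(errors, thresholds=[5.0, 10.0, 20.0])
print([round(e, 3) for e in errors], [round(c, 3) for c in curve])
```

Sweeping the thresholds over a fine grid and plotting `curve` gives a threshold-versus-accuracy plot of the same general shape as the curves compared in Figure 4.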