LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation

Ruida Zhang, Ziqin Huang, Gu Wang, Chenyangguang Zhang, Yan Di, Xingxing Zuo, Jiwen Tang, Xiangyang Ji

TL;DR

LaPose is a novel framework that models object shape as a Laplacian mixture model for pose estimation and introduces a scale-agnostic representation for object size and translation, improving training efficiency and overall robustness.

Abstract

While RGBD-based methods for category-level object pose estimation hold promise, their reliance on depth data limits their applicability in diverse scenarios. In response, recent efforts have turned to RGB-based methods; however, they face significant challenges stemming from the absence of depth information. On the one hand, the lack of depth exacerbates the difficulty in handling intra-class shape variation, resulting in increased uncertainty in shape predictions. On the other hand, RGB-only inputs introduce inherent scale ambiguity, rendering the estimation of object size and translation an ill-posed problem. To tackle these challenges, we propose LaPose, a novel framework that models the object shape as a Laplacian mixture model for Pose estimation. By representing each point as a probabilistic distribution, we explicitly quantify the shape uncertainty. LaPose leverages both a generalized 3D information stream and a specialized feature stream to independently predict the Laplacian distribution for each point, capturing different aspects of object geometry. These two distributions are then integrated as a Laplacian mixture model to establish the 2D-3D correspondences, which are used to solve the pose via the PnP module. To mitigate scale ambiguity, we introduce a scale-agnostic representation for object size and translation, enhancing training efficiency and overall robustness. Extensive experiments on the NOCS datasets validate the effectiveness of LaPose, yielding state-of-the-art performance in RGB-based category-level object pose estimation. Code is released at https://github.com/lolrudy/LaPose
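
The two-stream mixture described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes a single scalar coordinate, parameterizes each Laplace component by a scale b rather than the variance notation used in the paper, and uses a fixed mixing weight w. The function names and the weight are hypothetical.

```python
import math


def laplace_pdf(x: float, mu: float, b: float) -> float:
    """Density of a 1D Laplace distribution with location mu and scale b > 0."""
    if b <= 0.0:
        raise ValueError("scale b must be positive")
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def mixture_nll(
    x: float,
    mu_dino: float,
    b_dino: float,
    mu_conv: float,
    b_conv: float,
    w: float = 0.5,  # hypothetical mixing weight of the first component
) -> float:
    """Negative log-likelihood of x under a two-component Laplacian mixture."""
    p = w * laplace_pdf(x, mu_dino, b_dino) + (1.0 - w) * laplace_pdf(x, mu_conv, b_conv)
    return -math.log(p)
```

In this sketch a larger scale b flattens the component, so a point whose coordinate is uncertain is penalized less for a given error than a point predicted with a small b.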

Paper Structure (22 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Two main challenges of RGB-based category-level object pose estimation. (A) The lack of depth information exacerbates the difficulty in handling intra-class shape variation. The length of the camera lens is uncertain in the front view. (B) The RGB-only inputs introduce scale ambiguity. Laptops of various sizes have identical appearance in the image.
  • Figure 2: Method overview. i) Given an RGB image, we adopt a detector to crop the object of interest. The cropped image is then processed by ii) a generalized 3D information stream supported by DINOv2 and iii) a specialized feature stream utilizing a convolutional network, which extract the features $\mathcal{F}_{dino}$ and $\mathcal{F}_{conv}$. iv) The Laplacian mixture model of the NOCS coordinate map is obtained by combining the Laplacian distributions $Laplace(\mu_{dino}, \sigma_{dino}^2)$ and $Laplace(\mu_{conv}, \sigma_{conv}^2)$ predicted by the two streams. The subsequent PnP module $\Phi$ solves the translation and rotation from the 2D-3D correspondences established by the Laplacian mixture model. v) Meanwhile, the size head takes $\mathcal{F}_{dino}, \mathcal{F}_{conv}$ as input and predicts the object size. vi) Finally, the scale-agnostic 9DoF pose parameters are obtained.
  • Figure 3: (A) Illustration of scale ambiguity: Objects of various scales exhibit identical appearances in the image. We propose Scale-Agnostic Pose representation (SAP) by normalizing the scale such that the diagonal length of the object tight bounding box is 1. (B) Average Precision on 3D IoU under different thresholds with or without SAP.
  • Figure 4: Visualization of the predicted Laplacian distribution means $\mu$ and variances $\sigma^2$. In regions where NOCS errors are pronounced, the variance $\sigma^2$ tends to be higher.
  • Figure 5: Qualitative results of LaPose (green line) and DMSR (blue line) on NOCS-REAL275. Images (a)-(f) demonstrate 2D segmentation results.
  • ...and 1 more figure
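
The scale normalization described in the Figure 3 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the LaPose code: the size is taken to be the three extents of the tight bounding box, and the translation is assumed to be divided by the same diagonal length as the size. The function name and tuple layout are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def to_scale_agnostic(size: Vec3, translation: Vec3) -> Tuple[Vec3, Vec3, float]:
    """Normalize size and translation so the bounding-box diagonal is 1.

    Returns the normalized size, the normalized translation, and the
    original diagonal length (the scale factor that was divided out).
    """
    diag = math.sqrt(sum(s * s for s in size))
    if diag == 0.0:
        raise ValueError("bounding box must have non-zero extent")
    norm_size = tuple(s / diag for s in size)
    # Assumption: translation shares the scale factor of the size.
    norm_trans = tuple(t / diag for t in translation)
    return norm_size, norm_trans, diag


# Two laptops that differ only by a global scale factor of 2.
small = to_scale_agnostic((0.3, 0.2, 0.02), (0.1, 0.0, 1.0))
large = to_scale_agnostic((0.6, 0.4, 0.04), (0.2, 0.0, 2.0))
```

Under these assumptions the two laptops, which look identical in the image, map to the same normalized size and translation and differ only in the returned diagonal. That is the sense in which the representation is scale-agnostic.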