Object Gaussian for Monocular 6D Pose Estimation from Sparse Views

Luqing Luo, Shichu Sun, Jiangang Yang, Linfang Zheng, Jinwei Du, Jian Liu

TL;DR

SGPose removes the dependence on CAD models: starting from sparse input views and a random initialization, it reconstructs an object model and regresses dense 2D-3D correspondences between images and that model. Geometric-consistent depth supervision and online synthetic view warping are key to its success.

Abstract

Monocular object pose estimation, a pivotal task in computer vision and robotics, depends heavily on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advances in 3D Gaussian Splatting (3DGS) show compelling potential. Yet 3DGS performance degrades, and the model tends to overfit, when fewer input views are available. Embracing this challenge, we introduce SGPose, a novel framework for sparse-view object pose estimation using Gaussian-based methods. Given as few as ten views, SGPose generates a geometry-aware representation by starting from a random cuboid initialization, eschewing the reliance on Structure-from-Motion (SfM) pipeline-derived geometry required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the model reconstructed from sparse input and random initialization; geometric-consistent depth supervision and online synthetic view warping are key to its success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse-view constraints, underscoring its potential in real-world applications.
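The depth supervision mentioned in the abstract builds on depth values rendered from the Gaussians themselves; the caption of Figure 1 names two of them, the alpha-blended depth $d^{\alpha}$ and the peak depth $d^{peak}$. The following is a minimal per-ray sketch, not the authors' implementation. It assumes standard front-to-back alpha compositing with per-primitive opacity values, leaves the blended depth unnormalized, and reads "highest opacity" as the largest per-primitive alpha; the function name `ray_depths` and its arguments are hypothetical.

```python
import numpy as np

def ray_depths(depths, alphas):
    """Return (d_alpha, d_peak) for the Gaussians intersected by one ray.

    depths: depth of each primitive along the ray (any order).
    alphas: opacity contribution of each primitive, in [0, 1].
    Both definitions are assumptions based on standard alpha compositing.
    """
    depths = np.asarray(depths, dtype=float)
    alphas = np.asarray(alphas, dtype=float)

    # Sort front to back so transmittance accumulates in ray order.
    order = np.argsort(depths)
    d, a = depths[order], alphas[order]

    # Transmittance reaching primitive i: T_i = prod_{j<i} (1 - a_j).
    trans = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    weights = a * trans

    # Alpha-blended depth: compositing-weighted sum over all primitives.
    d_alpha = float(np.sum(weights * d))
    # Peak depth: depth of the single primitive with the highest opacity.
    d_peak = float(d[np.argmax(a)])
    return d_alpha, d_peak

d_alpha, d_peak = ray_depths([1.0, 2.0], [0.5, 1.0])
```

In this toy ray the two values differ: the blended depth mixes both primitives, while the peak depth snaps to the opaque one. That difference is what makes a comparison between the two informative.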

Paper Structure (27 sections, 18 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: The alpha-blended depth $d^{\alpha}$ integrates depth across the Gaussian primitives along the ray, whereas the peak depth $d^{peak}$ selects the depth of the primitive with the highest opacity. $d^{\alpha}$ enables reliable online synthetic view warping without external depth information and, in conjunction with $d^{peak}$, guides the online pruning. Both contribute to object Gaussian reconstruction under sparse views.
  • Figure 2: SGPose pipeline. Given sparse RGB images and a random cuboid initialization, the object Gaussian learns the geometry of the target objects under geometric-consistency supervision and renders synthetic views, including images of individual objects and of occluded objects, masks, and dense 2D-3D correspondences. The image rendering loss $\mathcal{L}_{image}$, image warping loss $\mathcal{L}_{warp}$, and geometric-consistent loss $\mathcal{L}_{geo}$ guide the learning process. For pose estimation, objects are detected and cropped from test images by the detector redmon2018yolov3; the rendering results above, as a replacement for CAD models, are fed to the pose estimator liu2022gdrnpp_bop for regression.
  • Figure 3: Qualitative results on LM. Columns (a) and (f) show the CAD models. Columns (b) and (g) illustrate the predicted object poses (green) and ground-truth poses (blue). Columns (c) and (h) are the images rendered from our object Gaussian. Columns (d) and (i) are the 2D-3D correspondence maps generated from our object Gaussian, which also serve as ground truth for the regression network. Columns (e) and (j) are the 2D-3D correspondence maps predicted by our regression network.
  • Figure 4: Qualitative results for each object of synthetic data for the LM-O dataset, where the 2D-3D correspondences are projected onto the target object for visualization.
  • Figure 5: Qualitative results for selected synthetic views. Columns (a) and (e) display the ground truth of the given views; (b) and (f) show the SGPose-rendered images of the given views. Synthetic-view ground truths are in (c) and (g), with their corresponding rendered images in (d) and (h). Synthetic views are generated by applying rotation perturbations of up to ±15° and translation perturbations of ±0.01 m along the x and y axes and ±0.05 m along the z-axis to the given views.
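
The perturbation ranges quoted in the Figure 5 caption can be illustrated with a short sampling sketch. This is only an illustration under stated assumptions, not the authors' code: the caption gives the ranges but not the sampling distribution, the rotation parameterization, or how the perturbation is composed with the given pose. Here a rotation about a random axis with a uniformly sampled angle is left-multiplied onto the given rotation, and translation offsets are drawn uniformly. The function name `perturb_pose` and its parameters are hypothetical.

```python
import numpy as np

def perturb_pose(R, t, rng, max_rot_deg=15.0, max_xy=0.01, max_z=0.05):
    """Sample a synthetic view near a given pose (R: 3x3 rotation, t: 3-vector in metres).

    Assumed scheme: axis-angle rotation with angle in [-max_rot_deg, max_rot_deg],
    uniform translation offsets of up to max_xy on x, y and max_z on z.
    """
    # Random unit rotation axis.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))

    # Rodrigues' formula: dR = I + sin(a) K + (1 - cos(a)) K^2.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    dR = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

    # Uniform translation offsets per axis.
    dt = np.array([rng.uniform(-max_xy, max_xy),
                   rng.uniform(-max_xy, max_xy),
                   rng.uniform(-max_z, max_z)])
    return dR @ np.asarray(R, dtype=float), np.asarray(t, dtype=float) + dt

rng = np.random.default_rng(0)
R_new, t_new = perturb_pose(np.eye(3), np.zeros(3), rng)
```

The larger z range in the caption corresponds to moving the camera toward or away from the object, which changes the apparent scale more gently than an equal shift in the image plane.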