iComMa: Inverting 3D Gaussian Splatting for Camera Pose Estimation via Comparing and Matching

Yuan Sun, Xuan Wang, Yunfan Zhang, Jie Zhang, Caigui Jiang, Yu Guo, Fei Wang

TL;DR

This work tackles 6DoF camera pose estimation without CAD models or training data by inverting a differentiable 3D Gaussian Splatting representation. It fuses a render-and-compare objective with an end-to-end differentiable matching loss (via LoFTR) to handle challenging initializations, using a two-stage optimization in SE(3) with Lie algebra parameterization. Empirical results show that iComMa outperforms iNeRF and standard matching-based methods across diverse synthetic and real-world datasets, while offering substantial speed advantages due to efficient differentiable rendering. The approach provides a practical, robust solution for pose estimation in scenarios with large pose bias and limited texture, enabling faster, more reliable localization for robotics, AR/VR, and SLAM applications.

Abstract

We present a method named iComMa to address the 6D camera pose estimation problem in computer vision. Conventional pose estimation methods typically rely on the target's CAD model or necessitate specific network training tailored to particular object classes. Some existing methods have achieved promising results in mesh-free object and scene pose estimation by inverting the Neural Radiance Fields (NeRF). However, they still struggle with adverse initializations such as large rotations and translations. To address this issue, we propose an efficient method for accurate camera pose estimation by inverting 3D Gaussian Splatting (3DGS). Specifically, a gradient-based differentiable framework optimizes camera pose by minimizing the residual between the query image and the rendered image, requiring no training. An end-to-end matching module is designed to enhance the model's robustness against adverse initializations, while minimizing pixel-level comparing loss aids in precise pose estimation. Experimental results on synthetic and complex real-world data demonstrate the effectiveness of the proposed approach in challenging conditions and the accuracy of camera pose estimation.
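
The TL;DR mentions optimizing the pose in SE(3) with a Lie algebra parameterization. As a rough illustration of what that means (not the authors' implementation), the sketch below maps a 6-vector twist to a 4x4 rigid transform with the standard closed-form exponential map and applies it as an incremental pose update. The names `hat`, `matmul`, `se3_exp`, and `update_pose`, the twist ordering (translation first, rotation second), and the left-multiplicative update convention are all assumptions made for this example.

```python
import math

def hat(w):
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def se3_exp(xi):
    """Map a twist xi = (v, w) in se(3) to a 4x4 transform in SE(3).

    Assumed ordering: xi[:3] is the translational part v,
    xi[3:] is the rotational part w (axis times angle).
    """
    v, w = xi[:3], xi[3:]
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        # Small-angle limits of the coefficients below.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation block.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
         for i in range(3)]
    # V couples rotation and translation: t = V v.
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def update_pose(T, xi):
    """Apply an incremental twist to pose T (left-multiplication assumed)."""
    return matmul(se3_exp(xi), T)
```

Optimizing over the six twist components rather than the twelve matrix entries keeps every iterate a valid rigid transform, which is the usual motivation for this parameterization in gradient-based pose refinement.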

Paper Structure (26 sections, 5 equations, 5 figures, 4 tables)

This paper contains 26 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Given a query image with an unknown camera pose, iComMa accurately estimates it by inverting 3D Gaussian Splatting from a known initial pose (step=0). The gradient information inherent in the differences between the query image and the rendered image (which are overlaid in the figure, with a higher degree of overlap indicating more accurate pose estimation) is utilized for iteratively optimizing the camera pose. Compared to iNeRF, the proposed method employs not only pixel-to-pixel comparing but also 2D keypoint matching; matched keypoints are connected by blue lines in the figure. As a result, our method is capable of precisely estimating camera poses even under poor initial conditions, such as large angular deviations.
  • Figure 2: Overview of iComMa. Given an initial camera pose, iComMa iteratively optimizes to estimate the ground truth pose associated with the query image. For the $t$-th optimization step, we first render the image corresponding to the camera pose $\mathbf{T}_t$ using 3D Gaussian Splatting. Subsequently, we compute the residuals between the rendered image and the query image, which include the matching loss $\mathcal{L}_{Ma}$ obtained from the end-to-end matching module and the per-pixel comparing loss $\mathcal{L}_{Com}$. The entire framework is differentiable, enabling the optimization of camera poses by utilizing the gradients derived from minimizing the residuals.
  • Figure 3: Quantitative Comparison with iNeRF. Each column corresponds to a different type of dataset: the first column covers 8 synthetic datasets, the second covers 8 forward-facing LLFF datasets, and the third covers 11 complex $360^\circ$ scene datasets. Solid curves show the results of iComMa, while dashed curves show those of iNeRF and $\mathrm{iNeRF}^{\dagger}$. Curve color indicates the initialization condition. Due to space limitations, only the main results are provided. For more details, please refer to the supplementary material.
  • Figure 4: Visualization results of pose estimation. The pose estimation results are displayed by overlaying the query images and the rendered images. For the initial poses, only the rendering results from iComMa are shown, since iNeRF starts from the same initial poses. For iComMa, during the initial stages of pose optimization, we also visualize the top 20 keypoint matches with the highest confidence detected by LoFTR. Keypoints in the query image are depicted in green, keypoints in the rendered image are depicted in red, and blue lines connect matching keypoints.
  • Figure 5: Ablation experiment visualization. The images depict the overlay of the query image and the rendered image, where a higher degree of overlap indicates more accurate pose estimation. The first row presents the results of iComMa, the second row showcases the results of iComMa without comparing, and the third row exhibits the results of iComMa without matching. The last column provides detailed views of the highlighted regions within the boxes in the images of the fourth column.
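
The Figure 2 caption describes two residuals between the rendered image and the query image: a matching loss $\mathcal{L}_{Ma}$ and a per-pixel comparing loss $\mathcal{L}_{Com}$. The sketch below shows one plausible way such terms could be computed and combined. It is an illustration under stated assumptions, not the paper's definition: the mean-squared-error form of the comparing term, the mean Euclidean keypoint distance for the matching term, the convex weighting by a hypothetical `lam`, and all function names are chosen here for the example. In the actual method the matched keypoints would come from LoFTR and the rendered image from 3D Gaussian Splatting; here they are plain Python lists.

```python
import math

def comparing_loss(rendered, query):
    """Mean squared per-pixel difference between two equally sized
    grayscale images given as nested lists (assumed form of L_Com)."""
    total, n = 0.0, 0
    for row_r, row_q in zip(rendered, query):
        for p_r, p_q in zip(row_r, row_q):
            total += (p_r - p_q) ** 2
            n += 1
    return total / n

def matching_loss(kpts_rendered, kpts_query):
    """Mean Euclidean distance between matched 2D keypoints
    (assumed form of L_Ma); the i-th entries of both lists are a match."""
    dists = [math.hypot(xr - xq, yr - yq)
             for (xr, yr), (xq, yq) in zip(kpts_rendered, kpts_query)]
    return sum(dists) / len(dists)

def total_loss(rendered, query, kpts_rendered, kpts_query, lam):
    """Convex combination of the two terms; the weighting scheme and
    the parameter lam are assumptions made for this sketch."""
    return (lam * matching_loss(kpts_rendered, kpts_query)
            + (1.0 - lam) * comparing_loss(rendered, query))

# Toy inputs: a 2x2 image pair and two matched keypoints.
rendered = [[0.0, 1.0], [1.0, 0.0]]
query = [[0.0, 0.0], [0.0, 0.0]]
kpts_r = [(0.0, 0.0), (3.0, 4.0)]
kpts_q = [(0.0, 0.0), (0.0, 0.0)]
loss = total_loss(rendered, query, kpts_r, kpts_q, lam=0.5)
```

Under this reading, a two-stage schedule could use `lam > 0` while the rendered and query views are far apart, so that keypoint matches supply a useful gradient, and then set `lam = 0` so the pixel-level term alone refines the final pose.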