iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion

Hao Wang, Linqing Zhao, Xiuwei Xu, Jiwen Lu, Haibin Yan

TL;DR

iGaussian tackles real-time 6DoF pose estimation relative to a prebuilt 3D Gaussian scene with a two-stage feed-forward pipeline. It first regresses a coarse pose from the target image using a Gaussian scene prior with spatial sphere sampling and cross-view attention, then refines it via correspondence-based matching and a ViT-based translation-scale correction. The method combines a Pose Attention network, a Weighted Multiview Predictor, and a Matching+Solver refinement to bypass expensive render-then-compare loops, reporting robust accuracy across NeRF Synthetic, Mip-NeRF 360, and T&T+DB while running at 2.87 FPS. This approach reduces reliance on depth sensors, improves generalization, and is relevant to real-time robotics, visual localization, and AR applications.

Abstract

Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative "render-compare-refine" loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally the discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, a 10x speedup compared to optimization-based approaches. Code: https://github.com/pythongod-exe/iGaussian
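
Rotation error between an estimated and a ground-truth pose is commonly measured as the geodesic angle between the two rotation matrices. The sketch below shows that standard metric in Python with NumPy. The paper's exact evaluation protocol is not given in this excerpt, so the choice of metric and the helper names (rotation_error_deg, median_rotation_error_deg, rot_z) are assumptions for illustration only.

```python
import numpy as np


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_est.T @ R_gt
    # trace(R) = 1 + 2*cos(theta) for a rotation by angle theta
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # clip guards against values slightly outside [-1, 1] from round-off
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def median_rotation_error_deg(estimates, ground_truths):
    """Median of per-frame rotation errors (hypothetical aggregation helper)."""
    errors = [rotation_error_deg(a, b) for a, b in zip(estimates, ground_truths)]
    return float(np.median(errors))


def rot_z(deg):
    """Rotation about the z-axis, used here only to build toy inputs."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    # Two rotations that differ by 0.2 degrees about the same axis
    print(rotation_error_deg(rot_z(10.0), rot_z(10.2)))
```

The median is used rather than the mean because a handful of badly failed frames would otherwise dominate the summary statistic.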

Paper Structure

This paper contains 29 sections, 15 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of existing pose estimation methods based on (a) NeRF [mildenhall2021nerf], (b) 3DGS [kerbl20233d], and (c) our method. Both (a) and (b) rely on multiple "render-compare-refine" iterations for optimization, whereas our approach follows a feed-forward paradigm.
  • Figure 2: Overview of the iGaussian pipeline. Our approach estimates the camera pose $T_{fine}$ of an observed image $I$ using a two-stage pipeline. First, a Pose Attention Network predicts a coarse 6DoF pose $(R, t)$ from the target image and generates a reference view using a 3D Gaussian representation. Then, a matching and solver module refines the pose by computing the relative transformation between the observed and rendered images. A Transformer-based ViT predicts the translation scale, and learned correspondence estimation with geometric constraints enhances accuracy. The framework eliminates iterative rendering, ensuring efficient and precise pose regression.
  • Figure 3: Object-level and scene-level sampling strategies. (a) and (b) illustrate different sampling strategies for objects and scenes, respectively. In both cases, cameras are uniformly distributed on a spherical surface. However, for objects, the camera viewpoints are always directed toward the object’s center, ensuring comprehensive coverage. In contrast, for scenes, the camera viewpoints consistently face away from the scene’s center, capturing a broader environmental context.
  • Figure 4: Results of our two-stage pose estimation strategy. The first stage employs Gaussian Rendering-Based Coarse-Grained Pose Estimation to estimate coarse camera poses, while the second stage applies Correspondence-Based Pose Optimization for refinement. Blue camera poses denote ground truth, with red and purple coordinates representing coarse estimates and fine-grained optimizations, respectively. The comparison between rendered images (solid) and target images (ghosted) visualizes alignment accuracy.
  • Figure 5: Generalization Test on T&T+DB. We evaluate the model’s performance under varying numbers of input images to determine the optimal quantity for training and assess its robustness to input variations.
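
The object-level and scene-level sampling strategies described for Figure 3 can be sketched as follows: cameras are placed roughly uniformly on a sphere and oriented either toward the center (objects) or away from it (scenes). This is a minimal illustration, not the authors' implementation. The Fibonacci lattice, the OpenCV-style camera axes (+x right, +y down, +z forward), the world up vector, and all function names are assumptions, since this excerpt does not specify them.

```python
import numpy as np


def fibonacci_sphere(n, radius=1.0):
    """Approximately uniform points on a sphere (Fibonacci lattice, an assumed scheme)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                      # evenly spaced heights in (-1, 1)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i     # golden-angle steps in azimuth
    r = np.sqrt(1.0 - z * z)
    return radius * np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)


def look_rotation(forward, world_up=(0.0, 0.0, 1.0)):
    """Camera-to-world rotation whose +z column points along `forward`."""
    f = np.asarray(forward, dtype=float)
    f = f / np.linalg.norm(f)
    right = np.cross(f, np.asarray(world_up, dtype=float))
    if np.linalg.norm(right) < 1e-8:           # forward is parallel to world_up
        right = np.cross(f, np.array([1.0, 0.0, 0.0]))
    right = right / np.linalg.norm(right)
    down = np.cross(f, right)                  # completes a right-handed frame
    return np.stack([right, down, f], axis=1)


def sample_cameras(n, radius=1.0, center=(0.0, 0.0, 0.0), inward=True):
    """Return n camera-to-world 4x4 poses on a sphere around `center`.

    inward=True  -> cameras look at the center (object-level sampling)
    inward=False -> cameras look away from the center (scene-level sampling)
    """
    center = np.asarray(center, dtype=float)
    positions = center + fibonacci_sphere(n, radius)
    poses = []
    for p in positions:
        forward = (center - p) if inward else (p - center)
        T = np.eye(4)
        T[:3, :3] = look_rotation(forward)
        T[:3, 3] = p
        poses.append(T)
    return np.stack(poses)


if __name__ == "__main__":
    object_views = sample_cameras(32, radius=2.0, inward=True)
    scene_views = sample_cameras(32, radius=0.5, inward=False)
    print(object_views.shape, scene_views.shape)
```

Both modes share the same sphere sampling and differ only in the sign of the viewing direction, which matches the contrast the caption draws between object and scene coverage.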