iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion

Hao Wang; Linqing Zhao; Xiuwei Xu; Jiwen Lu; Haibin Yan

TL;DR

iGaussian tackles real-time 6DoF camera pose estimation relative to a prebuilt 3D Gaussian scene with a two-stage feed-forward pipeline. It first regresses a coarse pose from the target image using a Gaussian scene prior with spatial sphere sampling and cross-view attention, then refines that pose via correspondence-based matching and a ViT-enabled translation-scale correction. The method combines a Pose Attention network, a Weighted Multiview Predictor, and a Matching+Solver refinement to bypass expensive render-then-compare loops, delivering robust accuracy across NeRF Synthetic, Mip-NeRF 360, and T&T+DB while running in real time (2.87 FPS). This approach reduces reliance on depth sensors, improves generalization, and is directly relevant to real-time robotics, visual localization, and AR applications.
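To make the "spatial sphere sampling" step concrete, here is a minimal sketch of placing candidate viewpoints roughly uniformly on a sphere around the scene. The summary does not say which sampling scheme iGaussian uses, so the Fibonacci lattice below is an assumption, and the names `fibonacci_sphere`, `n`, and `radius` are hypothetical.

```python
import math

def fibonacci_sphere(n, radius=1.0):
    """Return n roughly uniform points on a sphere centered at the origin.

    Assumption: a Fibonacci lattice stands in for the paper's unspecified
    sphere sampling scheme.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Heights are spaced evenly in z, which gives equal-area bands.
        z = 1.0 - (2.0 * i + 1.0) / n
        ring_radius = math.sqrt(max(0.0, 1.0 - z * z))
        # Advance the azimuth by the golden angle so points do not line up.
        theta = golden_angle * i
        points.append((radius * ring_radius * math.cos(theta),
                       radius * ring_radius * math.sin(theta),
                       radius * z))
    return points

# Hypothetical usage: 16 candidate camera centers at distance 4 from the scene.
viewpoints = fibonacci_sphere(16, radius=4.0)
```

Each returned point could serve as a camera center looking at the scene origin; how the sampled views are actually fed to the Weighted Multiview Predictor is not described in this summary.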

Abstract

Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, a 10× speedup compared to optimization-based approaches. Code: https://github.com/pythongod-exe/iGaussian
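For readers unfamiliar with the reported metric, the sketch below shows one common way to compute a median rotation error in degrees: the geodesic angle between the estimated and ground-truth rotation matrices, arccos((trace(R_est^T R_gt) - 1) / 2). The abstract does not spell out the exact evaluation protocol, so this formula and the helper names (`rotation_error_deg`, `rot_z`, `median_rotation_error_deg`) are assumptions for illustration only.

```python
import math
from statistics import median

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Rotation by `deg` degrees about the z axis (helper for the demo)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def median_rotation_error_deg(pairs):
    """Median error over (estimated, ground-truth) rotation pairs."""
    return median(rotation_error_deg(R_est, R_gt) for R_est, R_gt in pairs)

# Hypothetical usage with synthetic poses, not data from the paper.
identity = rot_z(0.0)
pairs = [(rot_z(0.1), identity), (rot_z(0.2), identity), (rot_z(0.5), identity)]
summary = median_rotation_error_deg(pairs)
```

The median is typically preferred over the mean for pose benchmarks because a few gross failures would otherwise dominate the summary.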
