SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting
Linqi Yang, Xiongwei Zhao, Qihao Sun, Ke Wang, Ao Chen, Peng Kang
TL;DR
SplatPose estimates an accurate 6-DoF camera pose from a single RGB image by combining a differentiable 3D Gaussian Splatting (3DGS) scene representation with a Dual-Attention Ray Scoring Network (DARS-Net) that decouples position and orientation in the geometry domain. The method is coarse-to-fine: DARS-Net first scores and selects rays to estimate a coarse camera pose; a refinement stage then matches the query image against a 3DGS-rendered view using LoFTR-based 2D-2D correspondences and solves PnP to tighten the pose. With single-view input, it achieves state-of-the-art performance on Mip-NeRF 360°, Tanks & Temples, and 12Scenes. The key innovations are dual-attention ray scoring to mitigate rotational ambiguity, differentiable 3DGS rendering for efficient view synthesis, and a practical refinement loop built on dense feature matching. The approach needs less memory and data than depth-based and multi-view methods while delivering competitive or superior accuracy, enabling robust, near-real-time pose estimation in challenging scenes.
Abstract
6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single-RGB methods often lose accuracy because they rely on an initial pose estimate and are susceptible to rotational ambiguity, while approaches that require depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that combines 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single-RGB settings, rivaling approaches that depend on depth or multi-view images.
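The refinement stage pairs dense 2D-2D matches with PnP, and PnP consumes 2D-3D correspondences. One plausible way to bridge the two, sketched below, is to lift each matched pixel of the synthesized view into 3D using a depth map rendered at the coarse pose. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics `(fx, fy, cx, cy)`, the camera-to-world pose `(R, t)`, and the helper names `backproject`, `cam_to_world`, and `lift_matches` are all hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth along the optical axis -> 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def cam_to_world(p_cam, R, t):
    """Apply a camera-to-world rotation R (3x3 nested list) and translation t (3-vector)."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def lift_matches(matches, depth_map, K, R, t):
    """Turn 2D-2D matches into 2D-3D correspondences for a PnP solver.

    matches:   list of ((uq, vq), (ur, vr)), query pixel and rendered-view pixel
    depth_map: depth rendered at the coarse pose, indexed as depth_map[row][col]
    K:         (fx, fy, cx, cy), assumed pinhole intrinsics of the rendered view
    R, t:      assumed camera-to-world pose of the rendered view
    """
    fx, fy, cx, cy = K
    correspondences = []
    for (uq, vq), (ur, vr) in matches:
        d = depth_map[int(round(vr))][int(round(ur))]
        if d <= 0:  # skip pixels with no valid rendered depth
            continue
        p_world = cam_to_world(backproject(ur, vr, d, fx, fy, cx, cy), R, t)
        correspondences.append(((uq, vq), p_world))
    return correspondences
```

The returned pairs of query pixels and world-space points could then be passed to any standard PnP solver, typically inside a RANSAC loop to reject bad matches, to obtain the refined query pose.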
