SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting

Linqi Yang, Xiongwei Zhao, Qihao Sun, Ke Wang, Ao Chen, Peng Kang

TL;DR

SplatPose estimates a 6-DoF camera pose from a single RGB image by combining a differentiable 3D Gaussian Splatting (3DGS) scene representation with a Dual-Attention Ray Scoring Network (DARS-Net) that decouples position and orientation in the geometry domain. The method is coarse-to-fine: DARS-Net scores and selects rays to estimate a coarse camera pose, and a refinement stage then tightens that pose using LoFTR-based 2D-2D correspondences and PnP. With single-view input, it reports state-of-the-art performance on Mip-NeRF 360°, Tanks & Temples, and 12Scenes. The key components are dual-attention ray scoring to mitigate rotational ambiguity, differentiable 3DGS rendering for efficient view synthesis, and a refinement loop built on dense feature matching. Compared with depth-based and multi-view methods, the approach needs less memory and data while delivering competitive or superior accuracy, enabling robust, near-real-time pose estimation in challenging scenes.

Abstract

6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of pose estimation between 6DGS (Bortolon et al., 2024) and SplatPose. 6DGS selects high-scoring rays solely based on proximity to the camera's optical center, while SplatPose, via DARS-Net, refines pose estimation by incorporating both high-position-scoring rays (closer to the optical center) and high-orientation-scoring rays (aligned with the camera orientation), ultimately achieving smaller rotational errors than 6DGS.
  • Figure 2: An overview of our SplatPose pipeline. Our framework is composed of three key stages: (1) 3D Gaussian Scene Representation, where a 3DGS scene map is constructed from sparse point clouds to initialize the scene representation; (2) DARS-Net and Coarse Estimation, which decouples ray scoring into translation and rotation attention mechanisms, independently computing position and orientation scores for cast rays, selecting top-k rays based on these scores, and leveraging them to estimate the camera's position and orientation; and (3) Pose Refinement, where a synthetic scene view is rendered using the coarse pose, and keypoints are matched between the rendered view and the query image to refine the camera pose.
  • Figure 3: Qualitative results on the Mip-NeRF 360° dataset ((a) and (b)) and the Tanks & Temples dataset ((c) and (d)). From top to bottom: 6DGS (Bortolon et al., 2024), ours, and ground truth. For each scene, the images are rendered from the estimated camera poses using the provided 3DGS model.
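
The coarse stage described in the Figure 2 caption (score the cast rays, keep the top-k, estimate the camera position from them) can be illustrated with a small sketch. This is not the authors' implementation: the scores below are made-up numbers standing in for DARS-Net position scores, the function names are hypothetical, and the weighted least-squares ray intersection is a standard technique assumed here for illustration, since the text above does not specify the solver. Orientation estimation and the refinement stage are not covered.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def select_top_k(scores, k):
    """Indices of the k highest-scoring rays (hypothetical helper)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate ray set (for example, all rays parallel)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def camera_center_from_rays(origins, directions, weights):
    """Point minimizing the weighted squared distance to all rays.

    Each ray is treated as an infinite line o + t*d. With P = I - d d^T
    (projection onto the plane orthogonal to d), the minimizer x solves
    (sum_i w_i P_i) x = sum_i w_i P_i o_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d, w in zip(origins, directions, weights):
        d = normalize(d)
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += w * p
                b[i] += w * p * o[j]
    return solve3(A, b)

# Toy data: three rays that pass through (1, 2, 3) plus one outlier ray.
origins = [[0.0, 2.0, 3.0], [1.0, 0.0, 3.0], [1.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
scores = [0.9, 0.8, 0.7, 0.1]  # placeholder scores, not network output

keep = select_top_k(scores, k=3)
center = camera_center_from_rays(
    [origins[i] for i in keep],
    [directions[i] for i in keep],
    [scores[i] for i in keep],
)
print(keep, [round(c, 6) for c in center])
```

In this toy case the low-scoring outlier ray is dropped by the top-k step, so the three remaining rays intersect exactly and the solver recovers their common point. With noisy rays the same system returns the weighted closest point instead of an exact intersection.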