GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization

Gennady Sidorov, Malik Mohrat, Denis Gridusov, Ruslan Rakhimov, Sergey Kolyubin

TL;DR

GSplatLoc targets both localization efficiency and accuracy by combining a structure-based coarse pose estimate, obtained from descriptor-embedded 3D Gaussian Splatting, with a rendering-based refinement. It builds the scene representation via feature distillation, derives a coarse pose from 2D-3D correspondences with PnP+RANSAC, and refines it at test time by minimizing a photometric warp loss. Across indoor and outdoor benchmarks, it reports state-of-the-art results among neural render pose methods indoors and outperforms ACE, a scene coordinate regression (SCR) method, outdoors, using only RGB input and offering fast runtimes. The work demonstrates the effectiveness of combining structure-based matching with differentiable rendering for robust, real-time localization in dynamic environments.

Abstract

Although various visual localization approaches exist, such as scene coordinate regression and camera pose regression, these methods often struggle with optimization complexity or limited accuracy. To address these challenges, we explore the use of novel view synthesis techniques, particularly 3D Gaussian Splatting (3DGS), which enables the compact encoding of both 3D geometry and scene appearance. We propose a two-stage procedure that integrates dense and robust keypoint descriptors from the lightweight XFeat feature extractor into 3DGS, enhancing performance in both indoor and outdoor environments. The coarse pose estimates are directly obtained via 2D-3D correspondences between the 3DGS representation and query image descriptors. In the second stage, the initial pose estimate is refined by minimizing the rendering-based photometric warp loss. Benchmarking on widely used indoor and outdoor datasets demonstrates improvements over recent neural rendering-based localization methods, such as NeRFMatch and PNeRFLoc.
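
To make the coarse stage more concrete, the sketch below shows one common way to build 2D-3D correspondences from descriptors: mutual nearest-neighbor matching on cosine similarity. It is a minimal illustration, not the authors' implementation. The function name `mutual_nearest_neighbors`, the `min_sim` threshold, and the assumption that each 3D point carries a single descriptor vector are all hypothetical choices made for this example.

```python
import numpy as np


def mutual_nearest_neighbors(query_desc, map_desc, min_sim=0.8):
    """Match query descriptors to map descriptors (hypothetical helper).

    query_desc: (N, D) array, one descriptor per 2D query keypoint.
    map_desc:   (M, D) array, one descriptor per 3D point in the map.
    Returns a (K, 2) integer array of (query_index, map_index) pairs.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    m = map_desc / np.linalg.norm(map_desc, axis=1, keepdims=True)
    sim = q @ m.T  # shape (N, M)

    best_m_for_q = sim.argmax(axis=1)  # best map point for each query keypoint
    best_q_for_m = sim.argmax(axis=0)  # best query keypoint for each map point

    q_idx = np.arange(len(q))
    # Keep a pair only if each side picks the other (mutual check)
    # and the similarity clears the assumed threshold.
    mutual = best_q_for_m[best_m_for_q] == q_idx
    strong = sim[q_idx, best_m_for_q] >= min_sim
    keep = mutual & strong
    return np.stack([q_idx[keep], best_m_for_q[keep]], axis=1)


# Toy usage: query descriptors are noisy copies of a subset of map descriptors.
rng = np.random.default_rng(0)
demo_map = rng.normal(size=(50, 64))
demo_ids = rng.permutation(50)[:20]
demo_query = demo_map[demo_ids] + 0.01 * rng.normal(size=(20, 64))
demo_matches = mutual_nearest_neighbors(demo_query, demo_map)
```

Each returned pair links a 2D keypoint in the query image to a 3D point, which is the input a PnP solver inside a RANSAC loop needs to estimate the coarse pose.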

Paper Structure

This paper contains 7 sections, 6 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: GSplatLoc constructs a 3D Gaussian Splatting (3DGS) model with distilled descriptor features. For localization, the initial coarse pose is estimated through structural matching with these features and refined during test-time optimization using rendering-based photometric warp loss to enhance accuracy.
  • Figure 2: Overview of the GSplatLoc Base pipeline. First, we model the scene using a feature-based 3D Gaussian Splatting (3DGS) approach, leveraging the XFeat network (Potje et al., CVPR 2024) for feature extraction and distillation. In the test stage, 2D keypoints from the query image are matched to 3D features in the 3DGS model, and the initial coarse pose is computed from these correspondences using a Perspective-n-Point (PnP) solver within a RANSAC loop. We then refine the coarse pose by aligning the image rendered from 3DGS with the input query image using an RGB warping loss. This process enhances pose accuracy via test-time optimization.
  • Figure 3: Test-time camera pose refinement aligns the rendered images to the query image at different optimization iterations. The first row shows the rendered images blended with the query image based on the estimated pose at each step, while the second row visualizes the absolute errors between the two, demonstrating how the warping loss reduces this error over time, thereby improving pose accuracy.
  • Figure 4: Camera pose optimization via the Base variant is performed using a rendering-based photometric warp loss, progressively enhancing accuracy. The plot shows the percentage of frames that fall below the $1\,\text{cm}/1^\circ$ threshold, highlighting the improvement in accuracy over iterations.
  • Figure 5: Qualitative results on Our Custom Dataset. The diagonal line separates the test query images from the renders synthesized using poses estimated by GSplatLoc Fine.
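
The threshold metric mentioned in the Figure 4 caption can be sketched as follows. This is an illustrative computation under stated assumptions, not the paper's evaluation code: the helper names `pose_errors` and `recall_at` are hypothetical, translations are assumed to be camera positions in metres, and rotations are assumed to be 3x3 rotation matrices. The rotation error is the standard geodesic angle between two rotations.

```python
import numpy as np


def pose_errors(R_est, t_est, R_gt, t_gt):
    """Return (translation error in metres, rotation error in degrees)."""
    t_err = float(np.linalg.norm(t_est - t_gt))
    # Angle of the relative rotation R_est^T R_gt, from its trace.
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    r_err = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
    return t_err, r_err


def recall_at(errors, max_t=0.01, max_r_deg=1.0):
    """Percentage of frames with both errors below the thresholds.

    errors: list of (translation_error_m, rotation_error_deg) tuples.
    The defaults correspond to a 1 cm / 1 degree threshold.
    """
    if not errors:
        return 0.0
    ok = [t < max_t and r < max_r_deg for t, r in errors]
    return 100.0 * sum(ok) / len(ok)
```

Evaluating `recall_at` on the per-frame errors after each refinement iteration would yield a curve of the kind the caption describes, with the percentage rising as the refinement reduces pose error.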