
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting

Huaqi Tao, Bingxi Liu, Guangcheng Chen, Fulin Tang, Li He, Hong Zhang

Abstract

Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera's pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.

Paper Structure

This paper contains 19 sections, 20 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: SplatHLoc: a novel hierarchical visual relocalization framework based on Feature Gaussian Splatting (FGS). FGS renders color, depth, and feature maps from novel views, which our method exploits to improve the image retrieval and feature matching process. Upon retrieving a reference image, we match it to the query to estimate an initial pose (initial relocalization). We then render views from the estimated pose and iteratively match them to the query to refine the pose (refined relocalization).
  • Figure 2: An overview of the proposed SplatHLoc framework. Starting from a database of reference images, we build an SfM model to initialize the Gaussian primitives, and then train the FGS map, see Section 3.1. SplatHLoc follows a hierarchical relocalization pipeline. (a) In the retrieval stage, we propose an adaptive coarse-to-fine viewpoint retrieval strategy. We first perform the coarse retrieval to obtain retrieved images and then use a lightweight feature matcher to perform geometric verification for each query–retrieved image pair. If geometric verification yields fewer inliers than a threshold, we perform the fine viewpoint retrieval to obtain a fine retrieved pose, see Section 3.2. (b) In the matching stage, we apply the proposed hybrid feature matching strategy to establish 2D–2D correspondences between the query and the retrieved image, see Section 3.3. Next, the rendered depth map lifts the 2D–2D matches to 2D–3D. We then estimate an initial pose using RANSAC-PnP. Rendering from the estimated pose and repeating the matching stage enables pose refinement. The overall relocalization process is further discussed in Section 3.4.
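The adaptive coarse-to-fine retrieval described in the Figure 2 caption can be sketched as the following minimal Python fragment. All names here (`coarse_retrieve`, `fine_viewpoint_retrieval`, the `Candidate` record, the ranking logic) are illustrative stand-ins, not the authors' implementation: in SplatHLoc, fine viewpoint retrieval would synthesize a virtual view from the FGS map, which the stub below only marks.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Candidate:
    image_id: int
    inliers: int        # inlier count from lightweight geometric verification
    fine: bool = False  # whether fine viewpoint retrieval was applied

def coarse_retrieve(query_desc, db, k=3):
    """Stub: rank database entries by similarity to the query's global
    descriptor (here a placeholder ranking by inlier count)."""
    return sorted(db, key=lambda c: -c.inliers)[:k]

def fine_viewpoint_retrieval(query_desc, cand):
    """Stub: synthesize a virtual candidate whose viewpoint is closer to
    the query. In SplatHLoc this renders from the FGS map; here we only
    flag the candidate as refined."""
    return replace(cand, fine=True)

def adaptive_retrieval(query_desc, db, min_inliers=50, k=3):
    """Coarse retrieval, then fine viewpoint retrieval for pairs whose
    geometric verification falls below the inlier threshold."""
    out = []
    for c in coarse_retrieve(query_desc, db, k):
        if c.inliers < min_inliers:  # weak verification -> viewpoint gap too large
            c = fine_viewpoint_retrieval(query_desc, c)
        out.append(c)
    return out

db = [Candidate(0, 80), Candidate(1, 12), Candidate(2, 60), Candidate(3, 5)]
result = adaptive_retrieval(query_desc=None, db=db, min_inliers=50)
print([(c.image_id, c.fine) for c in result])  # → [(0, False), (2, False), (1, True)]
```

The point of the structure is that the expensive fine step runs only for candidates that fail cheap geometric verification, keeping the common case fast.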
  • Figure 3: Illustration of the FGS training process. Feature decoder $d$ is introduced to reduce the dimensionality of the rendered feature $F_r^{\text{low}}$ for improved efficiency and reduced map size.
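To make the role of the decoder $d$ in Figure 3 concrete, here is a minimal NumPy sketch under the assumption (common in feature Gaussian splatting work) that the map stores low-dimensional per-Gaussian features for size and speed, and a learned per-pixel decoder maps the rendered low-dimensional feature map to the extractor's feature dimension. The dimensions and the linear (1×1-conv-style) decoder are illustrative, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 48, 64
D_LOW, D_HIGH = 16, 256  # stored vs. extractor feature dims (illustrative)

# Rendered low-dimensional feature map F_r^low and decoder weights for d.
F_r_low = rng.standard_normal((H, W, D_LOW))
W_dec = rng.standard_normal((D_LOW, D_HIGH)) / np.sqrt(D_LOW)

def decode(feat_low, weight):
    """Apply the per-pixel linear decoder: (H, W, D_low) -> (H, W, D_high)."""
    return feat_low @ weight

F_r = decode(F_r_low, W_dec)
print(F_r.shape)  # → (48, 64, 256)
```

Rasterizing 16 channels per pixel instead of 256, and storing 16-dimensional vectors per Gaussian, is what yields the efficiency and map-size savings the caption mentions; the decoder recovers the full dimensionality only where it is needed.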
  • Figure 4: Visualization of the relocalization errors. Each subfigure is divided by a diagonal: the top-right part shows the query image in grayscale, while the bottom-left part shows the rendered image from the estimated pose. The red dashed boxes highlight regions with pronounced visual differences in each column. More visualizations and details are in the supplementary material.
  • Figure 5: Qualitative comparison of camera pose estimation errors between HLoc [sarlin2019coarse] and our proposed SplatHLoc across five scenes from the 7-Scenes dataset. Visualizations of the remaining two scenes and more details are provided in the supplementary material. For each scene, we visualize the reconstructed point cloud map together with the trajectory of query images. Trajectory colors denote position error, while the color bar below shows rotation errors, with numbers indicating image indices.
  • ...and 9 more figures