SplatLoc: 3D Gaussian Splatting-based Visual Localization for Augmented Reality

Hongjia Zhai, Xiyu Zhang, Boming Zhao, Hai Li, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang

TL;DR

This work develops an unbiased 3D scene-specific descriptor decoder for Gaussian primitives, distilled from a constructed feature volume, and introduces a salient 3D landmark selection algorithm that selects a suitable primitive subset based on the saliency score for localization.

Abstract

Visual localization plays an important role in Augmented Reality (AR) applications: it enables AR devices to obtain their 6-DoF pose in a pre-built map so that virtual content can be rendered in real scenes. However, most existing approaches cannot perform novel view rendering and require large storage capacities for maps. To overcome these limitations, we propose an efficient visual localization method capable of high-quality rendering with fewer parameters. Specifically, our approach leverages 3D Gaussian primitives as the scene representation. To ensure precise 2D-3D correspondences for pose estimation, we develop an unbiased 3D scene-specific descriptor decoder for Gaussian primitives, distilled from a constructed feature volume. Additionally, we introduce a salient 3D landmark selection algorithm that selects a suitable primitive subset based on the saliency score for localization. We further regularize key Gaussian primitives to prevent anisotropic effects, which also improves localization performance. Extensive experiments on two widely used datasets demonstrate that our method achieves superior or comparable rendering and localization performance to state-of-the-art implicit-based visual localization approaches. Project page: https://zju3dv.github.io/splatloc
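
The abstract describes selecting a primitive subset by saliency score but does not spell out the algorithm here. As a rough illustration only, the simplest reading is a top-k selection over per-primitive scores. The function name `select_salient_landmarks` and the parameter `k` are hypothetical, and the paper's actual selection procedure may differ (for example, by also enforcing spatial coverage).

```python
from typing import List, Sequence


def select_salient_landmarks(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring primitives.

    Hypothetical helper: a plain top-k stand-in for the paper's
    salient landmark selection, which is not specified in this summary.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort indices by descending score; ties keep the lower index first.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


if __name__ == "__main__":
    # Assumed per-primitive saliency scores in [0, 1].
    scores = [0.10, 0.85, 0.40, 0.92, 0.05]
    print(select_salient_landmarks(scores, 2))
```

Only the selected primitives would then take part in 2D-3D matching, which is how a subset can reduce map size and matching cost.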

Paper Structure

This paper contains 21 sections, 20 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Reconstruction processes. We incrementally initialize the Gaussian primitives, and each primitive is associated with position $\mu$, rotation $q$, scale $s$, opacity $\sigma$, color $c$, and 3D landmark score $a$. For key Gaussian primitives, we perform soft isotropy and scale regularization to mitigate the anisotropic results. The color loss $\mathcal{L}_{c}$, depth loss $\mathcal{L}_d$, 3D landmark loss $\mathcal{L}_m$, and regularization loss $\mathcal{L}_{reg}$ are used to optimize the properties of each primitive via differentiable rasterization.
  • Figure 2: Illustration of biased and unbiased 3D descriptor field learning. (a) The biased 3D feature optimization of previous works (qin2024langsplat, shi2024_gs_language_embed), which use alpha-blending to obtain the 2D blended feature. (b) Our unbiased 3D feature learning scheme, which directly learns the 3D feature decoder from the constructed feature volume of multi-view feature maps.
  • Figure 3: The pipeline of our unbiased 3D primitive descriptor learning. We first encode images with the 2D CNN model SuperPoint (superpoint) to obtain the multi-view feature maps, then construct the 3D scene feature volume from the depth and pose information. To enhance the representation ability of the 3D feature decoder, we use multi-resolution parametric encoding to aid the 3D scene-specific descriptor learning. In addition, we only sample descriptors on the scene surface for effective distillation.
  • Figure 4: Visualization of novel view synthesis. We show novel view rendering results from different scenes. From top to bottom: PNeRFLoc (pnerfloc), ours, and ground truth. Our rendering results are clearer and contain less noise.
  • Figure 5: Visual localization performance with different resolutions of the parametric encoding. We report median translation and rotation errors (cm, degrees) on two selected scenes.
  • ...and 3 more figures
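
To make the alpha-blending step mentioned in the Figure 2 caption concrete, the sketch below implements standard front-to-back alpha blending of per-primitive features along a single ray. The function name `alpha_blend` and the toy values are illustrative assumptions, not taken from the paper. One way to read the caption's "biased" label is that the blended 2D feature is a weighted mixture, so supervising it constrains the mixture rather than any individual primitive's descriptor.

```python
from typing import List, Sequence


def alpha_blend(features: Sequence[Sequence[float]],
                alphas: Sequence[float]) -> List[float]:
    """Front-to-back alpha blending of per-primitive features on one ray.

    features: one feature vector per primitive, ordered near to far.
    alphas:   per-primitive opacity contribution on this ray, in [0, 1].
    """
    dim = len(features[0])
    blended = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet occluded
    for feat, alpha in zip(features, alphas):
        weight = alpha * transmittance
        for d in range(dim):
            blended[d] += weight * feat[d]
        transmittance *= 1.0 - alpha
    return blended


if __name__ == "__main__":
    # Two primitives with orthogonal toy descriptors (assumed values).
    features = [[1.0, 0.0], [0.0, 1.0]]
    alphas = [0.5, 0.5]
    print(alpha_blend(features, alphas))  # [0.5, 0.25]
```

In this toy case the blended feature `[0.5, 0.25]` matches neither primitive's own descriptor, which illustrates why a scheme that learns descriptors directly in 3D, as in panel (b), avoids going through the blended 2D feature.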