NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features

Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang

TL;DR

Problem: visual localization with NeRF-based representations requires accurate 6-DoF pose estimation while maintaining compact scene models. Approach: NeuraLoc learns a neural implicit map with a 3D descriptor field $g^{3D}$ and a semantic contextual feature field $f^{3D}$, distills 2D descriptors from SuperPoint and contextual features from SAM, and uses descriptor similarity distribution alignment via KL divergence to bridge 2D-3D spaces; a dual-feature matching graph establishes robust 2D-3D correspondences for pose estimation. Contributions: implicit 3D descriptor field reducing per-point storage, semantic contextual feature field for robust matching, a similarity distribution alignment loss $\mathcal{L}_{kl}$ to reduce domain gap, and a matching graph leveraging both descriptors and contextual features for accurate 6-DoF pose. Findings: achieves $3\times$ faster training and $45\times$ fewer model parameters than recent NeRF-based methods, with state-of-the-art or competitive localization on Replica and 12-Scenes, validating both efficiency and accuracy for practical deployment.

Abstract

Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3$\times$ faster training speed and a 45$\times$ reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: https://zju3dv.github.io/neuraloc
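The abstract's point about semantic ambiguity can be illustrated with a small sketch: two keypoints whose descriptors are nearly identical become separable once a contextual feature is added to the matching score. This is not the paper's matching graph, whose construction is not described in this summary. The weighted sum of cosine similarities, the `ctx_weight` parameter, the mutual-nearest-neighbour rule, and all function names are assumptions chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def combined_scores(desc_2d, ctx_2d, desc_3d, ctx_3d, ctx_weight=0.5):
    """Score every (2D keypoint, 3D point) pair.

    Assumption: descriptor and contextual similarities are blended with a
    fixed weight; the source does not state how the two are combined.
    """
    return [
        [
            (1.0 - ctx_weight) * cosine(desc_2d[i], desc_3d[j])
            + ctx_weight * cosine(ctx_2d[i], ctx_3d[j])
            for j in range(len(desc_3d))
        ]
        for i in range(len(desc_2d))
    ]

def mutual_nearest_matches(scores):
    """Keep (i, j) only if j is the best column for row i and i the best row for column j."""
    best_col = [max(range(len(row)), key=row.__getitem__) for row in scores]
    best_row = [
        max(range(len(scores)), key=lambda i: scores[i][j])
        for j in range(len(scores[0]))
    ]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]

# Toy data: the descriptors are almost indistinguishable, the contexts are not.
desc_2d = [[1.0, 0.0], [1.0, 0.05]]
ctx_2d = [[1.0, 0.0], [0.0, 1.0]]
desc_3d = [[1.0, 0.02], [1.0, 0.03]]
ctx_3d = [[0.0, 1.0], [1.0, 0.0]]

matches = mutual_nearest_matches(combined_scores(desc_2d, ctx_2d, desc_3d, ctx_3d))
print(matches)  # [(0, 1), (1, 0)]
```

In a full pipeline, the resulting 2D-3D pairs would be passed to a standard PnP solver with RANSAC to recover the 6-DoF pose; that step is omitted here.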

Paper Structure

This paper contains 14 sections, 14 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The whole pipeline of our system. (1) Reconstruction: We employ different parametric encodings ($\mathcal{T}_{geo}$ and $\mathcal{T}_{sem}$) for the geometry and semantic branches. Scene properties, including color $c$, SDF $\sigma$, semantic contextual feature $f^{3D}$, and keypoint descriptor $g^{3D}$, are produced by separate shallow decoders ($\mathcal{M}_{geo}$ and $\mathcal{M}_{sem}$). We use pre-trained models (SuperPoint and SAM) to generate 2D feature maps for the optimization of the semantic branch. (2) Localization: We extract 2D descriptors and semantic contextual features for the query image to build the matching graph with 3D points. Then, we estimate the 6-DoF pose based on the 2D-3D correspondences.
  • Figure 2: Descriptor similarity alignment. To reduce the domain gap, we perform the similarity distribution alignment between the 2D-2D and 2D-3D similarity distribution for better optimization.
  • Figure 3: Trajectory visualization of two selected scenes.
  • Figure 4: Qualitative results of feature matching. We show some matching results of our method on the Replica and 12-Scenes datasets.
  • Figure 5: Median localization errors (cm, degree) of using different features.
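
The similarity distribution alignment described for Figure 2 can be sketched as a KL divergence between two softmax distributions: one over 2D-2D descriptor similarities and one over 2D-3D similarities for the same query. The following is a minimal sketch under stated assumptions, not the paper's definition of $\mathcal{L}_{kl}$. The KL direction (2D-2D as the target), the `temperature` value, the use of cosine similarity, and all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def similarity_alignment_loss(query_2d, refs_2d, refs_3d, temperature=0.1):
    """KL between the 2D-2D and 2D-3D similarity distributions of one query.

    Assumption: refs_2d[k] and refs_3d[k] describe the same scene point, and
    the 2D-2D distribution is treated as the target.
    """
    p = softmax([cosine(query_2d, r) for r in refs_2d], temperature)
    q = softmax([cosine(query_2d, r) for r in refs_3d], temperature)
    return kl_divergence(p, q)

query = [1.0, 0.0]
refs_2d = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
aligned_3d = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # 3D field agrees with 2D
shifted_3d = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]   # 3D field disagrees

loss_aligned = similarity_alignment_loss(query, refs_2d, aligned_3d)
loss_shifted = similarity_alignment_loss(query, refs_2d, shifted_3d)
print(loss_aligned, loss_shifted)
```

The loss is zero when the 3D descriptors induce the same similarity distribution as the 2D ones and grows as the two distributions diverge, which is the behaviour an alignment term needs to pull the 3D descriptor field toward the 2D feature space.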