The NeRFect Match: Exploring NeRF Features for Visual Localization

Qunjie Zhou, Maxim Maximov, Or Litany, Laura Leal-Taixé

TL;DR

This paper investigates using NeRF as the primary scene representation for visual localization and introduces NeRFMatch, an image-to-NeRF matching framework that aligns 2D image features with NeRF-derived 3D features. A lightweight NeRFMatch-Mini and a full attention-based NeRFMatch are presented, paired with two pose refinement strategies (iterative and optimization-based) to produce a hierarchical NeRF localization pipeline. Experiments on Cambridge Landmarks and 7-Scenes demonstrate competitive results, highlight the discriminative power of NeRF internal features for 2D-3D matching, and reveal indoor localization challenges. The work points toward NeRF-only localization possibilities and outlines limitations and avenues for improving indoor performance and scalability.

Abstract

In this work, we propose the use of Neural Radiance Fields (NeRF) as a scene representation for visual localization. Recently, NeRF has been employed to enhance pose regression and scene coordinate regression models by augmenting the training database, providing auxiliary supervision through rendered images, or serving as an iterative refinement module. We extend its recognized advantages -- its ability to provide a compact scene representation with realistic appearances and accurate geometry -- by exploring the potential of NeRF's internal features in establishing precise 2D-3D matches for localization. To this end, we conduct a comprehensive examination of NeRF's implicit knowledge, acquired through view synthesis, for matching under various conditions. This includes exploring different matching network architectures, extracting encoder features at multiple layers, and varying training configurations. Significantly, we introduce NeRFMatch, an advanced 2D-3D matching function that capitalizes on the internal knowledge of NeRF learned via view synthesis. Our evaluation of NeRFMatch on standard localization benchmarks, within a structure-based pipeline, sets a new state-of-the-art for localization performance on Cambridge Landmarks.
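
To make the idea of a "2D-3D matching function" concrete, the sketch below pairs image features with 3D point features by similarity. It is an illustration under stated assumptions, not the NeRFMatch architecture: the abstract does not specify the matching head, so the cosine similarity, the dual-softmax scoring, the mutual nearest-neighbour check, and the names `mutual_nn_matches`, `temperature`, and `min_score` are all hypothetical choices made for this example.

```python
import numpy as np


def _softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mutual_nn_matches(feat_2d, feat_3d, temperature=0.1, min_score=0.0):
    """Match N image features (N, C) to M 3D point features (M, C).

    Returns an integer array of shape (K, 2) holding (pixel index, point index)
    pairs and a float array of shape (K,) with their scores.
    Assumption: both feature sets already live in a shared C-dimensional space.
    """
    # Cosine similarity between every pixel feature and every point feature.
    a = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)
    b = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    sim = (a @ b.T) / temperature

    # Dual softmax: a pair scores highly only if it dominates its row and column.
    score = _softmax(sim, axis=1) * _softmax(sim, axis=0)

    best_point = score.argmax(axis=1)  # best 3D point for each pixel
    best_pixel = score.argmax(axis=0)  # best pixel for each 3D point
    pixels = np.arange(score.shape[0])

    # Keep a pair only if each side picks the other (mutual nearest neighbours).
    mutual = best_pixel[best_point] == pixels
    keep = mutual & (score[pixels, best_point] > min_score)
    matches = np.stack([pixels[keep], best_point[keep]], axis=1)
    return matches, score[pixels[keep], best_point[keep]]
```

In a structure-based pipeline, the resulting pixel-to-point pairs would typically be passed to a PnP solver inside a RANSAC loop to obtain the camera pose; that step is omitted here.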

Paper Structure (21 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: NeRF-based localization overview. In this work, we propose to use NeRF as our scene representation for visual localization. Given a query image, we first retrieve its nearest reference pose using image retrieval, then use NeRFMatch to establish 2D-3D correspondences between the query image and the NeRF scene points to compute an initial pose estimate and finally improve its accuracy via pose refinement.
  • Figure 1: Example of masking on the King's College scene. Top row: original images. Bottom row: semantic segmentation using cheng2022masked.
  • Figure 2: An overview of the standard NeRF architecture. The input consists of a scene coordinate $X$ and ray directions $d$. The outputs include color $c$ and density $\sigma$. We obtain intermediate features, denoted as $f^{j}$, using volumetric rendering.
  • Figure 2: Example of masking on the King's College scene of Cambridge Landmarks kendall2015posenet. The top row shows ground-truth images; the bottom row shows images rendered with NeRF.
  • Figure 3: NeRFMatch architecture. We present NeRFMatch as our full matching model (rightmost) and NeRFMatch-Mini as a light version of it (middle). Both models share the same feature extraction process, where we use a 2D encoder to extract image features at two resolutions and render 3D points with associated NeRF features at sampled 2D pixel locations from the reference viewpoint. The full matching uses self-attention (SA) and cross-attention (CA) with positional encodings (PE).
  • ...and 1 more figures
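
The Figure 2 caption mentions obtaining intermediate features $f^{j}$ through volumetric rendering. As a rough sketch of what that can look like for a single ray, the code below applies the standard NeRF quadrature weights to per-sample feature vectors instead of colors. The function name `render_ray_features` and its argument layout are assumptions made for this example, and the paper's exact procedure may differ.

```python
import numpy as np


def render_ray_features(sigma, deltas, feats):
    """Composite per-sample features along one ray.

    sigma:  (S,)   densities at the S samples along the ray
    deltas: (S,)   distances between consecutive samples
    feats:  (S, C) feature vector at each sample (hypothetical layer output)
    Returns the rendered feature (C,) and the compositing weights (S,).
    """
    # Opacity of each sample interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    # Weighted sum of sample features, as is done for color in NeRF.
    return weights @ feats, weights
```

Applying the same weights to the sample positions instead of the features gives an expected 3D point along the ray, which is one common way to associate a rendered feature with a scene coordinate.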