Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression

Huy-Hoang Bui, Bach-Thuan Bui, Dinh-Tuan Tran, Joo-Ho Lee

TL;DR

The paper tackles data-scarce visual localization by augmenting descriptor-based KSCR (D2S) with a NeRF-driven data synthesis pipeline. It trains a Nerfacto NeRF, synthesizes novel views through pose interpolation and view rendering, and uses robust feature matching to integrate synthetic data into KSCR training. Empirical results on 7Scenes and 12Scenes show improved translation and rotation accuracy, outperforming several SCR and few-shot baselines while requiring fewer real-world images. The approach is modular and scalable, with potential for incorporating multiple NeRFs, though outdoor and dynamic environments remain a challenge for NeRF-based rendering.

Abstract

Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent keypoint scene coordinate regression (KSCR) method, D2S, addresses these issues by using graph attention networks to enhance keypoint relationships and a simple multilayer perceptron (MLP) to predict the keypoints' 3D coordinates. Camera pose is then determined via PnP+RANSAC from the established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance degrades when training samples are limited, because the deep learning model relies on extensive data. This paper addresses this challenge by introducing a pipeline for keypoint descriptor synthesis using a Neural Radiance Field (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: https://github.com/ais-lab/DescriptorSynthesis4Feat2Map.

Paper Structure (21 sections, 10 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Performance degradation of D2S [bui_d2s_2023] when training data is reduced. Average translation error (cm) and rotation error (degrees) over all scenes in the 7Scenes dataset are reported. The amount of training data ranges from 25% down to 1%.
  • Figure 2: Descriptor synthesis pipeline for D2S. First, the NeRF model is trained with the available images in the dataset (a). Then new camera poses are generated by uniformly sampling between the translations of existing poses and interpolating between their quaternions (b). Next, the new poses are fed into the trained NeRF for novel view synthesis (c). 2D-2D correspondences between reference frames and novel frames are established, and descriptors are extracted from the novel frames (d). Finally, D2S is trained with both the original and the synthesized data (e).
  • Figure 3: Illustration of the camera positions of training images and the generated camera positions. The top row depicts the camera positions of the training images, and the bottom row shows the camera positions generated for NeRF rendering.
  • Figure 4: Synthesized images from NeRF. The columns from left to right show synthesized images for Chess, Fire, and Heads, respectively. The first row shows favorable results, while the second row shows failure cases that were filtered out.
  • Figure 5: Keypoint matching results between real and synthetic images. The top row shows reference images, while the bottom row shows synthesized images.
  • ...and 1 more figure
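
The pose generation step described in the Figure 2 caption (sampling between translations and interpolating quaternions between poses) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lerp`, `slerp`, and `synthesize_poses`, the `(w, x, y, z)` quaternion order, and the choice to reuse a single interpolation factor for both translation and rotation are assumptions made here for illustration.

```python
import math
import random


def lerp(a, b, t):
    """Linearly interpolate between two 3D translations."""
    return tuple((1.0 - t) * x + t * y for x, y in zip(a, b))


def normalize(q):
    """Scale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def slerp(q0, q1, t):
    """Spherical linear interpolation between quaternions in (w, x, y, z) order."""
    q0, q1 = normalize(q0), normalize(q1)
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one to follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    # Nearly identical rotations: normalized linear interpolation is stable.
    if dot > 0.9995:
        return normalize(tuple((1.0 - t) * a + t * b for a, b in zip(q0, q1)))
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))


def synthesize_poses(pose_a, pose_b, n, rng=None):
    """Sample n poses between two (translation, quaternion) pairs.

    Assumption: one uniform factor t drives both translation and rotation.
    """
    rng = rng or random.Random(0)
    poses = []
    for _ in range(n):
        t = rng.uniform(0.0, 1.0)
        poses.append((lerp(pose_a[0], pose_b[0], t),
                      slerp(pose_a[1], pose_b[1], t)))
    return poses
```

Each generated pose would then be passed to the trained NeRF for rendering. Spherical interpolation is used for the rotation because it keeps the result a unit quaternion along the shortest arc, which component-wise averaging does not guarantee.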