NeRF-Supervised Feature Point Detection and Description

Ali Youssef, Francisco Vasconcelos

TL;DR

A novel approach leveraging Neural Radiance Fields (NeRFs) to generate a diverse and realistic dataset consisting of indoor and outdoor scenes, achieving competitive or superior performance on standard benchmarks for relative pose estimation, point cloud registration, and homography estimation while requiring significantly less training data and time compared to existing approaches.

Abstract

Feature point detection and description is the backbone for various computer vision applications, such as Structure-from-Motion, visual SLAM, and visual place recognition. While learning-based methods have surpassed traditional handcrafted techniques, their training often relies on simplistic homography-based simulations of multi-view perspectives, limiting model generalisability. This paper presents a novel approach leveraging Neural Radiance Fields (NeRFs) to generate a diverse and realistic dataset consisting of indoor and outdoor scenes. Our proposed methodology adapts state-of-the-art feature detectors and descriptors for training on multi-view NeRF-synthesised data, with supervision achieved through perspective projective geometry. Experiments demonstrate that the proposed methodology achieves competitive or superior performance on standard benchmarks for relative pose estimation, point cloud registration, and homography estimation while requiring significantly less training data and time compared to existing approaches.
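The "perspective projective geometry" supervision mentioned in the abstract can be illustrated with a generic pinhole-camera sketch: a pixel with known depth in one view is back-projected to 3D, moved into the second camera's frame, and re-projected. This is not the authors' implementation. The function name `project_point`, the intrinsics `fx, fy, cx, cy`, the convention `X2 = R @ X1 + t`, and the use of depth along the optical axis (rather than distance along the ray) are all assumptions made for illustration.

```python
def project_point(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with known depth in view 1 to pixel coordinates in view 2.

    Assumptions (hypothetical, not taken from the paper):
      - both views share the same pinhole intrinsics fx, fy, cx, cy
      - depth is measured along the optical axis of camera 1
      - R (3x3 nested lists) and t (length 3) map camera-1 coordinates
        into camera-2 coordinates: X2 = R * X1 + t
    Returns None when the point falls behind camera 2.
    """
    # Back-project the pixel to a 3D point in the camera-1 frame.
    p1 = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Rigidly transform the point into the camera-2 frame.
    p2 = [sum(R[i][j] * p1[j] for j in range(3)) + t[i] for i in range(3)]

    # Points with non-positive depth are not visible in view 2.
    if p2[2] <= 0:
        return None

    # Perspective projection onto the image plane of camera 2.
    return (fx * p2[0] / p2[2] + cx, fy * p2[1] / p2[2] + cy)


# Usage: a point at the principal point, 2 units deep, seen after a small
# sideways camera shift (identity rotation).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uv2 = project_point(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                    identity, [0.1, 0.0, 0.0])
```

With per-pixel depth and known camera poses, as a NeRF render can provide, this kind of mapping gives point correspondences between views without resorting to a homography.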

Paper Structure (28 sections, 6 equations, 4 figures, 8 tables)

This paper contains 28 sections, 6 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Visual representation of multi-view data. Learning-based detectors and descriptors achieve supervision on single-view datasets by simulating different viewpoints: a homographic warp is applied to the input image $I$ (cf. Fig. \ref{fig:short-a}), resulting in $I'$ (cf. Fig. \ref{fig:short-b}). In contrast, we achieve supervision by directly sampling a NeRF-rendered image from a different viewpoint (cf. Fig. \ref{fig:short-c}).
  • Figure 2: Depth window estimation. The interest points depicted in red and purple, situated at the edge of the painting's frame in Fig. \ref{fig:depth-a}, are misprojected onto image $I'$ when the depth window estimation method is not used. These misprojection errors are effectively mitigated by the depth window estimation method, as seen in Fig. \ref{fig:depth-c}.
  • Figure 3: Rotation and scale invariance. Fig. \ref{fig:H-b} shows that SiLK-PrP lacks rotation and scale invariance compared to SiLK (cf. Fig. \ref{fig:H-a}). Incorporating rotation and scaling augmentations during SiLK-PrP's training yields rotation and scale invariance (cf. Fig. \ref{fig:H-c}).
  • Figure 4: Angular translation error instability. Density plot illustrating how the angular translation error becomes unstable when the ground-truth relative translation between two camera viewpoints is minimal.
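The instability described for Figure 4 can be reproduced with a small numerical sketch. It assumes the common definition of angular translation error as the angle between the estimated and ground-truth translation vectors; the paper's exact formula is not given in this section, and the names `angular_translation_error` and `errors_for_baselines` are hypothetical. The same small absolute perturbation is added to ground-truth translations of shrinking magnitude.

```python
import math


def angular_translation_error(t_est, t_gt):
    """Angle in degrees between an estimated and a ground-truth translation.

    Assumed definition (not confirmed by the source): the angle between the
    two 3-vectors, which ignores their magnitudes.
    """
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm_est = math.sqrt(sum(a * a for a in t_est))
    norm_gt = math.sqrt(sum(b * b for b in t_gt))
    if norm_est == 0.0 or norm_gt == 0.0:
        raise ValueError("angle is undefined for a zero translation")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_est * norm_gt)))
    return math.degrees(math.acos(cosine))


def errors_for_baselines(baselines, perturbation=(0.0, 0.01, 0.0)):
    """Apply one fixed perturbation to ground truths of different lengths."""
    errors = []
    for baseline in baselines:
        t_gt = (baseline, 0.0, 0.0)
        t_est = tuple(g + p for g, p in zip(t_gt, perturbation))
        errors.append(angular_translation_error(t_est, t_gt))
    return errors


# The perturbation is identical in every case; only the baseline shrinks.
errors = errors_for_baselines([1.0, 0.1, 0.01])
```

Because the angle depends on the ratio between the perturbation and the baseline length, the error grows sharply as the ground-truth translation approaches zero, even though the absolute estimation error is unchanged.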