FaVoR: Features via Voxel Rendering for Camera Relocalization

Vincenzo Polizzi, Marco Cannici, Davide Scaramuzza, Jonathan Kelly

TL;DR

FaVoR addresses robust camera relocalization under challenging viewpoint and appearance changes by building a globally sparse, locally dense voxel map of tracked landmarks and rendering high-dimensional descriptors via volumetric rendering. Each landmark is represented by a voxel grid with an associated density field, trained to reproduce the descriptor patches observed from multiple views. At test time, descriptors are rendered from the voxel map at the queried pose, and the resulting 2D-3D correspondences are used for pose estimation through an iterative Render+PnP-RANSAC procedure. The approach reduces median translation error by up to 39% indoors on 7-Scenes and gives competitive outdoor results on Cambridge Landmarks, with lower memory and computation than dense NeRF-based methods. Because each voxel is independent, training and rendering can run in parallel, which allows scalable integration into existing track-based localization pipelines and leaves room for further speedups and robustness enhancements, such as occlusion-aware retrieval using full-image descriptors.

Abstract

Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
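
As a rough illustration of the descriptor rendering step mentioned in the abstract, the sketch below alpha-composites per-sample descriptors along a single ray using the standard volumetric rendering quadrature. It is a minimal sketch and not the authors' implementation: the function name `render_descriptor`, its arguments, and the assumption that densities and descriptors have already been interpolated from a landmark's voxel grid at each ray sample are all hypothetical.

```python
import math


def render_descriptor(densities, descriptors, deltas):
    """Alpha-composite per-sample descriptors along one ray.

    densities   : list of non-negative density values, one per ray sample
    descriptors : list of descriptor vectors (lists of floats), one per sample
    deltas      : list of distances between consecutive samples

    Returns the rendered descriptor and the transmittance left after the
    last sample. All inputs are assumed (hypothetically) to have been
    interpolated from a voxel grid beforehand.
    """
    dim = len(descriptors[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed

    for sigma, desc, delta in zip(densities, descriptors, deltas):
        # Opacity of this segment under the usual exponential model.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(dim):
            rendered[k] += weight * desc[k]
        transmittance *= 1.0 - alpha

    return rendered, transmittance


if __name__ == "__main__":
    # Two samples: empty space followed by a nearly opaque sample.
    out, remaining = render_descriptor(
        densities=[0.0, 50.0],
        descriptors=[[1.0, 0.0], [0.0, 1.0]],
        deltas=[0.1, 0.1],
    )
    print(out, remaining)
```

In this sketch, samples in empty space (zero density) contribute nothing, so the rendered descriptor is dominated by the samples where the density field is high. Whether the rendered vector is normalized afterwards is left open here, since the text above does not say.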

Paper Structure (19 sections, 14 equations, 7 figures, 8 tables)

This paper contains 19 sections, 14 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Schematic representation of the proposed method. 1) We track and triangulate feature points to create a voxel representation for persistent landmarks, i.e., landmarks observed by a large number of views. 2) The voxels are then optimized to render the descriptor patches extracted during feature tracking. 3) At test time, the voxels are queried to render the descriptors as seen from a given query pose, and the 2D-3D matches between the query image and the landmarks are found and used to perform pose estimation.
  • Figure 2: Visualization of similarity response. We render a feature tracked during training using the Alike-l descriptor from an unseen view. Panel a) shows, in red, the ground-truth positions of the rendered feature points, obtained by projecting the triangulated landmarks onto the camera plane. Panels b) and c) show the similarity response between the rendered features and the dense descriptor map of the target image. Yellow indicates a strong response, concentrated around the feature positions shown in a), demonstrating the effectiveness of our descriptor rendering approach. The small blue circle is centred at the strongest response; the red circle is centred at the projected landmark position. The top three response peaks per keypoint, obtained by performing non-maxima suppression with a radius of three pixels, are 0.897, 0.868, and 0.836 for b) and 0.856, 0.786, and 0.737 for c), showing that our method renders descriptors that are distinctive in the image.
  • Figure 3: Median similarity score versus viewing angle. In blue is the smoothed median score for FaVoR (Alike-l), obtained by convolving the descriptors rendered at different view angles with the corresponding dense descriptor map of each query image. In orange is the smoothed median score of Alike-l features extracted from the starting image (at angle 0 deg) convolved with the subsequent images in the test sequence.
  • Figure 4: PSNR and model size versus grid resolution. We report the median peak signal-to-noise ratio (PSNR) and the average checkpoint size for FaVoR (Alike-l) at different grid resolutions of the voxel representation.
  • Figure 5: Similarity response scores versus grid resolution at different view angles. We compare the capacity of different grid resolutions to yield high similarity scores at different view angles. Higher scores lead to better matching, meaning that the rendered descriptors closely match the appearance of those extracted by Alike-l [Zhao2022ALIKE].
  • ...and 2 more figures
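
The similarity response and peak selection described in the Figure 2 caption can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names `similarity_response` and `top_peaks` are hypothetical, descriptors are assumed to be unit-norm so that a dot product equals cosine similarity, and non-maxima suppression is approximated with a square window of the given radius rather than a circular one.

```python
def similarity_response(dense_map, descriptor):
    """Dot product between one descriptor and every pixel of a dense map.

    dense_map  : H x W x D nested lists (one D-dim descriptor per pixel)
    descriptor : list of D floats (e.g. a rendered descriptor)
    Returns an H x W nested list of similarity scores.
    """
    return [
        [sum(a * b for a, b in zip(pixel, descriptor)) for pixel in row]
        for row in dense_map
    ]


def top_peaks(response, radius=3, k=3):
    """Return the k strongest local maxima as (score, (row, col)) tuples.

    A pixel is kept if no value inside the (2 * radius + 1)-wide square
    window around it is larger (simple non-maxima suppression).
    Flat regions produce several tied peaks in this simple version.
    """
    height, width = len(response), len(response[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = response[y][x]
            window_max = max(
                response[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            )
            if value >= window_max:
                peaks.append((value, (y, x)))
    peaks.sort(reverse=True)
    return peaks[:k]


if __name__ == "__main__":
    # Toy 5 x 5 map: one pixel matches the query descriptor exactly.
    dense = [[[0.0, 1.0] for _ in range(5)] for _ in range(5)]
    dense[2][2] = [1.0, 0.0]
    resp = similarity_response(dense, [1.0, 0.0])
    print(top_peaks(resp, radius=1, k=1))
```

In a matching pipeline of the kind the captions describe, the location of the strongest peak would serve as the 2D match for the landmark whose descriptor was rendered; how ties and weak peaks are handled is a design choice this sketch does not address.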