FaVoR: Features via Voxel Rendering for Camera Relocalization
Vincenzo Polizzi, Marco Cannici, Davide Scaramuzza, Jonathan Kelly
TL;DR
FaVoR addresses robust camera relocalization under challenging viewpoint and appearance changes by building a globally sparse, locally dense voxel map of tracked landmarks and rendering high-dimensional descriptors via volumetric rendering. Each landmark is represented by a voxel grid with an associated density field, trained to reproduce the descriptor patches observed from multiple views. At test time, descriptors are rendered from the voxel map at the queried pose, and the resulting 2D-3D correspondences are used for pose estimation through an iterative Render+PnP-RANSAC procedure. The approach reduces median translation error by up to 39% indoors on 7-Scenes and gives competitive outdoor results on Cambridge Landmarks, with lower memory and computation than dense NeRF-based methods. Because each voxel grid is independent, training and rendering can run in parallel, which makes FaVoR straightforward to integrate into existing track-based localization pipelines and leaves room for further speedups and robustness enhancements, such as occlusion-aware retrieval using full-image descriptors.
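The control flow of the iterative Render+PnP-RANSAC procedure can be sketched as below. This is only an illustration of the loop structure. The three callables (`render_descriptors`, `match_descriptors`, `solve_pnp_ransac`), the `Pose` type, and the fixed iteration count are hypothetical placeholders introduced here, not names or settings taken from FaVoR.

```python
from typing import Callable, List, Sequence, Tuple

# Placeholder types (assumptions for this sketch, not FaVoR's data structures).
Pose = Tuple[float, ...]   # any pose parameterization
Match = Tuple[int, int]    # (landmark index, query keypoint index)


def relocalize(
    initial_pose: Pose,
    render_descriptors: Callable[[Pose], Sequence],         # hypothetical: voxel map -> descriptors
    match_descriptors: Callable[[Sequence], List[Match]],   # hypothetical: rendered vs. query features
    solve_pnp_ransac: Callable[[List[Match], Pose], Pose],  # hypothetical: 2D-3D matches -> pose
    num_iterations: int = 3,                                # assumed fixed budget
) -> Pose:
    """Alternate between rendering descriptors and re-estimating the pose."""
    pose = initial_pose
    for _ in range(num_iterations):
        # Render landmark descriptors as seen from the current pose estimate.
        rendered = render_descriptors(pose)
        # Match rendered descriptors against features extracted from the query image.
        matches = match_descriptors(rendered)
        if not matches:
            break  # no correspondences: keep the last estimate
        # Re-estimate the pose from the 2D-3D correspondences.
        pose = solve_pnp_ransac(matches, pose)
    return pose
```

Each pass renders descriptors from a pose that should be closer to the true one, so the rendered descriptors are expected to resemble the query features more closely and yield better correspondences for the next solve.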
Abstract
Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
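To make the descriptor-synthesis step concrete, the following minimal sketch composites per-sample descriptors along a single ray with the standard NeRF-style volume rendering quadrature. It assumes densities and descriptors have already been sampled from a landmark's voxel grid; the function name, argument layout, and the exact quadrature are assumptions for illustration and may differ from the paper's formulation.

```python
import math
from typing import List, Sequence, Tuple


def render_descriptor(
    densities: Sequence[float],             # sigma_i at each sample along the ray
    descriptors: Sequence[Sequence[float]],  # descriptor vector at each sample
    deltas: Sequence[float],                # distance between consecutive samples
) -> Tuple[List[float], List[float]]:
    """Alpha-composite per-sample descriptors along one ray.

    Returns the rendered descriptor and the per-sample compositing weights.
    """
    dim = len(descriptors[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, desc, delta in zip(densities, descriptors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * desc[k]
        transmittance *= 1.0 - alpha
    return rendered, weights
```

For example, with one empty sample followed by a nearly opaque one, the rendered descriptor is dominated by the opaque sample and everything behind it is ignored, which is how a density field lets the rendered descriptor depend on the viewing ray.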
