The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
Gabriele Trivigno, Carlo Masone, Barbara Caputo, Torsten Sattler
TL;DR
The paper asks whether pose refinement in visual localization really requires specialized, per-scene features. It introduces MCLoc, a render-and-compare framework that refines an initial pose using generic, pre-trained dense features and a particle-filter optimizer, with no scene-specific training. To converge robustly across large baselines and diverse scene representations, the method combines a coarse-to-fine strategy over multiple feature levels, parallel particle beams, and low-to-high resolution rendering. Experiments on indoor, outdoor, and large-scale datasets show performance competitive with or superior to learned per-scene refiners, both as a standalone refiner and as a pre- or post-processing step. The results indicate that off-the-shelf features are a practical, generalizable basis for measuring pose similarity, and the released code supports further experimentation and integration with existing localization pipelines.
Abstract
Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used (1) to obtain a more accurate pose estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, i.e., to provide a better starting point for a more expensive pose estimator, and (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features or scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features or representations is truly necessary, or whether similar results can already be achieved with more generic features. In this work, we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training. The code is available at https://github.com/ga1i13o/mcloc_poseref
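To make the render-and-compare idea concrete, the following is a minimal toy sketch of refining a pose with a particle-style search over a feature distance. It is not the MCLoc implementation: `render`, `extract_features`, `score`, and `refine_pose` are hypothetical names, the "renderer" and "pre-trained features" are fixed random maps standing in for a real scene representation and a frozen network, the pose is a plain 6-vector, and the beams are simplified to a pooled top-k selection rather than independent parallel beams. Coarse-to-fine feature levels and resolution schedules are omitted; only the shrinking sampling noise hints at them.

```python
import numpy as np

def render(pose, basis):
    """Toy stand-in for a renderer: maps a 6-DoF pose vector to a flat 'image'."""
    return np.tanh(basis @ pose)

def extract_features(image, proj):
    """Toy stand-in for a frozen, pre-trained dense feature extractor."""
    return np.maximum(proj @ image, 0.0)

def score(pose, query_feat, basis, proj):
    """Feature distance between the query and a rendering at `pose` (lower is better)."""
    rendered_feat = extract_features(render(pose, basis), proj)
    return float(np.linalg.norm(rendered_feat - query_feat))

def refine_pose(init_pose, query_feat, basis, proj, n_steps=30,
                n_particles=64, n_beams=4, sigma=0.5, decay=0.9, seed=0):
    """Sample perturbed poses around the current beams and keep the best ones."""
    rng = np.random.default_rng(seed)
    dim = init_pose.size
    beams = np.repeat(init_pose[None, :], n_beams, axis=0)
    for _ in range(n_steps):
        noise = rng.normal(0.0, sigma, size=(n_beams, n_particles, dim))
        particles = (beams[:, None, :] + noise).reshape(-1, dim)
        # Keep the unperturbed beams as candidates so the best score never gets worse.
        candidates = np.concatenate([beams, particles])
        scores = np.array([score(c, query_feat, basis, proj) for c in candidates])
        beams = candidates[np.argsort(scores)[:n_beams]]
        sigma *= decay  # shrink the search radius over time
    return beams[0]

# Usage on synthetic data (all values here are made up for illustration).
rng = np.random.default_rng(1)
basis = 0.3 * rng.normal(size=(128, 6))
proj = rng.normal(size=(32, 128))
true_pose = rng.normal(size=6)
init_pose = true_pose + 0.3  # coarse initial estimate, e.g. from retrieval
query_feat = extract_features(render(true_pose, basis), proj)

refined = refine_pose(init_pose, query_feat, basis, proj)
init_score = score(init_pose, query_feat, basis, proj)
refined_score = score(refined, query_feat, basis, proj)
```

Because the current beams are always kept among the candidates, the best feature distance cannot increase from one step to the next. Whether a lower feature distance also means a lower pose error depends on how well the features reflect pose similarity, which is the question the paper investigates.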
