The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement

Gabriele Trivigno; Carlo Masone; Barbara Caputo; Torsten Sattler

TL;DR

The paper asks whether specialized, per-scene features are necessary for pose refinement in visual localization. It introduces MCLoc, a render&compare framework that uses generic, pre-trained dense features and a particle-filter optimizer to refine an initial pose without scene-specific training. The method combines a coarse-to-fine feature strategy across multiple levels, parallel particle beams, and low-to-high resolution rendering to converge robustly across large baselines and diverse scene representations. Experiments on indoor, outdoor, and large-scale datasets show competitive or superior performance relative to learned per-scene refiners, both as a standalone refiner and as a pre- or post-processing step. The authors release code to ease further experimentation and integration with existing localization pipelines.

Abstract

Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used to (1) obtain a more accurate pose estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, i.e., to provide a better starting point to a more expensive pose estimator, (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can be already achieved with more generic features. In this work, we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training. The code is at https://github.com/ga1i13o/mcloc_poseref
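The render-and-compare loop described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_render_features` is a hypothetical stand-in for rendering the scene at a pose and extracting pre-trained features, the pose is simplified to a 6-vector, and all hyperparameters (`n_particles`, `n_keep`, `sigma`, `decay`) are assumptions chosen for illustration.

```python
import numpy as np

def toy_render_features(pose):
    """Hypothetical stand-in for 'render the scene at `pose`, then extract features'.

    A real system would rasterize a mesh (or query another renderable scene
    representation) and run a pre-trained network on the rendered image.
    """
    freqs = np.arange(1, 9)[:, None]                      # shape (8, 1)
    return np.concatenate([np.sin(0.3 * freqs * pose).ravel(), pose])

def feature_distance(a, b):
    """L2 distance between two feature vectors (an assumed scoring function)."""
    return float(np.linalg.norm(a - b))

def refine_pose(query_feats, init_pose, n_iters=40, n_particles=64,
                n_keep=4, sigma=0.5, decay=0.9, seed=0):
    """Sample-and-select refinement in the spirit of a particle filter."""
    rng = np.random.default_rng(seed)
    beams = np.tile(init_pose, (n_keep, 1))               # parallel hypotheses
    for _ in range(n_iters):
        # Perturb every surviving hypothesis to obtain new candidates.
        noise = rng.normal(scale=sigma, size=(n_keep, n_particles, 6))
        candidates = (beams[:, None, :] + noise).reshape(-1, 6)
        # Keep the survivors too, so the best score never gets worse.
        candidates = np.vstack([beams, candidates])
        scores = np.array([
            feature_distance(toy_render_features(c), query_feats)
            for c in candidates
        ])
        beams = candidates[np.argsort(scores)[:n_keep]]   # lowest distance wins
        sigma *= decay                                    # narrow the search
    return beams[0]

# Toy usage: the "query" is simulated by rendering at a known pose.
true_pose = np.array([0.2, -0.1, 0.3, 1.0, -0.5, 0.8])
init_pose = true_pose + 0.3
query = toy_render_features(true_pose)
refined = refine_pose(query, init_pose)
```

The only scene-dependent component in this sketch is the renderer. The feature extractor and the optimizer need no per-scene training, which is the property the abstract emphasizes.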

Paper Structure (16 sections, 2 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related works
MCLoc
Pose alignment with Pre-trained features
Particle filter optimization
Adapting to different domains
Experiments
Implementation details
Experimental results
Conclusion
Scoring functions
Optimization hyperparameters
Convergence analysis
Inference cost
Comparison with PixLoc
...and 1 more section

Figures (8)

Figure 1: MCLoc localizes images with a render&compare strategy. Given a starting hypothesis, a particle filter is used to perturb it and sample new candidates, which are rendered, and compared to the query using generic pre-trained features.
Figure 2: Architecture of MCLoc. It exemplifies our iterative pose refinement. Given an initial pose estimate, we perturb it and render new candidates. Candidates are ranked based on dense, pixelwise feature similarity. As optimization progresses, we exploit the hierarchical properties of deep features by switching to shallower features, which are better for fine-grained comparison.
Figure 3: Convergence Basin in Optimization Space at Multiple Scales. We perturb rotation and translation for a query from Aachen and compute the dense, pixelwise feature distance at different depths. First row: rotating along the yaw and pitch axes. Second row: moving away from the GT along 3 random directions.
Figure 4: Robustness of the Convergence Basin to the Rendering Domain. We render images rotating along the yaw axis, using different meshes: Textured, Colored, and Raw Geometry, and evaluate the feature distance at different depths. The domain shift affects absolute values but not the basin shape.
Figure 5: Optimization trajectory. Behavior of median errors over the iterations for 2 scenes from Cambridge Landmarks.
...and 3 more figures
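
The captions of Figures 2 and 3 refer to a dense, pixelwise feature distance evaluated at different depths, with a switch to shallower features as optimization progresses. A minimal sketch of that idea follows. The cosine distance, the linear level schedule, and the names `dense_cosine_distance` and `level_for_iteration` are assumptions for illustration; the scoring function actually used in the paper may differ.

```python
import numpy as np

def dense_cosine_distance(feat_a, feat_b, eps=1e-8):
    """Mean per-pixel cosine distance between two (C, H, W) feature maps."""
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + eps)
    cos_sim = (a * b).sum(axis=0)          # (H, W) similarity per pixel
    return float((1.0 - cos_sim).mean())

def level_for_iteration(it, n_iters, n_levels):
    """Pick an index into a deepest-first list of feature levels.

    Early iterations use deep (coarse) features, late iterations use
    shallow (fine) ones. The linear schedule is an assumption.
    """
    return min(n_levels - 1, it * n_levels // n_iters)

# Toy usage with random maps standing in for network activations.
rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8, 8))
same = dense_cosine_distance(feat, feat)       # close to 0
opposite = dense_cosine_distance(feat, -feat)  # close to 2
```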
