Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

Son Tung Nguyen, Tobias Fischer, Alejandro Fontan, Michael Milford

TL;DR

The paper tackles perceptual aliasing in visual localization by introducing a neural aggregator that learns geometrically-consistent global descriptors, aligning visual similarity with covisibility structure. A batch-mining-based training scheme and a modified Generalized Contrastive Loss (mGCL) enable training without manual place labels and improve robustness to noisy covisibility graphs. The method achieves notable gains over R-Score on challenging benchmarks (e.g., Aachen Day/Night, Hyundai Department Store) while maintaining low memory overhead, narrowing the gap to traditional structure-based methods. Overall, the work demonstrates that jointly optimized local and global descriptors, guided by dual geometric-visual consistency, substantially enhance scene coordinate regression (SCR) performance in large-scale, alias-prone environments.

Abstract

Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr.
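To make the training signal described in the abstract more concrete, the following is a minimal NumPy sketch of a contrastive loss with graded similarity, where each image pair is weighted by its overlap score. It follows the commonly used Generalized Contrastive Loss form and is not the paper's modified variant (mGCL), whose exact definition is not given in this summary. The function name `generalized_contrastive_loss`, the `margin` value, and the assumption that overlap scores lie in $[0, 1]$ are illustrative choices, not taken from the authors' code.

```python
import numpy as np


def generalized_contrastive_loss(desc_a, desc_b, overlap, margin=0.5):
    """Graded contrastive loss over N descriptor pairs (illustrative sketch).

    desc_a, desc_b: (N, D) arrays of global descriptors.
    overlap: (N,) array of overlap scores, assumed to lie in [0, 1].
    margin: hypothetical margin for pushing low-overlap pairs apart.
    """
    # L2-normalize so distances are comparable across pairs.
    desc_a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    desc_b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)

    # Euclidean distance between the two descriptors of each pair.
    dist = np.linalg.norm(desc_a - desc_b, axis=1)

    # High-overlap pairs are pulled together ...
    pull = 0.5 * overlap * dist**2
    # ... low-overlap pairs are pushed apart until they exceed the margin.
    push = 0.5 * (1.0 - overlap) * np.maximum(margin - dist, 0.0) ** 2

    return float(np.mean(pull + push))
```

With this form, a pair with overlap 1 contributes only the pull term and a pair with overlap 0 contributes only the push term; intermediate overlap scores blend the two, which is what lets such a loss train from overlap scores instead of binary place labels.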

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Performance overview on the Aachen Day/Night dataset. Our method achieves significant improvements over existing learning-based approaches while maintaining comparable memory efficiency. Compared to R-Score \cite{rscore}, we achieve $6.1\%$ higher accuracy on night-time images at the $0.25\text{m}$ / $2^\circ$ threshold and a $2.2\%$ average improvement across all evaluation thresholds (detailed results in Table \ref{tab:aachen}). This performance narrows the gap between learning-based and traditional structure-based methods while preserving the memory advantages of coordinate regression approaches.
  • Figure 2: System overview. Our pipeline consists of four main components: (1) local descriptor extraction using DeDoDe \cite{edstedt2024dedode}, (2) DINO \cite{dino} feature extraction for visual representation, (3) our proposed aggregator module, which learns geometrically-consistent global descriptors from DINO features using dual consistency constraints, and (4) a scene coordinate regression model that predicts 3D coordinates from concatenated local-global descriptors.
  • Figure 3: Median translation error. We plot the median translation error of the top-$k$ retrievals for all the training images. The results show that our learned global descriptors (in green) offer more relevant retrievals than the node2vec graph embeddings (in blue) of R-Score \cite{rscore}. We also plot the ground-truth errors (in red) for reference.
  • Figure 4: Qualitative results. We visualize the top-$5$ retrievals for three images from the training set of the Aachen Day/Night dataset \cite{sattler2018benchmarking}. For each query (left-most column), we show results using graph embeddings from R-Score \cite{rscore} (top) and our learned global descriptors (bottom). Above each retrieval, we plot the translation error and the overlap score between the retrieval and the query. Under heavy noise in the covisibility graph (evidenced by low overlap scores), the graph embeddings retrieve nearby but not exact structures, while our global descriptors retrieve nearby and relevant ones.
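
The retrieval metric plotted in Figure 3 can be approximated with a short NumPy sketch. It assumes retrieval by cosine similarity between global descriptors, camera positions given as 3D vectors, and a single median taken over all query-retrieval pairs; the paper's exact aggregation is not stated in this summary, and the function name `median_topk_translation_error` is hypothetical.

```python
import numpy as np


def median_topk_translation_error(descriptors, positions, k):
    """Median camera-position distance between each image and its top-k retrievals.

    descriptors: (N, D) array of global descriptors, one per training image.
    positions: (N, 3) array of ground-truth camera positions.
    k: number of retrievals per query image (the image itself is excluded).
    """
    # Cosine similarity between all pairs of images.
    desc = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = desc @ desc.T

    # An image must not retrieve itself.
    np.fill_diagonal(sim, -np.inf)

    # Indices of the k most similar images for every query, shape (N, k).
    topk = np.argsort(-sim, axis=1)[:, :k]

    # Translation error between each query and each of its retrievals, shape (N, k).
    errors = np.linalg.norm(positions[:, None, :] - positions[topk], axis=2)

    return float(np.median(errors))
```

Running the same function on two sets of descriptors (for example, learned global descriptors and graph embeddings) over a range of `k` values gives curves that can be compared in the manner of Figure 3.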