R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

TL;DR

R-SCoRe revisits scene coordinate regression (SCR) for robust, large-scale visual localization. It introduces covisibility graph-based global encoding learning, data augmentation, and a depth-adjusted reprojection loss that promotes implicit triangulation, all integrated in a GLACE-inspired coarse-to-fine network with a refinement module. The approach achieves state-of-the-art SCR performance on challenging datasets such as Aachen Day-Night with a small map (47 MB) and up to a 10x improvement over previous SCR methods, approaching the accuracy of feature-matching (FM) methods at a fraction of the map size. Ablations demonstrate the value of multi-hypothesis testing, covisibility-informed encodings, and depth-aware supervision, supporting practical, scalable localization without 3D ground-truth supervision. The work suggests that further integration with generative models and model compression are promising directions for closing the remaining accuracy gap to FM methods.

Abstract

Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, SCR remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with similar map sizes and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code is available at: https://github.com/cvg/scrstudio

Paper Structure (24 sections, 11 equations, 8 figures, 10 tables)


Figures (8)

  • Figure 1: Robust Visual Localization with R-SCoRe. Left: Point cloud of Aachen reconstructed by R-SCoRe. Right: On the large-scale Aachen Day-Night dataset [Sattler2012BMVC, Sattler2018CVPR] using only daytime training images, R-SCoRe achieves 64.3% accuracy under the (0.25m, 2°) threshold for nighttime query images. It outperforms all previous SCR methods (circles) by a large margin. With a small map size of only 47MB at a comparable accuracy, R-SCoRe is an attractive alternative to traditional methods (triangles).
  • Figure 2: R-SCoRe pipeline. (a) Following the SCR workflow in [glace2024cvpr], we concatenate patch-level local encodings with image-level global encodings as input to a scene-specific MLP. (b) We learn contrastive global encodings from the covisibility graph using Node2Vec [node2vec-kdd2016]. During training, global encodings are sampled from neighboring nodes for data augmentation. During inference, we retrieve global encodings from the $k$ nearest training images via NetVLAD [arandjelovic16netvlad] as hypotheses and select the one yielding the most RANSAC inliers. We enhance the SCR MLP with a refinement module and introduce a depth-adjusted reprojection loss to reduce bias toward distant points.
  • Figure 3: Comparison of global encodings. Aligning the learning of global encodings with the covisibility graph topology (Node2Vec [node2vec-kdd2016]) helps distinguish covisible and non-covisible pairs (a) and predict covisibility by feature distance (b).
  • Figure 4: Statistics of reprojection error for points with different depths. The kernel density estimation (KDE) of the reprojection error distribution conditioned on disparity from SCR models trained with various local encodings across different datasets. We observe that far points (low disparity) exhibit a lower reprojection error. (Detector-free LoFTR [sun2021loftr] with an 8$\times$ downsampled output has a larger 2D keypoint error than detector-based DeDoDe [edstedt2024dedode].)
  • Figure 5: Ablation study of depth distribution after training with different supervision methods. Our depth-adjusted supervision yields a depth distribution that matches the ground-truth depth more closely than the original supervision does.
  • ...and 3 more figures
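
The global-encoding handling described in the Figure 2 caption (sampling encodings from covisible neighbors during training, and testing the $k$ retrieved encodings as hypotheses at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: all function and parameter names are hypothetical, plain Euclidean distance stands in for NetVLAD retrieval, the fallback to an image's own encoding when it has no neighbors is an assumption, and `count_inliers` is a placeholder for running the SCR network followed by RANSAC pose estimation.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sample_neighbor_encoding(
    image_id: int,
    covis_neighbors: Dict[int, List[int]],
    global_encodings: Dict[int, Vector],
    rng: random.Random,
) -> Vector:
    """Training-time augmentation: use the encoding of a covisible neighbor.

    Assumption: if an image has no neighbors, fall back to its own encoding.
    """
    neighbors = covis_neighbors.get(image_id) or [image_id]
    return global_encodings[rng.choice(neighbors)]


def retrieve_k_nearest(
    query_desc: Vector, train_descs: Dict[int, Vector], k: int
) -> List[int]:
    """Return the ids of the k training images closest to the query.

    Euclidean distance on toy descriptors stands in for NetVLAD retrieval.
    """
    ranked = sorted(train_descs, key=lambda i: math.dist(query_desc, train_descs[i]))
    return ranked[:k]


def select_best_hypothesis(
    query_desc: Vector,
    train_descs: Dict[int, Vector],
    global_encodings: Dict[int, Vector],
    k: int,
    count_inliers: Callable[[Vector], int],
) -> Tuple[int, int]:
    """Try each retrieved global encoding and keep the one with most inliers.

    count_inliers is a hypothetical callable: given a global encoding, it
    would run scene coordinate regression plus RANSAC and return the
    inlier count of the resulting pose.
    """
    best_id, best_inliers = -1, -1
    for image_id in retrieve_k_nearest(query_desc, train_descs, k):
        inliers = count_inliers(global_encodings[image_id])
        if inliers > best_inliers:
            best_id, best_inliers = image_id, inliers
    return best_id, best_inliers
```

In this sketch the retrieval step only proposes candidates; the final choice is made by the geometric check, so a poor retrieval ranking can still be corrected as long as a suitable encoding appears among the $k$ candidates.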