GLACE: Global Local Accelerated Coordinate Encoding

Fangjinhua Wang, Xudong Jiang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

TL;DR

GLACE tackles visual localization in large-scale scenes with scene coordinate regression (SCR) trained without ground-truth 3D models. It combines co-visibility-aware global encodings with local SCR features through a feature-diffusion mechanism, which enables effective implicit triangulation. A multimodal position decoder replaces the unimodal center prior and uses cluster centers of the training camera poses to predict coordinates accurately across large scenes. GLACE achieves state-of-the-art performance on Cambridge landmarks and Aachen Day-Night with a single compact model and no depth maps, outperforming ensemble SCR baselines while keeping the map size small.

Abstract

Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes that are further amplified in the absence of ground truth 3D point clouds for supervision. Here, the model can only rely on reprojection constraints and needs to implicitly triangulate the points. The challenges stem from a fundamental dilemma: The network has to be invariant to observations of the same landmark at different viewpoints and lighting conditions, etc., but at the same time discriminate unrelated but similar observations. The latter becomes more relevant and severe in larger scenes. In this work, we tackle this problem by introducing the concept of co-visibility to the network. We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network. Specifically, we propose a novel feature diffusion technique that implicitly groups the reprojection constraints with co-visibility and avoids overfitting to trivial solutions. Additionally, our position decoder parameterizes the output positions for large-scale scenes more effectively. Without using 3D models or depth maps for supervision, our method achieves state-of-the-art results on large-scale scenes with a low-map-size model. On Cambridge landmarks, with a single model, we achieve 17% lower median position error than Poker, the ensemble variant of the state-of-the-art SCR method ACE. Code is available at: https://github.com/cvg/glace.
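
The data flow described in the abstract (local encodings, noisy global encodings, an MLP head, and a position decoder) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, feature dimensions, noise level `sigma`, the re-normalization after adding noise, and in particular the decoder form (softmax-weighted cluster centers plus a predicted offset) are all assumptions made for this example. The MLP weights are random and untrained.

```python
import numpy as np

def diffuse_global(global_feats, image_idx, sigma, rng):
    """Look up each sample's global encoding by image index and add Gaussian noise."""
    g = global_feats[image_idx]                       # (B, Dg)
    g = g + sigma * rng.standard_normal(g.shape)      # feature diffusion
    # Assumption: project back to the unit sphere after adding noise.
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def mlp_head(x, params):
    """Tiny two-layer MLP with ReLU, standing in for the real head."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def decode_position(logits, offsets, centers):
    """Hypothetical multimodal decoder: softmax-weighted cluster center plus offset."""
    z = logits - logits.max(axis=1, keepdims=True)    # stabilize the softmax
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)                 # (B, K) weights over clusters
    return w @ centers + offsets                      # (B, 3) scene coordinates

rng = np.random.default_rng(0)
num_images, B, Dl, Dg, K, H = 5, 8, 16, 4, 3, 32      # illustrative sizes only

local_feats = rng.standard_normal((B, Dl))            # sampled local encodings
image_idx = rng.integers(0, num_images, size=B)       # source image of each sample
global_feats = rng.standard_normal((num_images, Dg))  # one global encoding per image
global_feats /= np.linalg.norm(global_feats, axis=1, keepdims=True)
centers = 50.0 * rng.standard_normal((K, 3))          # stand-in for pose cluster centers

params = (0.1 * rng.standard_normal((Dl + Dg, H)), np.zeros(H),
          0.1 * rng.standard_normal((H, K + 3)), np.zeros(K + 3))

g = diffuse_global(global_feats, image_idx, sigma=0.1, rng=rng)
x = np.concatenate([local_feats, g], axis=1)          # global + local input to the MLP
out = mlp_head(x, params)
coords = decode_position(out[:, :K], out[:, K:], centers)
print(coords.shape)  # (8, 3)
```

Because the noise is applied to the global encoding only, samples from images with nearby global features receive overlapping inputs, which is one way to read the abstract's statement that feature diffusion groups reprojection constraints by co-visibility.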

Paper Structure (16 sections, 8 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Left: Quantitative comparison of map size and position error with state-of-the-art SCR methods brachmann2017dsac, brachmann2023accelerated on Cambridge landmarks kendall2015posenet. Our method outperforms DSAC brachmann2017dsac, ACE brachmann2023accelerated and Poker (4 ACE models) with a moderate model size. Right: Relationship between map size and position error. Note that our method with the smallest map size (3.2 MB) still performs better than Poker (4 ACE models, map size is 16.0 MB).
  • Figure 2: Pipeline of GLACE. Besides the buffer of ACE brachmann2023accelerated local encodings, we extract global features of the training images with an image retrieval model r2former. During training, we sample a batch of local encodings, look up their global encodings according to their image index, and perform feature diffusion by adding Gaussian noise. The global and local encodings are concatenated as input to an MLP head. The output of the MLP is further processed by a position decoder to yield the final coordinate predictions. The global encoding with feature diffusion facilitates the grouping of reprojection constraints, enabling effective implicit triangulation in large-scale scenes. Best viewed when zoomed in.
  • Figure 3: Distribution of angular feature distance ($^\circ$), conditioned on co-visibility. Two images are considered co-visible if the number of co-visible points $n$ reaches at least a threshold $N$. The x-axis depicts the angular distance $d$ in degrees (left: $N=15$, right: $N=100$).
  • Figure 4: Distribution of co-visibility conditioned on the angular feature distance ($^\circ$). Two images are considered co-visible if the number of co-visible points $n$ reaches at least a threshold $N$. The x-axis depicts the angular threshold $D$ (left: $N=15$, right: $N=100$).
  • Figure 5: Comparison between decoder outputs for random Gaussian input samples. We use 50 cluster centers in this example of the Aachen dataset, shown in the top left (cluster assignments are color-coded, and cluster centers are marked with red stars).
  • ...and 6 more figures
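
The quantities plotted in Figures 3 and 4 can be illustrated with a small sketch. It computes the pairwise angular distance $d$ (in degrees) between global feature vectors and, for a given distance threshold $D$, the fraction of image pairs that are co-visible, meaning they share at least $N$ points. The function names, the toy features, and the toy shared-point counts are made up for this example. Conditioning on $d \le D$ is an assumption about how the threshold is applied.

```python
import numpy as np

def angular_distance_deg(feats):
    """Pairwise angular distance in degrees between row feature vectors."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = np.clip(f @ f.T, -1.0, 1.0)          # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos))

def covisible_fraction(dist_deg, shared_points, N, D):
    """Fraction of image pairs with distance <= D that share at least N points."""
    iu = np.triu_indices_from(dist_deg, k=1)   # count each unordered pair once
    d, n = dist_deg[iu], shared_points[iu]
    close = d <= D
    if not close.any():
        return float("nan")                    # no pair falls below the threshold
    return float((n[close] >= N).mean())

# Toy data: three images with 2-D global features and hypothetical shared-point counts.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
shared = np.array([[0, 5, 120],
                   [5, 0, 40],
                   [120, 40, 0]])

dist = angular_distance_deg(feats)
print(np.round(dist, 1))
print(covisible_fraction(dist, shared, N=15, D=50.0))
print(covisible_fraction(dist, shared, N=100, D=50.0))
```

In this toy case, images 0 and 1 are 90 degrees apart and each is 45 degrees from image 2. With $D = 50$, only the two 45-degree pairs count as close, so raising $N$ from 15 to 100 lowers the co-visible fraction from 1.0 to 0.5.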