ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

Leonard Bruns, Axel Barroso-Laguna, Tommaso Cavallari, Áron Monszpart, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann

TL;DR

ACE-G tackles the generalization gap in scene coordinate regression by decoupling a scene-agnostic transformer-based coordinate regressor from a scene-specific map code. It introduces a dual pre-training regime that alternates between mapping (train both the regressor and map codes) and query (train only the regressor with fixed codes) to promote cross-scene generalization, and validates this with large-scale multi-scene pre-training and novel-scene mapping. The approach, built on a DINOv2-based image encoder and cross-attention transformer, achieves robust pose estimation across challenging indoor and outdoor datasets and outperforms several SCR baselines, while maintaining modest map-code sizes and reasonable mapping times. This framework advances practical, scalable learning-based relocalization by enabling generalization to unseen views and varying conditions, reducing reliance on per-scene optimization and expensive global reconstructions.

Abstract

Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
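
The alternation between mapping and query iterations described above can be illustrated with a deliberately tiny toy problem. This is a sketch under stated assumptions, not the authors' implementation: the "regressor" is a single shared scalar weight `w`, each scene's "map code" is a single scalar offset, and "mapping" and "query" views are simply inputs drawn from different ranges. The function name `pretrain`, the learning rate, and the synthetic targets are all hypothetical. The only point is which parameters receive updates in each kind of iteration.

```python
import random

def pretrain(offsets, steps=4000, lr=0.1, seed=0):
    """Toy alternating pre-training (illustrative assumption, not the paper's code).

    Model: y_hat = w * x + codes[scene]
      w      -> shared, scene-agnostic parameter (stand-in for the regressor)
      codes  -> one scalar per scene (stand-in for the scene-specific map code)
    """
    rng = random.Random(seed)
    w = 0.0
    codes = [0.0] * len(offsets)
    for step in range(steps):
        scene = rng.randrange(len(offsets))
        mapping_iter = (step % 2 == 0)  # alternate mapping / query iterations
        # Mapping views and query views come from different input ranges,
        # mimicking query conditions that differ from the mapping conditions.
        x = rng.uniform(-1.0, 0.0) if mapping_iter else rng.uniform(0.0, 1.0)
        target = 2.0 * x + offsets[scene]  # synthetic ground truth
        err = (w * x + codes[scene]) - target
        # The shared weight is updated in both kinds of iteration.
        w -= lr * err * x
        # The map code is updated only in mapping iterations;
        # in query iterations it stays fixed.
        if mapping_iter:
            codes[scene] -= lr * err
    return w, codes

true_offsets = [0.5, -1.0, 3.0]
w, codes = pretrain(true_offsets)
```

In this toy setting the shared weight is fitted on both splits while each code only ever sees its scene's mapping data, which mirrors the division of labour between the generic regressor and the scene-specific map code.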

Paper Structure

This paper contains 38 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: ACE-G is trained with separate mapping and query splits, explicitly optimizing scene coordinate regression for unseen views, including changing conditions. The estimated scene coordinates for the query image and the estimated and ground-truth camera poses are shown. ACE-G estimates less noisy coordinates, resulting in a more accurate pose estimate than ACE, which degrades under larger viewpoint or scene-condition changes.
  • Figure 2: High-level comparison to ACE. The MLP in ACE is replaced by a coordinate regressor and a scene-specific map code.
  • Figure 3: Overview of ACE-G. The coordinate regressor is pre-trained by alternating between mapping iterations (a) and query iterations (b). (a) During pre-training mapping iterations, both map codes and network weights are optimized using precomputed buffers storing shuffled features and metadata necessary for supervision (cf. Brachmann et al., 2023). For each mapping buffer, a corresponding map code is optimized. (b) During pre-training query iterations, the map codes are kept fixed and the network is trained to estimate scene coordinates for query buffers made up of viewpoints or scene conditions different from those of the mapping buffers. (c) Once the regressor has been pre-trained, a novel scene can be encoded in a new map code by minimizing the reprojection error. (d) Given such an optimized map code and a new query image, scene coordinates and uncertainty can be estimated via a forward pass. The resulting 2D-3D correspondences can then be used to estimate the camera pose.
  • Figure 4: Network architecture. The scene-agnostic coordinate regressor consists of $N$ cross-attention-only blocks. Given a patch embedding $\boldsymbol{e}$ and a map code $\mathcal{C}$ it estimates the 3D scene coordinate $\boldsymbol{y}$ and uncertainty $\sigma_{\boldsymbol{y}}$.
  • Figure 5: Scene reconstruction. Comparing the estimated scene coordinates of ACE (left) and ACE-G (right) on Scene 1 of Indoor-6 shows that ACE fails to reconstruct some parts of the apartment.
  • ...and 4 more figures
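
The data flow in the Figure 4 caption (a patch embedding attends to a map code through cross-attention-only blocks, then a head outputs a 3D coordinate and an uncertainty) can be sketched in plain Python. This is a minimal illustration under strong assumptions, not the paper's architecture: it uses a single attention head, no learned query/key/value projections, no normalization or feed-forward layers, random weights, and a softplus on the last output to keep the uncertainty positive. The names `cross_attention` and `regress`, the dimensions, and the number of code tokens are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention(query, code_tokens):
    """One query vector attends over the map-code tokens (keys = values = tokens)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
              for tok in code_tokens]
    weights = softmax(scores)
    attended = [sum(w * tok[i] for w, tok in zip(weights, code_tokens))
                for i in range(d)]
    return attended, weights

def regress(embedding, code_tokens, head, num_blocks=2):
    """Stack cross-attention blocks with residuals, then apply a 4 x d linear head."""
    x = list(embedding)
    for _ in range(num_blocks):
        attended, _ = cross_attention(x, code_tokens)
        x = [a + b for a, b in zip(x, attended)]  # residual connection
    out = [sum(h * v for h, v in zip(row, x)) for row in head]
    coord = out[:3]                        # 3D scene coordinate
    sigma = math.log1p(math.exp(out[3]))   # softplus -> positive uncertainty
    return coord, sigma

rng = random.Random(0)
dim, num_tokens = 8, 5                     # hypothetical sizes
e = [rng.uniform(-1, 1) for _ in range(dim)]
C = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]
head = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]

coord, sigma = regress(e, C, head)
_, attn = cross_attention(e, C)
```

Because the patch embedding only ever reads from the map code, swapping `C` for another scene's code changes the predicted coordinates without touching the shared weights, which is the separation the caption describes.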