Map-Relative Pose Regression for Visual Re-Localization

Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

TL;DR

marepo introduces a map-relative pose regression framework that couples a scene-specific geometry predictor with a scene-agnostic, transformer-based regressor to predict metric camera poses from a query image. The regressor relies on a dynamic positional encoding that fuses camera intrinsics with 2D-3D embeddings, so 6-DoF poses are estimated end to end. Its key contributions are the two-component architecture, the dynamic encoding scheme, and a training paradigm that scales across hundreds of scenes while needing only minutes of fine-tuning for a new environment. On indoor 7-Scenes and outdoor Wayspots, marepo substantially improves over prior APR methods and is competitive with geometry-based relocalization, while adapting quickly to new scenes and running in real time.
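To make the idea of an intrinsics-aware positional encoding more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of sinusoidal frequencies, and the fusion by element-wise addition are all hypothetical, and the learned layers that would follow in a real network are omitted. The only point it shows is that the 2D part of the encoding is computed from the camera ray of a pixel, so the same pixel yields a different encoding under different intrinsics.

```python
import math

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a unit-length ray in the camera frame."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def sinusoidal_embed(values, num_freqs=4):
    """Encode each scalar with sin/cos at frequencies 2^0 .. 2^(num_freqs-1)."""
    out = []
    for val in values:
        for k in range(num_freqs):
            freq = 2.0 ** k
            out.append(math.sin(freq * val))
            out.append(math.cos(freq * val))
    return out

def dynamic_positional_encoding(u, v, scene_xyz, intrinsics, num_freqs=4):
    """Fuse a ray-based 2D embedding with a 3D scene-coordinate embedding.

    intrinsics is (fx, fy, cx, cy). Fusion by addition is an assumption
    made for this sketch; a trained model would use learned projections.
    """
    fx, fy, cx, cy = intrinsics
    ray_embed = sinusoidal_embed(pixel_to_ray(u, v, fx, fy, cx, cy), num_freqs)
    xyz_embed = sinusoidal_embed(scene_xyz, num_freqs)
    return [a + b for a, b in zip(ray_embed, xyz_embed)]
```

Calling `dynamic_positional_encoding(100.0, 50.0, (1.0, 2.0, 3.0), (500.0, 500.0, 320.0, 240.0))` returns one feature vector per pixel; changing the focal length changes the vector even though the pixel and its scene coordinate stay the same.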

Abstract

Pose regression networks predict the camera pose of a query image relative to a known environment. Within this family of methods, absolute pose regression (APR) has recently shown promising accuracy in the range of a few centimeters in position error. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy, they require vast amounts of training data that, realistically, can only be created using novel view synthesis in a days-long process. This process has to be repeated for each new scene again and again. We present a new approach to pose regression, map-relative pose regression (marepo), that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relation between a scene-specific map representation and the camera pose. Our map-relative pose regressor can be applied to new map representations immediately or after mere minutes of fine-tuning for the highest accuracy. Our approach outperforms previous pose regression methods by far on two public datasets, indoor and outdoor. Code is available: https://nianticlabs.github.io/marepo

Paper Structure

This paper contains 33 sections, 7 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Camera pose estimation performance vs. mapping time. The figure shows the median translation error of several pose regression relocalization methods on the 7-Scenes dataset and the time required (proportional to the bubble size) to train each relocalizer on the target scenes. Our proposed approach, marepo, is superior by a wide margin on both metrics, thanks to its integration of scene-specific geometric map priors within an accurate, map-relative pose regression framework.
  • Figure 2: Illustration of the marepo network. A scene-specific geometry prediction module $\mathcal{G}_S$ processes a query image to predict a scene coordinate map $\hat{H}$. Then, a scene-agnostic map-relative pose regressor $\mathcal{M}$ is used to directly regress the camera pose. Our network's training and inference rely solely on RGB images $I$ and camera intrinsics $K$ without requiring depth information or pre-built point clouds.
  • Figure 3: The map-relative pose regressor $\mathcal{M}$ takes as input a tensor of predicted scene coordinate maps and the corresponding camera intrinsics, embeds the information with dynamic positional encoding into higher dimensional features, and finally estimates the camera poses $\hat{P}$. During training, we also predict $\hat{P}_0$ and $\hat{P}_1$ for intermediate supervision.
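
The captions above describe a scene coordinate map as the input to the pose regressor. As a rough illustration of what such a map contains, the sketch below builds one synthetically from a known pose: each entry is the 3D point, in the scene (map) frame, that a pixel observes. This follows the standard definition of scene coordinates and is not the paper's code; the function names, the constant-depth plane, and the subsampling stride of 8 are assumptions chosen for the example. In marepo the map is predicted from the RGB image by the geometry module, with no depth input, and the regressor then recovers the pose from it.

```python
def scene_coordinate(u, v, depth, intrinsics, rotation, translation):
    """Map pixel (u, v) at the given depth to a 3D point in the scene frame."""
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel into the camera frame.
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the camera-to-scene pose: y = R @ cam + t.
    return tuple(
        sum(rotation[r][c] * cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def scene_coordinate_map(width, height, stride, depth, intrinsics,
                         rotation, translation):
    """Build a subsampled grid of scene coordinates for a constant-depth plane."""
    return [
        [scene_coordinate(u, v, depth, intrinsics, rotation, translation)
         for u in range(0, width, stride)]
        for v in range(0, height, stride)
    ]

# Example: identity rotation, camera placed at (1, 2, 3) in the scene frame.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
coords = scene_coordinate_map(640, 480, 8, 2.0, INTRINSICS,
                              IDENTITY, [1.0, 2.0, 3.0])
```

Because every entry of the map is tied to the camera pose through this relation, a regressor that sees the map together with the intrinsics has enough information to estimate the pose relative to the scene map.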