Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, Yanchao Yang

TL;DR

Reloc3r addresses robust visual localization by pairing a symmetric relative pose regression (RPR) network, trained at large scale, with a minimalist motion averaging module that computes absolute poses. Trained on around eight million image pairs across object-centric, indoor, and outdoor scenes, the approach runs in real time and generalizes well across six public datasets. The key design choices are the fully symmetric ViT-based RPR network, a non-trainable motion averaging module, and relative poses whose translations are learned as scale-free directions. Empirical results show state-of-the-art or competitive performance in both relative pose estimation and absolute pose localization, with significant speed advantages over many baselines, illustrating the practicality of large-scale training for pose regression.
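Because the relative translation is only recovered up to scale, a predicted pose is naturally compared to ground truth through angles rather than distances. The sketch below is a minimal illustration of that idea, not the authors' evaluation code: the function names `rotation_angle_deg` and `translation_angle_deg` are hypothetical, and rotations are assumed to be 3x3 nested lists.

```python
import math

def rotation_angle_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))

def translation_angle_deg(t_est, t_gt):
    """Angle in degrees between two translation directions (scale ignored)."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Same direction at a different scale gives zero angular error.
print(translation_angle_deg([0.0, 0.0, 2.0], [0.0, 0.0, 5.0]))  # 0.0
```

Accuracy summaries such as AUC@5 are typically built on top of per-pair angular errors like these; how exactly the two angles are combined is an evaluation-protocol detail this summary does not specify.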

Abstract

Visual localization aims to determine the camera pose of a query image relative to a database of posed images. In recent years, deep neural networks that directly regress camera poses have gained popularity due to their fast inference capabilities. However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. It consists of an elegantly designed relative pose regression network, and a minimalist motion averaging module for absolute pose estimation. Trained on approximately eight million posed image pairs, Reloc3r achieves surprisingly good performance and generalization ability. We conduct extensive experiments on six public datasets, consistently demonstrating the effectiveness and efficiency of the proposed method. It provides high-quality camera pose estimates in real time and generalizes to novel scenes. Code: https://github.com/ffrivera0/reloc3r.

Paper Structure

This paper contains 16 sections, 4 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of pose accuracy and runtime efficiency. We report the AUC@5 and image pairs per second (FPS) on the ScanNet1500 [dai2017scannet, sarlin2020superglue] dataset. We provide two versions of Reloc3r: one trained and tested on image widths of 512, and another on 224. The proposed Reloc3r-512 outperforms all other methods, achieving the best AUC@5 while maintaining an efficiency of 24 FPS. Remarkably, even at 224 resolution, our method matches ROMA [edstedt2024roma] in accuracy while being 20$\times$ faster.
  • Figure 2: Reloc3r consists of two modules: a relative camera pose regression network (Sec. \ref{sec:rpr}) and a motion averaging module (Sec. \ref{sec:ma}). Given a pair of input images, the network module infers the relative camera pose (at an unknown scale) between them. This module consists of a two-branch Vision Transformer (ViT) with shared weights. The images are divided into patches, converted to tokens, and embedded as latent features through separate encoders. Decoders then exchange information between the two sets of latent features. Each head aggregates its latent features to estimate a relative camera pose. To determine the absolute camera pose of a query image relative to a database, we retrieve at least two database-query pairs. These pairs are first processed by the network for relative pose estimation. Subsequently, the motion averaging module computes the absolute metric pose by aggregating the relative estimates.
  • Figure 3: We visualize pose estimates for two scenes: Chess from the 7 Scenes dataset [shotton2013scene] and KingsCollege from Cambridge Landmarks [kendall2015posenet]. We compare Reloc3r's results with those of the most closely related RPR methods: ExReNet [winkelbauer2021learning] and Map-free [arnold2022map]. We can observe that Reloc3r's pose estimates align more closely with the ground-truth poses.
  • Figure 4: The top row showcases matches from Efficient LoFTR, while the bottom row displays the top-3 cross-attention responses from Reloc3r's decoder. We observe that the correlated regions in Reloc3r are superior to those of Efficient LoFTR, even though Reloc3r is trained solely with pose supervision.
  • Figure 5: Our pose regression network encounters failure cases when significant changes in focal length occur. As shown in the figure, there are 3$\times$ to 4$\times$ zoom in / out effects. While rotation estimates remain largely unaffected, translation becomes noticeably inaccurate. This issue is similar to the scale-distance ambiguity problem in two-view geometry.
  • ...and 4 more figures
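
The Figure 2 caption notes that at least two database-query pairs are needed to obtain an absolute metric pose from scale-free relative estimates. One way to see why is the sketch below, which intersects lines in a least-squares sense: each database camera contributes a line through its known center along the estimated direction to the query camera, and the query center is the point closest to all lines. This is a generic line-intersection solver, assumed for illustration and not taken from the paper; the names `triangulate_center` and `solve3` are hypothetical, and directions are assumed to be already expressed in the world frame.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("lines are (near) parallel; center is not determined")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def triangulate_center(origins, directions):
    """Point minimizing the summed squared distance to the given lines.

    origins:    known database camera centers (world frame, metric).
    directions: estimated directions toward the query camera; only the
                direction matters, so any scale (or sign) is accepted.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        d = normalize(d)
        for i in range(3):
            for j in range(3):
                # P = I - d d^T projects onto the plane orthogonal to d.
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p
                b[i] += p * c[j]
    return solve3(A, b)

# Two database cameras whose lines meet at a single point.
center = triangulate_center(
    origins=[[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    directions=[[0.5, 1.0, 1.5], [-6.0, 4.0, 6.0]],  # arbitrary scales
)
print(center)
```

With a single pair the system is rank-deficient, since any point along the one line fits equally well, and the solver raises `ValueError`. The metric scale of the result comes entirely from the known database camera centers, not from the network's translation magnitude.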