GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization

Jingxing Li, Yongjae Lee, Deliang Fan

TL;DR

GeLoc3r tackles the fundamental speed-accuracy gap in relative camera pose estimation by introducing Geometric Consistency Regularization (GCR), which uses privileged depth information during training to enforce geometric relationships in a regression-based pose estimator. The method preserves fast regression at inference (approximately 33 ms) while transferring geometric knowledge from a weighted, RANSAC-guided supervision pipeline into the network through a FusionTransformer-guided weighting of dense 3D-2D correspondences. Key contributions include a training-time geometric supervision framework (GCR) that combines direct pose consistency with a weighted RANSAC solver and an indirect descriptor-consistency signal, plus frozen MASt3R descriptor heads to anchor geometry. Experiments across CO3Dv2, RealEstate10K, MegaDepth1500, and unseen visual localization tasks demonstrate consistent improvements over ReLoc3R, achieving state-of-the-art regression performance with strong robustness and maintaining real-time inference, signaling a practical path to closing the gap with dense, correspondence-based methods.

Abstract

The prior method ReLoc3R achieves breakthrough performance with fast 25 ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent it from reaching the precision ceiling of correspondence-based methods like MASt3R (which require 300 ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically consistent poses via weighted RANSAC. This creates a consistency loss that transfers geometric knowledge into the regression network. Unlike the FAR method, which requires both regression and geometric solving at inference, GeLoc3r uses only the enhanced regression head at test time, maintaining ReLoc3R's fast speed and approaching MASt3R's high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, with improvements including 40.45% vs. 34.85% AUC@5° on the CO3Dv2 dataset (16% relative improvement), 68.66% vs. 66.70% AUC@5° on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.
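The abstract describes a consistency loss between the regressed pose and the solver pose but does not state its exact form. As a rough illustration only, the sketch below assumes a common choice: the geodesic angle between the two rotations plus the angle between the two translation directions. The function names and the weights `w_rot` and `w_trans` are hypothetical and are not taken from the paper.

```python
import math

def rotation_angle_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise product summed over all entries.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))

def translation_angle_deg(t_a, t_b):
    """Angle in degrees between two translation directions (scale-free)."""
    dot = sum(a * b for a, b in zip(t_a, t_b))
    norm = math.sqrt(sum(a * a for a in t_a)) * math.sqrt(sum(b * b for b in t_b))
    if norm == 0.0:
        return 0.0  # direction undefined for a zero translation
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def consistency_loss(R_reg, t_reg, R_sol, t_sol, w_rot=1.0, w_trans=1.0):
    """Assumed loss: weighted sum of rotation and translation-direction angles
    between the regression pose and the solver pose (weights are hypothetical)."""
    return (w_rot * rotation_angle_deg(R_reg, R_sol)
            + w_trans * translation_angle_deg(t_reg, t_sol))
```

In an actual training loop this would be written with differentiable tensor operations; the plain-Python form here only shows the quantity being penalized.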

Paper Structure

This paper contains 21 sections, 9 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Cosine Similarity Error Analysis on MegaDepth1500. Top row: Input image pair with overlapping regions highlighted (green in Image 1, blue in Image 2). Bottom row: Cosine similarity error maps, obtained by projecting MASt3R's pixel-wise descriptor features from Image 2 onto Image 1 using predicted poses of ReLoc3R and GeLoc3r. ReLoc3R (left) exhibits widespread high errors (yellow regions), while GeLoc3r (middle) achieves significantly lower errors through geometric consistency regularization. The last plot shows GeLoc3r reduces mean error from 0.520 to 0.421, shifting the distribution toward lower values and mitigating pose regression inconsistency. Error maps and distributions are normalized for visualization.
  • Figure 2: GeLoc3r Architecture at Inference. The model processes image pairs $(I_1, I_2)$ through a shared ViT encoder followed by a ViT decoder with cross-attention. Three task-specific heads produce outputs: (1) a trainable pose head generates the relative camera pose $\mathbf{P}_{regression}$, (2-3) two frozen descriptor heads (pre-trained from MASt3R) output dense features $\mathbf{d}_1, \mathbf{d}_2$ and confidence maps $\mathbf{c}_1, \mathbf{c}_2$. At inference, only the pose regression output is used, while the descriptor outputs are only used during training.
  • Figure 3: GeLoc3r Training with Geometric Consistency Regularization (GCR). During training, the model leverages ground-truth depth maps and camera intrinsics as privileged information. Dense features from frozen descriptor heads are downsampled and concatenated to form correspondence embeddings $\mathbf{e}_j$, which are processed by the FusionTransformer to produce per-correspondence weights $\mathbf{w}$. Simultaneously, GT depth is unprojected to 3D points, transformed using the predicted pose $\mathbf{P}_{regression}$, and projected to form 3D-2D correspondence pairs $\mathcal{C}_{3D-2D}$ (detailed formation process in Appendix \ref{appendix:3d2d_corres}). The weighted RANSAC solver uses these correspondences with FusionTransformer weights and the regression pose as a prior to compute $\mathbf{P}_{solver}$. The consistency loss between $\mathbf{P}_{regression}$ and $\mathbf{P}_{solver}$ provides geometric supervision that teaches the regression network to produce geometrically consistent poses. The pink background highlights the training-only GCR module that is not used during inference.
  • Figure 4: Visual localization trajectory comparison on Cambridge Landmarks GreatCourt. We visualize localization results on 612 test poses spanning a 38.52m trajectory. The green line shows the ground truth camera path. Orange points represent predicted camera positions. Error lines connect each predicted pose to its corresponding ground truth position, where line length visualizes the error magnitude. (a) ReLoc3R exhibits numerous large-error outliers with errors ranging from 0.040m to 123.508m (mean: 5.096m). (b) GeLoc3r achieves visibly tighter alignment with errors ranging from 0.053m to 52.379m (mean: 2.826m), representing a 58% reduction in maximum error and a 45% reduction in mean error. The visualization shows that our geometric consistency regularization effectively mitigates extreme outliers in extended sequences.
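The correspondence formation outlined in the Figure 3 caption (unproject ground-truth depth, transform with the predicted pose, project into the second image) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with intrinsics given as `(fx, fy, cx, cy)`, a pose `(R, t)` that maps camera-1 coordinates to camera-2 coordinates, and a pairing of each camera-1 3D point with its projected pixel in image 2. All function names are hypothetical.

```python
def unproject(u, v, depth, K):
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = K
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def transform(R, t, X):
    """Apply a rigid transform X' = R X + t (R is 3x3, t and X are length 3)."""
    return tuple(sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))

def project(X, K):
    """Project a 3D point to pixel coordinates; None if it lies behind the camera."""
    fx, fy, cx, cy = K
    x, y, z = X
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def build_correspondences(depth_map, K1, K2, R, t):
    """Return (3D point in camera 1, 2D pixel in image 2) pairs.

    depth_map is a list of rows; non-positive depth is treated as invalid.
    """
    pairs = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0.0:
                continue  # skip pixels without valid ground-truth depth
            X1 = unproject(u, v, d, K1)
            uv2 = project(transform(R, t, X1), K2)
            if uv2 is not None:
                pairs.append((X1, uv2))
    return pairs
```

A real pipeline would also discard projections that fall outside the second image and would operate on batched tensors; those steps are omitted here for brevity.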