GRLoc: Geometric Representation Regression for Visual Localization

Changyang Li, Xuejian Ma, Lixiang Liu, Zhan Li, Qingan Yan, Yi Xu

TL;DR

This work reframes absolute pose estimation as Geometric Representation Regression (GRR), shifting from directly regressing a 6-DoF pose to regressing two explicit geometric representations—a ray bundle for rotation and a 3D pointmap for translation—and recovering the pose with a differentiable solver. The architecture is deliberately decoupled into two branches, enabling rotation and translation to be optimized without mutual interference, and augmented with novel view synthesis (3DGS) and domain-adversarial training to improve generalization to real data. Through end-to-end training with pose, geometry, regularization, and domain losses, GRLoc achieves state-of-the-art performance on indoor 7-Scenes and outdoor Cambridge Landmarks, and shows strong compatibility with refinement approaches. The approach provides better interpretability and robustness by embedding a strong geometric prior and leveraging inverse rendering concepts, offering a scalable path toward generalizable visual localization. Overall, GRLoc demonstrates that modeling the inverse rendering process yields improved generalization and accuracy for absolute pose estimation in diverse environments.

Abstract

Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a ray bundle's directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final 6-DoF camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.
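
To make the "differentiable deterministic solver" concrete, here is a minimal NumPy sketch of how a pose could be recovered from the two predicted representations. It is not the authors' implementation. It assumes known pinhole intrinsics `K`, a Kabsch/Procrustes alignment between camera-frame rays and the predicted world-frame ray directions for rotation, and a least-squares ray intersection through the pointmap for the camera center. The function names (`camera_rays`, `solve_rotation`, `solve_center`) and the synthetic 4 x 4 test grid are illustrative only.

```python
import numpy as np

def camera_rays(K, height, width):
    """Unit view directions in the camera frame, one per pixel center."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # back-project: K^{-1} [u, v, 1]^T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

def solve_rotation(rays_cam, rays_world):
    """Kabsch: camera-to-world rotation R minimizing sum ||R r_i - d_i||^2."""
    H = rays_cam.T @ rays_world              # 3x3 correlation, sum of r_i d_i^T
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

def solve_center(points_world, rays_world):
    """Point closest (least squares) to all lines through X_i along d_i."""
    d = rays_world / np.linalg.norm(rays_world, axis=1, keepdims=True)
    P = np.eye(3)[None] - d[:, :, None] * d[:, None, :]   # projectors I - d d^T
    A = P.sum(axis=0)
    b = np.einsum('nij,nj->i', P, points_world)
    return np.linalg.solve(A, b)

# Synthetic check on a 4 x 4 grid with a made-up pose (assumed values).
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
a, b = 0.3, -0.2
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R_true = Rz @ Rx
c_true = np.array([1.0, -2.0, 0.5])

rays_cam = camera_rays(K, 4, 4)
rays_world = rays_cam @ R_true.T                       # ideal ray bundle
depths = np.linspace(1.0, 3.0, rays_cam.shape[0])
points_world = c_true + depths[:, None] * rays_world   # ideal pointmap

R_est = solve_rotation(rays_cam, rays_world)
c_est = solve_center(points_world, rays_world)
```

Both steps reduce to an SVD or a 3 x 3 linear solve, so gradients can flow through them in an autodiff framework. In this sketch the camera center is computed from the pointmap together with the ray directions; a solver that keeps translation fully independent of the rotation branch, as the decoupled design suggests, would need a different formulation.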

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Localization errors on the 7-Scenes (Shotton et al., 2013) and Cambridge Landmarks (Kendall et al., 2015) datasets. Our GRLoc outperforms prior APR methods (green) and is competitive with leading RPR (orange) and PPR (blue) approaches. Our refined model GRLoc$_{\text{ref}}$ achieves results comparable to SOTA PPR methods.
  • Figure 2: Our Geometric Representation Regression (GRR) paradigm is inspired by Novel View Synthesis (NVS). While NVS performs the forward rendering process, GRR learns the inverse.
  • Figure 3: Geometric representations for an example $4 \times 4$ grid: (1) a ray bundle of world-space view directions and (2) a pointmap of 3D points.
  • Figure 4: An overview of our proposed architecture. Left: the query image is fed into a decoupled dual-branch network. Each branch's feature extractor produces spatial features for a prediction head and a global feature for a domain classifier. The ray head and point head predict their respective geometric representations. The final pose components are then analytically recovered from these representations using differentiable deterministic solvers. Right: The feature extractor fuses multi-level features from the backbone via $1 \times 1$ convolutions, resizing, and concatenation to produce the spatial feature, while the backbone's original final output is used as the global feature.
  • Figure 5: Visualization of GRLoc's predictions. $\dagger$: Cambridge Landmarks (Kendall et al., 2015); $*$: 7-Scenes (Shotton et al., 2013). $\Lambda$ indicates the scene-specific upper limit of the translation error. Sequential error bars show rotation errors.
  • ...and 1 more figure
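
The feature extractor described in the Figure 4 caption (multi-level backbone features fused via $1 \times 1$ convolutions, resizing, and concatenation) can be illustrated with a small NumPy sketch. This is a shape-level illustration under assumptions, not the paper's network: the channel counts, the nearest-neighbor resize, the random weights, and the global average pooling used to turn the final backbone output into a vector are all hypothetical choices, as are the function names.

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution: mix channels per pixel. feat is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, feat)

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a (C, H, W) feature map (assumed resize mode)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]

def fuse(levels, weights, out_h, out_w):
    """Project each level with a 1x1 conv, resize to a common grid, concatenate channels."""
    projected = [resize_nearest(conv1x1(f, w), out_h, out_w)
                 for f, w in zip(levels, weights)]
    spatial = np.concatenate(projected, axis=0)
    # Assumption: the global feature is the final backbone output, average-pooled.
    global_feat = levels[-1].mean(axis=(1, 2))
    return spatial, global_feat

rng = np.random.default_rng(0)
# Three hypothetical backbone levels: (channels, height, width).
levels = [rng.normal(size=(8, 16, 16)),
          rng.normal(size=(16, 8, 8)),
          rng.normal(size=(32, 4, 4))]
# One 1x1 projection per level, each mapping to 4 channels.
weights = [rng.normal(size=(4, f.shape[0])) for f in levels]

spatial, global_feat = fuse(levels, weights, 8, 8)
```

In the architecture the caption describes, the spatial feature would feed the ray or point head and the global feature would feed the domain classifier, with one such extractor per branch.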