
$L^3$: Scene-agnostic Visual Localization in the Wild

Yu Zhang, Muhua Zhu, Yifei Xue, Tie Ji, Yizhen Lao

TL;DR

By performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations.

Abstract

Standard visual localization methods typically require offline preprocessing of scenes to obtain 3D structural information for better performance. This inevitably introduces additional computational and time costs, as well as the overhead of storing scene representations. Can we visually localize in a wild scene without any offline preprocessing step? In this paper, we leverage the online inference capabilities of feed-forward 3D reconstruction networks to propose a novel map-free visual localization framework, $L^3$. Specifically, by performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations. Extensive experiments demonstrate not only that $L^3$ performs comparably to state-of-the-art solutions on various benchmarks, but also that it is significantly more robust in sparse scenes (fewer reference images per scene).

Paper Structure (11 sections, 7 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Comparison between scene-specific and scene-agnostic visual localization paradigms. Scene-specific methods require extensive offline preprocessing such as reconstruction or per-scene network training. In contrast, our proposed scene-agnostic visual localization framework $L^3$ directly estimates query poses via feed-forward coarse localization and PnP refinement, generalizing to novel scenes without requiring scene representations or preprocessing.
  • Figure 2: Comparison of dense and sparse scenes. Ground truth query poses are shown in green and reference poses in cyan. The poses predicted by each method are shown in a different color: red for ACE [brachmann2023accelerated], magenta for ACE+GS-CPR [liugs], and blue for our proposed $L^3$. Dense Scene: Localization using all 1000 reference images. Sparse Scene: We subsample the 1000 reference images and retain only 20. Our $L^3$ significantly outperforms the other baselines in this challenging case.
  • Figure 3: Overview of $L^3$. Coarse Localization: Given a query image $I_q$ and retrieved references $\{I_{r,i}\}_{i=1}^K$, we first perform feed-forward 3D reconstruction to predict local point clouds $\mathcal{P}_{r}^{local}$, query pose $\mathbf{P}_{q}^{local}$ and reference poses $\mathbf{P}_{r}^{local}$. A scale estimation module then computes the scale factor $S$ via a two-stage strategy to initialize the pose $\mathbf{P}_q^{\text{init}}$. Pose Refinement: The pose is refined via structure optimization and PnP to yield the final 6-DoF pose $\mathbf{P}_q$. Output: We compare the query image (top-left) with its rendering under the predicted pose (bottom-right), separated by a diagonal line.
  • Figure 4: Scale Estimation Strategy. Stage 1: Absolute depths are triangulated using ground truth (GT) reference poses. If the median ratio $S_{tri}$ between GT and local depths (from $\mathcal{P}_{r}^{local}$) has a deviation $d_{tri}$ below a threshold, $S_{tri}$ is adopted. Stage 2: Otherwise, we align the predicted trajectory $\mathbf{P}_{r}^{local}$ with the GT trajectory $\mathbf{P}_{r}^{GT}$ via rotation $R_{\text{align}}$. A RANSAC scheme then yields the scale $S_{traj}$ by minimizing Euclidean distance error. $S_{traj}$ is accepted if $d_{traj} < d_{tri}$; otherwise we fall back to $S_{tri}$.
  • Figure 5: Robustness analysis under increasing sparsity. We plot the log-scale translation error growth relative to the dense setting on three datasets. "✗" indicates localization failure beyond that sparsity level.
  • ...and 2 more figures
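
The two-stage scale estimation described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not define how the deviations $d_{tri}$ and $d_{traj}$ are computed, so the sketch assumes a normalized median absolute deviation for $d_{tri}$ and a normalized median alignment residual for $d_{traj}$. The function names, the threshold `tau`, the iteration count, and the use of a Kabsch rotation for $R_{\text{align}}$ are likewise hypothetical choices. Inputs are assumed to be matched metric and local depths plus at least three non-collinear reference camera centers.

```python
import numpy as np


def stage1_depth_scale(metric_depths, local_depths):
    """Stage 1: median ratio of triangulated metric depths to local depths."""
    ratios = np.asarray(metric_depths, float) / np.asarray(local_depths, float)
    s_tri = float(np.median(ratios))
    # Assumed deviation measure: median absolute deviation relative to the median.
    d_tri = float(np.median(np.abs(ratios - s_tri)) / s_tri)
    return s_tri, d_tri


def kabsch_rotation(src, dst):
    """Rotation R minimizing sum ||R @ src_i - dst_i||^2 for centered point sets."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    # Flip the last axis if needed so that the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def stage2_trajectory_scale(local_centers, gt_centers, iters=200, seed=0):
    """Stage 2: rotate the local trajectory onto the GT one, then pick the
    pairwise-baseline scale with the smallest median residual (RANSAC-style)."""
    rng = np.random.default_rng(seed)
    src = np.asarray(local_centers, float)
    dst = np.asarray(gt_centers, float)
    src = src - src.mean(axis=0)
    dst = dst - dst.mean(axis=0)
    aligned = src @ kabsch_rotation(src, dst).T
    # Normalizer so that d_traj is dimensionless, like d_tri (an assumption).
    extent = np.median(np.linalg.norm(dst, axis=1))
    best_s, best_d = None, np.inf
    for _ in range(iters):
        i, j = rng.choice(len(src), size=2, replace=False)
        base = np.linalg.norm(src[i] - src[j])
        if base < 1e-9:
            continue  # degenerate pair, no usable baseline
        s = np.linalg.norm(dst[i] - dst[j]) / base
        d = np.median(np.linalg.norm(s * aligned - dst, axis=1)) / extent
        if d < best_d:
            best_s, best_d = float(s), float(d)
    return best_s, best_d


def estimate_scale(metric_depths, local_depths, local_centers, gt_centers, tau=0.05):
    """Adopt S_tri if its deviation is small; otherwise try S_traj and keep
    whichever of the two has the lower deviation."""
    s_tri, d_tri = stage1_depth_scale(metric_depths, local_depths)
    if d_tri < tau:
        return s_tri
    s_traj, d_traj = stage2_trajectory_scale(local_centers, gt_centers)
    return s_traj if d_traj < d_tri else s_tri
```

Sampling camera pairs keeps the candidate scales independent of the rotation estimate, since baseline lengths do not change under rotation; the rotation only enters when each candidate is scored against the full trajectory.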