Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

Zimin Xia, Chenghao Xu, Alexandre Alahi

TL;DR

An accurate and interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. The method directly learns ground-aerial image-plane correspondences using weak supervision from camera poses.

Abstract

We propose an accurate and interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. Unlike prior approaches that rely on global descriptors or bird's-eye-view (BEV) transformations, our method directly learns ground-aerial image-plane correspondences using weak supervision from camera poses. The matched ground points are lifted into BEV space with monocular depth predictions, and scale-aware Procrustes alignment is then applied to estimate camera rotation, translation, and optionally the scale between relative depth and the aerial metric space. This formulation is lightweight, end-to-end trainable, and requires no pixel-level annotations. Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation. Furthermore, our method offers strong interpretability: correspondence quality directly reflects localization accuracy and enables outlier rejection via RANSAC, while overlaying the re-scaled ground layout on the aerial image provides an intuitive visual cue of localization performance.
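
To make the alignment step in the abstract concrete, the following is a minimal sketch of a scale-aware Procrustes fit between two sets of matched 2D points. It is not the authors' implementation: it uses the standard closed-form (Umeyama-style) least-squares solution, and the function name `procrustes_2d`, the arguments `src`, `dst`, `weights`, and the choice of NumPy are assumptions made for illustration. Here `src` stands in for ground points already lifted into BEV coordinates with relative depth, and `dst` for their matched locations in the aerial metric space.

```python
import numpy as np

def procrustes_2d(src, dst, weights=None, estimate_scale=True):
    """Fit dst ~= s * R @ src + t in the (weighted) least-squares sense.

    src, dst: (N, 2) arrays of matched points (hypothetical inputs).
    weights:  optional (N,) non-negative weights, e.g. matching scores.
    Returns a 2x2 rotation R, a translation t of shape (2,), and a scale s.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one

    # Weighted centroids and centered point sets.
    mu_src = w @ src
    mu_dst = w @ dst
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Weighted cross-covariance between target and source (2x2).
    cov = (dst_c * w[:, None]).T @ src_c
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[1, 1] = -1.0
    R = U @ D @ Vt

    if estimate_scale:
        # Scale relating relative-depth units to the target metric units.
        var_src = (w * (src_c ** 2).sum(axis=1)).sum()
        s = (S * np.diag(D)).sum() / var_src
    else:
        s = 1.0

    t = mu_dst - s * (R @ mu_src)
    return R, t, s
```

With `estimate_scale=False` the sketch reduces to a rigid fit, which corresponds to the case where the depth is already metric. Because the solution is closed-form and needs only a few correspondences, it can also be run on random subsets inside a RANSAC loop to reject outlier matches, in the spirit of the outlier rejection mentioned above.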

Paper Structure

This paper contains 27 sections, 17 equations, 14 figures, 16 tables.

Figures (14)

  • Figure 1: Loc$^{2}$: Interpretable cross-view localization via local feature matching. Loc$^{2}$ establishes accurate correspondences between aerial and ground views, with colors indicating distinct correspondence regions. Using the estimated rotation, translation, and scale, the ground view is warped onto the aerial image, providing a visual interpretation of localization quality.
  • Figure 2: Overview of our proposed method. Our method first matches local features between ground and aerial images. The matched ground points are then lifted to the BEV space using monocular depth priors. By aligning these correspondences using scale-aware Procrustes alignment, we estimate the rotation, translation, and scale between the ground and aerial views.
  • Figure 3: Local feature matching results on the VIGOR same-area test set under unknown orientation. We visualize the top 50 correspondences, ranked by matching score.
  • Figure 4: Outlier detection using RANSAC on VIGOR same/cross-area test sets.
  • Figure 5: Ground layout overlaid on the aerial image after applying the predicted rotation, translation, and scale transformations. The alignment directly reflects localization quality: the first three examples show successful localization, while the last one illustrates a failure case. Notably, the alignment in example (c) helped us identify an error in the ground-truth location.
  • ...and 9 more figures