FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching

Zimin Xia, Alexandre Alahi

TL;DR

FG$^2$ tackles accurate 3-DoF ground-to-aerial pose estimation. It builds a fine-grained BEV representation from a ground image through 3D lifting and height-aware feature pooling, matches sparse ground–aerial points, and solves for the pose with Procrustes alignment. Compared to the previous state of the art, the method reduces the mean localization error by 28% on the VIGOR cross-area test set. It is also interpretable: the learned height selection indicates which ground-image features contribute to the BEV representation. Training is weakly supervised by the camera pose, using a Virtual Correspondence Error and an InfoNCE-based matching loss to encourage semantically consistent correspondences without ground-truth keypoint matches. The approach generalizes across datasets (VIGOR and KITTI) and supports optional RANSAC at inference for robustness, which makes it relevant to outdoor robotics and autonomous systems.

Abstract

We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
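The Procrustes alignment step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the sampled matches are given as two arrays of 2D BEV coordinates plus optional per-match weights, and it uses the standard SVD-based (Kabsch) solution. The function name `procrustes_2d` and all variable names are hypothetical.

```python
import numpy as np

def procrustes_2d(src, dst, weights=None):
    """Rigid 2D alignment: find R (2x2) and t (2,) that minimize
    sum_i w_i * ||R @ src[i] + t - dst[i]||^2 (standard Kabsch solution)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_mean = w @ src
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean

    # Weighted cross-covariance H = sum_i w_i * src_c[i] * dst_c[i]^T.
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_mean - R @ src_mean

    # 3-DoF pose: planar translation t and yaw angle recovered from R.
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return R, t, yaw

# Usage with synthetic matches (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
ground_pts = rng.uniform(-10.0, 10.0, size=(50, 2))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.5])
aerial_pts = ground_pts @ R_true.T + t_true

R_est, t_est, yaw_est = procrustes_2d(ground_pts, aerial_pts)
```

With noise-free matches the estimate recovers the true rotation and translation up to floating-point error. In practice the matches contain outliers, which is where the optional RANSAC step mentioned in the TL;DR would come in: the same solver can be run on random subsets of matches and the pose with the most inliers kept.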

Paper Structure

This paper contains 21 sections, 16 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Fine-grained cross-view localization estimates the 3 DoF pose of a ground image (left) on an aerial image (right) of the surroundings. Our method tackles this task by matching fine-grained local features across views, providing interpretable results. The matched correspondences are semantically consistent and learned without ground truth labels. Here are a few selected predictions.
  • Figure 1: Projecting a 3D point to a panoramic image. We use the original image for visualization purposes. In practice, we find the projected pixel coordinates in the extracted feature map.
  • Figure 2: Overview of our proposed method and the objectives of the loss functions used. We define two sets of points, $\xi^G$ and $\xi^A$, on a BEV plane, and our method generates a descriptor for each point. For the ground view, this involves mapping the image feature $f(G)$ to 3D and then selecting the important features along height. Next, we compute pairwise matching scores between the two point sets. The pose is then computed using Procrustes alignment based on the sampled matches. $\mathcal{L}_{\text{VCE}}$ minimizes the difference between a virtual point set transformed using the predicted pose and the ground truth pose. $\mathcal{L}_{M}$ encourages correspondences found using the ground truth pose.
  • Figure 2: Perspective projection of a 3D point. We use the original image for visualization purposes. In practice, we find the projected pixel coordinates in the extracted feature map.
  • Figure 3: Fine-grained feature matching results on VIGOR with known orientation. We show the 20 matches with the highest similarity scores. We find the 3D points using the selected height in the last pooling to BEV step and then project those points to the ground image.
  • ...and 4 more figures
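
The caption on projecting a 3D point to a panoramic image can be made concrete with a small sketch. It assumes an equirectangular panorama and a hypothetical camera frame (x forward, y left, z up, with the forward direction at the image centre); the paper's actual conventions are not given in this summary and may differ. The function name `project_to_panorama` is hypothetical.

```python
import numpy as np

def project_to_panorama(points, width, height):
    """Project 3D points of shape (N, 3) to equirectangular pixel coordinates.

    Assumed convention (not taken from the paper): camera at the origin,
    x forward, y left, z up; the forward direction maps to the image centre
    and columns increase to the right.
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    azimuth = np.arctan2(y, x)                  # in [-pi, pi], positive to the left
    elevation = np.arctan2(z, np.hypot(x, y))   # in [-pi/2, pi/2], positive upward

    u = (0.5 - azimuth / (2.0 * np.pi)) * width   # column coordinate
    v = (0.5 - elevation / np.pi) * height        # row coordinate, 0 at the top
    u = np.mod(u, width)                          # wrap around the 360-degree seam
    return np.stack([u, v], axis=1)

# Usage on a feature map of hypothetical size 64 x 32 (width x height).
pts = np.array([[1.0, 0.0, 0.0],    # straight ahead
                [0.0, 1.0, 0.0],    # to the left
                [0.0, 0.0, 1.0]])   # straight up
uv = project_to_panorama(pts, width=64, height=32)
```

Passing the feature-map size instead of the image size yields coordinates directly in the extracted feature map, as the caption describes. The resulting sub-pixel coordinates would typically be used with bilinear sampling.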