FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
Zimin Xia, Alexandre Alahi
TL;DR
FG$^2$ addresses accurate 3-DoF ground-to-aerial pose estimation. It builds a fine-grained BEV representation from a ground image through 3D lifting and height-aware feature pooling, then matches sparse ground and aerial points and solves for the pose with Procrustes alignment. Compared to the previous state of the art, the method reduces the mean localization error by 28% on the VIGOR cross-area test set, and it is interpretable in that the height-wise selection indicates which ground-image features contribute to the BEV representation. Training relies on weak supervision from the camera pose alone, using a Virtual Correspondence Error and an InfoNCE-based matching loss to encourage semantically consistent correspondences without ground-truth keypoint alignments. The approach is evaluated on both VIGOR and KITTI and supports optional RANSAC at inference for added robustness, which makes it relevant to cross-view localization in outdoor robotics and autonomous systems.
Abstract
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
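The Procrustes alignment step mentioned above can be illustrated with a minimal sketch. It assumes the matched points are given as two 2D arrays of equal length (one from the ground BEV plane, one from the aerial plane) and uses the standard SVD-based (Kabsch) solution for a rigid transform. The function name `procrustes_2d`, the optional `weights` argument, and the array layout are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np


def procrustes_2d(src, dst, weights=None):
    """Least-squares rigid alignment in 2D (hypothetical helper).

    Finds a rotation R (2x2) and translation t (2,) such that
    dst ~= src @ R.T + t, and returns (R, t, yaw) with yaw in radians.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_mean = w @ src
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean

    # Weighted cross-covariance: sum_i w_i * src_i * dst_i^T.
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Guard against a reflection so that det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T

    t = dst_mean - R @ src_mean
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return R, t, yaw
```

Under these assumptions, the returned translation and yaw together form a 3-DoF pose. A robust variant could call the same function on random subsets of matches inside a RANSAC loop and keep the pose with the most inliers; that wrapper is not shown here.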
