
Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception

Haoyuan Li, Wen Yang, Fang Xu, Hong Tan, Haijian Zhang, Shengyang Li, Gui-Song Xia

Abstract

Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
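The abstract describes the Satellite-wise Attention Block only at a high level: each retrieved satellite candidate interacts with the reconstructed UAV scene in isolation, so candidates cannot interfere with one another and the cost grows linearly with the number of candidates. The following is a minimal illustrative sketch of that isolation idea, not the paper's implementation: plain NumPy, a single head, no learned query/key/value projections, and all names (`satellite_wise_attention`, token shapes) are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def satellite_wise_attention(sat_tokens, uav_tokens, d):
    """Cross-attention from each satellite candidate to the shared UAV scene.

    sat_tokens: (K, M, d) tokens for K satellite candidates, M tokens each.
    uav_tokens: (N, d) tokens of the reconstructed UAV scene.
    Each candidate attends to the UAV tokens independently, so candidates
    never exchange information and total cost is O(K * M * N): linear in K.
    """
    out = np.empty_like(sat_tokens)
    for k in range(sat_tokens.shape[0]):   # one candidate at a time
        q = sat_tokens[k]                  # (M, d) queries from candidate k
        attn = softmax(q @ uav_tokens.T / np.sqrt(d))  # (M, N) weights
        out[k] = attn @ uav_tokens         # (M, d) UAV-conditioned features
    return out
```

Because the loop body touches only candidate `k`, running the block on one candidate alone yields exactly the same features as running it on the full batch, which is the "no inter-candidate interference" property; dense joint attention over all `K * M` satellite tokens at once would not have this guarantee.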

Paper Structure

This paper contains 24 sections, 18 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Framework Comparison. Comparison between existing two-stage UAV geo-localization methods and our unified geo-localization framework.
  • Figure 2: Overview of the unified UAV geo-localization pipeline. Given a sequence of oblique UAV query images, the pipeline extracts the representation using VGGT to reconstruct a local 3D scene. Based on this shared representation, a satellite-aligned BEV is rendered to support coarse-level satellite retrieval. The retrieved satellite candidates are then independently aligned with the reconstructed 3D scene within the same feature space, enabling fine-grained regression of the absolute 3-DoF UAV pose (GPS coordinates and heading).
  • Figure 3: Satellite-wise Attention Block. The proposed satellite-wise attention block lets the tokens of each satellite candidate attend to the UAV tokens independently, followed by candidate validation.
  • Figure 4: Similarity Score of Features.
  • Figure 5: Geometric Relationship between UAV Camera and Satellite Tiles. Visualization of the University-Pose dataset geometry. The red arrow represents the 3-DoF UAV pose. Notably, the dataset contains spatially overlapped satellite tiles, providing dense candidates that facilitate robust pose estimation.
  • ...and 6 more figures