(MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

Minglei Li, Mengfan He, Chao Chen, Ziyang Meng

TL;DR

This paper addresses cross-view geo-localization under severe 3D geometric misalignment by introducing a geometry-grounded framework, (MGS)², that couples Macro-Geometric Structure Filtering (MGSF) with Micro-Geometric Scale Adaptation (MGSA) and a Geometric-Appearance Contrastive Distillation (GACD) loss. MGSA uses depth priors to dynamically fuse multi-scale features, while MGSF uses depth-derived macro-gradients and normal clustering to suppress view-dependent vertical facades and emphasize horizontal planes. GACD enforces a geometric-prior-based ranking that rewards roof-centric activations over facade distractions, and the training objective combines a triplet loss, $L_{triplet}$, with the GACD loss, $L_{GACD}$. Experiments on University-1652 and SUES-200 show state-of-the-art Recall@1 and strong cross-dataset generalization, with qualitative visualizations confirming effective suppression of vertical artifacts. Overall, the approach demonstrates that explicit 3D structure priors substantially improve cross-view localization in urban environments and offers a path toward robust, geometry-aware CVGL systems in GNSS-denied scenarios.
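To make the idea of depth-derived gradient filtering concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a large depth-gradient magnitude marks steep or discontinuous structure to be down-weighted, and that flat regions should keep full weight. The function names, the exponential weighting, and the `dilation` and `temperature` parameters are all hypothetical choices made for illustration; the normal-clustering step mentioned above is not shown.

```python
import math


def dilated_gradient_magnitude(depth, dilation=2):
    """Central-difference gradient magnitude of a 2D depth map.

    The difference is taken between pixels `dilation` steps apart
    (clamped at the borders), which smooths over pixel-level noise.
    """
    height, width = len(depth), len(depth[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            x0, x1 = max(x - dilation, 0), min(x + dilation, width - 1)
            y0, y1 = max(y - dilation, 0), min(y + dilation, height - 1)
            gx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            gy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            magnitude[y][x] = math.hypot(gx, gy)
    return magnitude


def soft_structure_weights(depth, dilation=2, temperature=1.0):
    """Map gradient magnitude to a weight in (0, 1]; flat regions get 1.0.

    The exponential form is an assumption for illustration only.
    """
    magnitude = dilated_gradient_magnitude(depth, dilation)
    return [[math.exp(-m / temperature) for m in row] for row in magnitude]


def reweight_features(features, weights):
    """Element-wise product of a single-channel feature map and the weights."""
    return [[f * w for f, w in zip(f_row, w_row)]
            for f_row, w_row in zip(features, weights)]


# Toy depth map: constant depth on the left, steep depth ramp on the right.
depth = [[10, 10, 10, 10, 14, 18] for _ in range(4)]
weights = soft_structure_weights(depth, dilation=2, temperature=1.0)
features = [[1.0] * 6 for _ in range(4)]
filtered = reweight_features(features, weights)
```

In this toy example the constant-depth columns keep their full response, while responses on the ramp are attenuated in proportion to how steep the depth change is.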

Abstract

Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)$^2$, a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)$^2$ achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: https://github.com/GabrielLi1473/MGS-Net.
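For readers unfamiliar with the reported metric, Recall@1 is the fraction of queries whose single nearest gallery item shares the query's location label. The sketch below illustrates that definition on toy data. It assumes cosine similarity between embeddings; the function names and the toy vectors are hypothetical and unrelated to the paper's code or results.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(query_embs, query_labels, gallery_embs, gallery_labels):
    """Fraction of queries whose top-1 gallery match has the same label."""
    hits = 0
    for q_emb, q_label in zip(query_embs, query_labels):
        best_index = max(
            range(len(gallery_embs)),
            key=lambda i: cosine_similarity(q_emb, gallery_embs[i]),
        )
        if gallery_labels[best_index] == q_label:
            hits += 1
    return hits / len(query_embs)


# Toy data: three drone-view queries against three satellite-view references.
queries = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
query_labels = ["A", "B", "C"]
gallery = [[0.9, 0.0], [0.0, 0.8], [0.7, -0.7]]
gallery_labels = ["A", "B", "C"]

score = recall_at_1(queries, query_labels, gallery, gallery_labels)
```

Here the first two queries retrieve the correct reference and the third does not, so two of the three queries count as hits.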

Paper Structure

This paper contains 33 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: From Texture Dependency to Geometric Grounding. (a) Visual Ambiguity: Existing methods often overfit to view-dependent vertical facades (red boxes) that are invisible in the satellite orthophoto, leading to retrieval failure. (b) Our Solution: By explicitly modeling 3D macro-geometric structures, our MGS-Net filters out these vertical artifacts via MGSF and robustly focuses on view-invariant rooftops (green boxes), ensuring consistent cross-view alignment. The image materials are from the University-1652 dataset.
  • Figure 2: The overall framework of the proposed (MGS)². It mainly consists of the Micro-Geometric Scale Adaptation (MGSA) module to handle scale variations and the Macro-Geometric Structure Filtering (MGSF) module to suppress view-dependent artifacts.
  • Figure 3: Visualization of the Macro-Geometric Structure Filtering (MGSF) Mechanism. We visualize the feature response maps before and after applying MGSF. (a) Original Features: The backbone naively activates on high-frequency vertical textures. (b) Enhanced Features: MGSF redistributes attention based on geometric priors. Note that black boxes indicate the effective suppression of view-dependent vertical facades, while white boxes highlight the adaptive enhancement of view-invariant horizontal planes.
  • Figure 4: Qualitative image retrieval results for the University-1652 dataset. (Top) Top 5 retrieval results for target localization in drone view. (Bottom) Top 5 retrieval results for drone navigation. The first and third rows show retrieval results for (MGS)²; the second and fourth rows show retrieval results for the baseline (DINOv2 + SALAD). Green boxes indicate correctly matched images; red boxes indicate incorrectly matched images.