Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization
Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong
TL;DR
Cross-View Geo-Localization (CVGL) faces large cross-view gaps and viewpoint variability, which make robust cross-domain feature alignment difficult under limited computational resources. The proposed Multi-Level Embedding and Alignment Network (MEAN) combines a lightweight ConvNeXt-Tiny backbone with three branches, namely Progressive Extension Embedding (PEE), Global Extension Embedding (GEE), and Cross-Domain Enhanced Alignment (CEA), and optimizes with a multi-loss framework that includes $\mathcal{L}_{\text{CDA}}$, $\mathcal{L}_{\text{InfoNCE}}$, and $\mathcal{L}_{\text{CE}}$, i.e., $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{CDA}} + \lambda_2 \mathcal{L}_{\text{InfoNCE}} + \lambda_3 \mathcal{L}_{\text{CE}}$. MEAN achieves substantial efficiency gains (approximately a 62.17% reduction in parameters and a 70.99% reduction in GFLOPs) while maintaining or surpassing SOTA accuracy on University-1652 and SUES-200 and exhibiting strong cross-domain generalization. The approach advances CVGL by learning cross-view consistent and modality-invariant embeddings through progressive multi-level enhancement, global-local associations, and adaptive cross-domain calibration, with strong empirical validation and public release of code. This work enables robust CVGL under resource-constrained settings and points toward self-supervised extensions to reduce labeling needs.
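The weighted objective above can be illustrated with a minimal sketch. It is not the authors' implementation: the symmetric form of InfoNCE, the temperature value, the default weights, and the function names `info_nce` and `total_loss` are all assumptions made for illustration, and the CDA and CE terms are passed in as precomputed scalars.

```python
import math


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE over a square drone-to-satellite similarity matrix.

    Assumption: matched pairs sit on the diagonal of `sim`, and the loss is
    averaged over both retrieval directions. The temperature is illustrative.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


def total_loss(l_cda, l_infonce, l_ce, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum lambda_1 * L_CDA + lambda_2 * L_InfoNCE + lambda_3 * L_CE.

    The default weights are placeholders, not values from the paper.
    """
    l1, l2, l3 = lambdas
    return l1 * l_cda + l2 * l_infonce + l3 * l_ce


# Hypothetical similarities for two drone/satellite pairs.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
loss = total_loss(l_cda=0.3, l_infonce=info_nce(sim), l_ce=0.7)
```

In this sketch a similarity matrix with a dominant diagonal yields a smaller InfoNCE term than an uninformative one, so the contrastive part of the objective rewards embeddings that rank the matched view first.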
Abstract
Cross-View Geo-Localization (CVGL) determines the location of a drone image by retrieving the most similar GPS-tagged satellite images. However, the imaging gaps between platforms are often significant and the variations in viewpoint are substantial, which limits the ability of existing methods to associate cross-view features and extract consistent and invariant characteristics. Moreover, existing methods often overlook the increased computational and storage requirements that come with improving model performance. To address these limitations, we propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN). MEAN uses a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, enabling feature communication across levels. This allows MEAN to connect features at different levels and learn robust cross-view consistent mappings and modality-invariant features. In addition, MEAN adopts a shallow backbone network combined with a lightweight branch design, reducing parameter count and computational complexity. Experimental results on the University-1652 and SUES-200 datasets demonstrate that MEAN reduces parameter count by 62.17% and computational complexity by 70.99% compared to state-of-the-art models, while maintaining competitive or even superior performance. Our code and models will be released at https://github.com/ISChenawei/MEAN.
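The retrieval formulation in the first sentence of the abstract can be sketched as follows. This is only an illustration under stated assumptions: embeddings are taken as already computed by some model, cosine similarity is assumed as the matching score, and the names `cosine` and `retrieve` as well as the coordinates and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, k=1):
    """Return the GPS tags of the k gallery entries most similar to `query`.

    `gallery` is a list of (gps, embedding) pairs, one per satellite image.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[1]), reverse=True)
    return [gps for gps, _ in ranked[:k]]


# Hypothetical satellite gallery: (latitude, longitude) with a toy 2-D embedding.
gallery = [
    ((31.0, 121.0), [1.0, 0.0]),
    ((40.0, 116.0), [0.0, 1.0]),
]
drone_embedding = [0.9, 0.1]  # hypothetical drone-view embedding
predicted = retrieve(drone_embedding, gallery)
```

The drone image inherits the GPS tag of its top-ranked satellite match, which is why the quality of the shared embedding space determines localization accuracy.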
