Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong

TL;DR

Cross-View Geo-Localization (CVGL) faces large cross-view gaps and viewpoint variability, which make robust cross-domain feature alignment difficult under limited resources. The proposed Multi-Level Embedding and Alignment Network (MEAN) combines a lightweight ConvNeXt-Tiny backbone with three branches: Progressive Extension Embedding (PEE), Global Extension Embedding (GEE), and Cross-Domain Enhanced Alignment (CEA). It is optimized with a multi-loss framework that includes $\mathcal{L}_{\text{CDA}}$, $\mathcal{L}_{\text{InfoNCE}}$, and $\mathcal{L}_{\text{CE}}$, i.e., $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{CDA}} + \lambda_2 \mathcal{L}_{\text{InfoNCE}} + \lambda_3 \mathcal{L}_{\text{CE}}$. MEAN achieves substantial efficiency gains (approximately a 62.17% reduction in parameters and a 70.99% reduction in GFLOPs) while maintaining or surpassing SOTA accuracy on University-1652 and SUES-200 and exhibiting strong cross-domain generalization. The approach advances CVGL by learning cross-view consistent and modality-invariant embeddings through progressive multi-level enhancement, global-local associations, and adaptive cross-domain calibration, with strong empirical validation and public release of code. This work enables robust CVGL under resource-constrained settings and points toward self-supervised extensions to reduce labeling needs.
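
The weighted sum $\mathcal{L}_{\text{total}}$ above can be illustrated with a small sketch. This is not the authors' implementation: the symmetric form of InfoNCE, the temperature value, the default weights, and the function names `info_nce` and `total_loss` are all assumptions made for illustration, and the CDA and CE terms are passed in as already-computed scalars.

```python
import math

def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between drone sample i and satellite
    sample j; matching pairs are assumed to sit on the diagonal.
    The temperature value is an assumption, not taken from the paper.
    """
    n = len(sim)

    def one_direction(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_sum - logits[i]  # -log softmax of the positive pair
        return loss / n

    cols = [list(c) for c in zip(*sim)]  # satellite -> drone direction
    return 0.5 * (one_direction(sim) + one_direction(cols))

def total_loss(l_cda, l_infonce, l_ce, lambdas=(1.0, 1.0, 1.0)):
    """L_total = lambda_1 * L_CDA + lambda_2 * L_InfoNCE + lambda_3 * L_CE.

    The default weights are placeholders (hypothetical).
    """
    l1, l2, l3 = lambdas
    return l1 * l_cda + l2 * l_infonce + l3 * l_ce
```

With a similarity matrix that strongly favors the diagonal, `info_nce` approaches zero; with an uninformative (constant) matrix of size $n$, it equals $\log n$.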

Abstract

Cross-View Geo-Localization (CVGL) determines the location of a drone image by retrieving the most similar GPS-tagged satellite images. However, the imaging gaps between platforms are often significant and the variations in viewpoint are substantial, which limits the ability of existing methods to associate cross-view features and extract consistent and invariant characteristics. Moreover, existing methods often overlook the increased computational and storage requirements that come with improving model performance. To address these limitations, we propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN). MEAN uses a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, enabling feature communication across levels. This allows MEAN to connect features at different levels and learn robust cross-view consistent mappings and modality-invariant features. In addition, MEAN adopts a shallow backbone network combined with a lightweight branch design, reducing parameter count and computational complexity. Experimental results on the University-1652 and SUES-200 datasets demonstrate that MEAN reduces parameter count by 62.17% and computational complexity by 70.99% compared to state-of-the-art models, while maintaining competitive or even superior performance. Our code and models will be released at https://github.com/ISChenawei/MEAN.
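
The retrieval formulation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of the paper: embeddings are plain lists of floats, similarity is cosine similarity, and the helper names `cosine`, `rank_gallery`, and `recall_at_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)

def recall_at_k(queries, query_labels, gallery, gallery_labels, k=1):
    """Fraction of queries whose top-k retrieved items include the true location."""
    hits = 0
    for q, label in zip(queries, query_labels):
        top = rank_gallery(q, gallery)[:k]
        hits += any(gallery_labels[j] == label for j in top)
    return hits / len(queries)
```

In the Drone$\rightarrow$Satellite setting, `queries` would hold drone-view embeddings and `gallery` the satellite-view embeddings, with labels identifying the location; R@1 then corresponds to `recall_at_k(..., k=1)`.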

Paper Structure

This paper contains 21 sections, 16 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The balance between model performance and parameter count. Model performance is measured by R@1 accuracy on the Drone$\rightarrow$Satellite task of the University-1652 dataset. Our method achieves superior performance with a lower parameter count than state-of-the-art (SOTA) methods, demonstrating efficiency in CVGL tasks.
  • Figure 2: The pipeline of the proposed network includes a ConvNeXt-Tiny backbone and three core branches. The progressive extension embedding branch (PEE) learns multi-scale embedding features through progressive multi-scale convolutions optimized by the loss $\mathcal{L}_{\text{InfoNCE}}$ to enhance diverse feature representations and discriminative ability. The global extension embedding branch (GEE) aggregates global and locally generated embedding features optimized by the loss $\mathcal{L}_{\text{CE}}$. The cross-domain enhanced alignment branch (CEA) uses a multi-level fusion and adaptive calibration strategy with a novel loss $\mathcal{L}_{\text{CDA}}$ to dynamically adjust feature consistency within a shared latent space of high-dimensional embeddings. For simplicity, let $i\in\{d,s\}$ denote the drone view ($d$) and satellite view ($s$), and let $\chi\in\{g,m\}$ represent the diversified embedding generator (DEG) module ($g$) and Mean ($m$).
  • Figure 3: Illustration of Global Semantic and Local Geometric Feature Alignment Optimized by Our Proposed CDA Loss using Cosine Similarity and Mean Squared Error.
  • Figure 4: In the cross-view drone navigation task, (a-e) illustrate the intra-class and inter-class distances of features, where intra-class and inter-class distances are represented in blue and green, respectively. (f-j) depict the distribution of feature embeddings in the 2D feature space, with × and pentagrams representing aerial image features and satellite image features, respectively. A total of 40 locations were selected from the test set. Samples with the same color belong to the same location, while those with different colors indicate different locations.
  • Figure 5: Top-5 Retrieval Results of the Proposed MEAN on the University-1652 Dataset.
  • ...and 1 more figure
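
The Figure 3 caption says the CDA loss aligns features using cosine similarity and mean squared error. The exact formulation is not given in this section, so the following is only a hedged sketch of one way the two terms could be combined: the additive form, the weight `alpha`, and the name `cda_style_loss` are assumptions, not the authors' definition.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mse(u, v):
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def cda_style_loss(drone_emb, sat_emb, alpha=1.0):
    """Illustrative alignment loss (assumed form, not the paper's).

    The cosine term penalizes directional disagreement between the two
    views, and the MSE term penalizes element-wise differences.
    alpha is a hypothetical weight balancing the two terms.
    """
    direction_term = 1.0 - cosine_similarity(drone_emb, sat_emb)
    magnitude_term = mse(drone_emb, sat_emb)
    return direction_term + alpha * magnitude_term
```

Under this assumed form, identical drone and satellite embeddings give a loss of zero, while orthogonal unit vectors give a cosine term of 1 plus the weighted MSE.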