Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li

TL;DR

This work tackles drone-view geo-localization (DVGL) without relying on paired drone–satellite data by introducing DMNIL, an end-to-end self-supervised framework with a shallow backbone. It combines a dual-path contrastive baseline with two novel modules: Dynamic Hierarchical Memory Learning to enhance intra-view discriminability, and Information Consistency Evolution Learning to enforce cross-view alignment via neighborhood-driven constraints and mutual information optimization, plus a Pseudo-Label Enhancement strategy. On University-1652, SUES-200, and DenseUAV, DMNIL achieves state-of-the-art results among self-supervised methods and even matches or surpasses several supervised approaches, demonstrating strong cross-domain generalization and data efficiency. The approach offers practical impact for open-world DVGL by reducing annotation costs while maintaining high localization accuracy.

Abstract

Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
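As a rough illustration of the idea described in the abstract (pseudo-labels from clustering, then contrastive learning against a memory of cluster representations), the following Python sketch builds a per-view memory of cluster centroids, updates it with momentum, and computes an InfoNCE-style loss. It is not the authors' implementation: the function names, the temperature of 0.05, the momentum of 0.9, and the assumption that pseudo-labels are contiguous integers starting at 0 (with -1 marking outliers) are all illustrative choices. In a dual-path setup, one such memory would presumably be kept for drone features and another for satellite features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def init_memory(features, pseudo_labels):
    """Build one normalized centroid per cluster.

    Assumption: labels are contiguous integers 0..K-1, and -1 marks
    outliers that are skipped (a common clustering convention).
    """
    ids = np.unique(pseudo_labels[pseudo_labels >= 0])
    centroids = np.stack([features[pseudo_labels == c].mean(axis=0) for c in ids])
    return l2_normalize(centroids)

def momentum_update(memory, feature, label, momentum=0.9):
    """Return a copy of the memory with one entry moved toward a new feature."""
    updated = memory.copy()
    mixed = momentum * memory[label] + (1.0 - momentum) * feature
    updated[label] = l2_normalize(mixed)
    return updated

def cluster_contrastive_loss(features, labels, memory, temperature=0.05):
    """InfoNCE-style loss: each feature should match its own cluster entry."""
    logits = features @ memory.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())
```

A typical use would filter out outliers first (`keep = pseudo_labels >= 0`), call `init_memory` once per epoch after re-clustering, and then alternate `cluster_contrastive_loss` and `momentum_update` over mini-batches for each view.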

Paper Structure

This paper contains 18 sections, 34 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Motivation, Challenges, and Evaluation. (A-B): Motivation of the proposed DMNIL, which aims to explore the latent relationships between cross-view features without relying on paired data or multi-stage training, and without seeking intermediate transitional states to bridge the cross-view gap. (C-D): As representative cross-view retrieval tasks, ReID and DVGL share the objective of establishing correspondences between query and gallery images based on discriminative feature representations. The progress achieved by ReID, especially in self-supervised settings, offers a methodological foundation and useful paradigms for DVGL research. However, DVGL is more challenging for feature learning and discrimination than ReID. (C): ReID tasks have relatively small intra-class variations and distinct inter-class differences, which facilitates feature discrimination. (D): In contrast, DVGL tasks suffer from large intra-class variations and small inter-class differences, leading to cross-view gaps and intra-view ambiguities that hinder feature discrimination. (E): Recall@k of the ConvNeXt-Tiny backbone alone [liu2022convnet], pre-trained on ImageNet-22k [deng2009imagenet], evaluated on the DVGL dataset University-1652 [zheng2020university] and on four person ReID datasets: MSMT17 [wei2018person], DukeMTMC-reID [zheng2017discriminatively], CUHK03 [sun2014deep], and Market1501 [zheng2015scalable]. (F): DMNIL compared with state-of-the-art self-supervised and supervised methods; it outperforms existing self-supervised methods and even surpasses several supervised methods.
  • Figure 2: Pipeline Overview. The DMNIL consists of a lightweight backbone, a dual-path contrastive learning strategy, a dynamic hierarchical memory learning module, and an information consistency evolution module. Specifically, the dual-path contrastive learning is designed to learn discriminative and consistent intra-view feature representations. The dynamic hierarchical memory module further captures intra-view feature variations under different viewpoints and scales, thereby enhancing the robustness and discriminability of the learned representations. The information consistency evolution module focuses on modeling cross-view feature consistency through a neighborhood-driven learning strategy, and further improves the training process by integrating a pseudo-label enhancement strategy.
  • Figure 3: Accuracy of Pseudo-Labels. (a-b): Evolution of pseudo-label quality during training on the University-1652 dataset. (c-d): Evolution of pseudo-label quality during training on the DenseUAV dataset. "Pseudo-labels (w/o PLE)" and "Pseudo-labels (w/ PLE)" denote models trained without and with the proposed pseudo-label enhancement (PLE) strategy, respectively. The upper plots show the number of valid samples, correct and erroneous pseudo-labels, and outliers obtained at each epoch, while the lower plots present the corresponding pseudo-label accuracy. The abbreviations "Max," "Min," and "Avg" in the table denote the maximum, minimum, and average values, respectively.
  • Figure 4: Performance influence of momentum factors. (a) Performance influence of the momentum factor $\alpha$ in the baseline on the University-1652 dataset. (b) Performance influence of the momentum factor $\xi$ in the DHML module on the University-1652 dataset.
  • Figure 5: Performance influence of the parameters $k_1$ and $k_2$. (a) Performance influence of the parameter $k_1$ in the ICEL module on the University-1652 dataset. (b) Performance influence of the parameter $k_2$ in the ICEL module on the University-1652 dataset.
  • ...and 3 more figures
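
Figure 1(E) reports Recall@k. For readers unfamiliar with the metric, the sketch below shows one common way to compute it for cross-view retrieval: a query counts as a hit if any of its k most similar gallery items shares its location ID. The function name, the argument names, and the use of cosine similarity are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def recall_at_k(query_feats, gallery_feats, query_ids, gallery_ids, k=1):
    """Fraction of queries whose top-k gallery matches contain the right ID."""
    # Cosine similarity via L2-normalized dot products.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T
    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A hit means at least one retrieved item carries the query's ID.
    hits = (gallery_ids[topk] == query_ids[:, None]).any(axis=1)
    return float(hits.mean())
```

In a drone-to-satellite setting, `query_feats` would hold drone-image embeddings and `gallery_feats` the satellite-image embeddings, with both ID arrays given as NumPy integer arrays.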