Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization
Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li
TL;DR
This work tackles drone-view geo-localization (DVGL) without paired drone–satellite data by introducing DMNIL, an end-to-end self-supervised framework built on a shallow backbone. It extends a dual-path contrastive baseline with two novel modules and one auxiliary strategy: Dynamic Hierarchical Memory Learning, which enhances intra-view discriminability; Information Consistency Evolution Learning, which enforces cross-view alignment through neighborhood-driven constraints and mutual information optimization; and a Pseudo-Label Enhancement strategy. On University-1652, SUES-200, and DenseUAV, DMNIL achieves state-of-the-art results among self-supervised methods and matches or surpasses several supervised approaches, demonstrating strong cross-domain generalization and data efficiency. The approach supports open-world DVGL in practice by reducing annotation costs while maintaining high localization accuracy.
Abstract
Drone-view Geo-Localization (DVGL) aims to localize drones accurately by retrieving the most relevant GPS-tagged satellite images. However, most existing methods rely heavily on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules: the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module uses a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, thereby improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to improve the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
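To make the idea of pseudo-label-driven contrastive learning concrete, the following is a minimal sketch of a generic cluster-memory contrastive step for a single view. It is not the DMNIL implementation: the function names, the momentum and temperature values, and the toy features are all assumptions made for illustration, and the pseudo-labels are taken as given rather than produced by a clustering algorithm. It does not model the short-term/long-term memory of DHML or the cross-view constraints of ICEL.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_memory(features, pseudo_labels):
    """Hypothetical helper: one normalized centroid per pseudo-label."""
    sums, counts = {}, {}
    for f, y in zip(features, pseudo_labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: l2_normalize([x / counts[y] for x in acc])
            for y, acc in sums.items()}

def memory_contrastive_loss(feature, label, memory, temperature=0.05):
    """InfoNCE-style loss of one feature against all cluster centroids."""
    q = l2_normalize(feature)
    logits = {y: dot(q, c) / temperature for y, c in memory.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_denom - logits[label]

def momentum_update(memory, feature, label, momentum=0.9):
    """Move the centroid of `label` slightly toward the new feature."""
    q = l2_normalize(feature)
    mixed = [momentum * c + (1.0 - momentum) * x
             for c, x in zip(memory[label], q)]
    memory[label] = l2_normalize(mixed)

# Toy drone-view features with assumed pseudo-labels (two clusters).
drone_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
drone_labels = [0, 0, 1, 1]

memory = build_memory(drone_feats, drone_labels)
loss_correct = memory_contrastive_loss(drone_feats[0], 0, memory)
loss_wrong = memory_contrastive_loss(drone_feats[0], 1, memory)
momentum_update(memory, drone_feats[1], 0)
```

In a dual-path setup, one such memory would be kept per view (drone and satellite), with each path trained against its own centroids. A feature scored against its own cluster yields a lower loss than when scored against another cluster, which is the signal that drives intra-view discriminability in this kind of scheme.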
