View Distribution Alignment with Progressive Adversarial Learning for UAV Visual Geo-Localization
Cuiwei Liu, Jiahao Liu, Huaijun Qiu, Zhaokui Li, Xiangbin Shi
TL;DR
The paper introduces PVDA, an end-to-end framework for UAV visual geo-localization that directly addresses the distribution gap between UAV-view and satellite-view images. It combines a shared ResNet-50 feature encoder, a multi-branch location classifier, and a view discriminator. These components are trained with a progressive adversarial strategy that gradually increases the emphasis on view invariance while preserving location-discriminative power. The method achieves state-of-the-art results on the University-1652 dataset for both UAV-to-satellite and satellite-to-UAV retrieval. It also requires less inference time than competing state-of-the-art methods and generalizes to images of unseen locations. These results suggest that explicitly aligning the two view distributions is a practical way to improve cross-view image retrieval for geo-localization.
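Both evaluation directions are cross-view retrieval problems. In UAV-to-satellite retrieval, each query embedding from one view is ranked against a gallery of embeddings from the other view, and satellite-to-UAV retrieval reverses the roles. The following is a minimal, framework-free sketch that assumes the encoder has already produced fixed-length feature vectors. The function names are hypothetical. Cosine-similarity ranking and Recall@K are common choices for this kind of benchmark, not details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, query_labels, gallery, gallery_labels, k: int = 1) -> float:
    """Fraction of queries whose top-k gallery results contain the correct location."""
    hits = 0
    for query, label in zip(queries, query_labels):
        top_k = rank_gallery(query, gallery)[:k]
        if any(gallery_labels[i] == label for i in top_k):
            hits += 1
    return hits / len(queries)


# Toy example (hypothetical data): two satellite-view gallery embeddings
# and two UAV-view query embeddings, labelled by location ID.
satellite_feats = [[1.0, 0.0], [0.0, 1.0]]
satellite_ids = ["loc_a", "loc_b"]
uav_feats = [[0.9, 0.1], [0.2, 0.8]]
uav_ids = ["loc_a", "loc_b"]
print(recall_at_k(uav_feats, uav_ids, satellite_feats, satellite_ids, k=1))  # 1.0
```

Swapping the query and gallery arguments gives the satellite-to-UAV direction. With view-invariant features, the same location's embeddings from both views land close together, which is what makes a simple nearest-neighbour ranking sufficient at inference time.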
Abstract
Unmanned Aerial Vehicle (UAV) visual geo-localization aims to match images of the same geographic target captured from different views, i.e., the UAV view and the satellite view. The task is very challenging because of the large appearance differences within UAV-satellite image pairs. Previous works map images captured by UAVs and satellites to a shared feature space and employ a classification framework to learn location-dependent features, but they neglect the overall distribution shift between the UAV view and the satellite view. In this paper, we address this limitation by aligning the distributions of the two views to reduce the distance between them in the common space. Specifically, we propose an end-to-end network called PVDA (Progressive View Distribution Alignment). During training, a feature encoder, a location classifier, and a view discriminator are jointly optimized by a novel progressive adversarial learning strategy. Competition between the feature encoder and the view discriminator drives both of them to become stronger. The adversarial learning is progressively emphasized until the view discriminator can no longer distinguish UAV-view images from satellite-view images. As a result, PVDA learns location-dependent yet view-invariant features that scale well to unseen images of new locations. Compared with state-of-the-art methods, PVDA requires less inference time while achieving superior performance on the University-1652 dataset.
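The abstract does not specify how the adversarial term is ramped up. One common way to progressively emphasize an adversarial loss comes from domain-adversarial training (DANN, Ganin and Lempitsky, 2015). It uses a sigmoid-shaped weight that rises from 0 toward 1 over the course of training. The sketch below assumes that schedule and an encoder objective of the form L_cls - λ·L_view. Both the function names and the schedule are hypothetical illustrations, not PVDA's published formulation.

```python
import math


def adversarial_weight(step: int, total_steps: int, gamma: float = 10.0, max_weight: float = 1.0) -> float:
    """Hypothetical progressive weight for the view-adversarial loss.

    Assumes the DANN-style ramp lambda(p) = 2 / (1 + exp(-gamma * p)) - 1,
    where p in [0, 1] is training progress. PVDA's actual schedule may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return max_weight * (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0)


def encoder_loss(cls_loss: float, view_loss: float, weight: float) -> float:
    """Objective minimized by the (hypothetical) encoder update.

    Keeping cls_loss low preserves location-discriminative features; subtracting
    the weighted discriminator loss rewards features the view discriminator
    cannot separate (for the encoder, this yields the same gradient as a
    gradient reversal layer scaled by `weight`).
    """
    return cls_loss - weight * view_loss


# The adversarial term starts at 0 and approaches max_weight as training proceeds.
for step in (0, 250, 500, 1000):
    print(step, round(adversarial_weight(step, total_steps=1000), 4))
```

Starting the weight near zero lets the encoder first learn location-discriminative features before the view discriminator's signal dominates. The view discriminator itself would be trained separately to minimize L_view, which produces the competition the abstract describes.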
