Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang
TL;DR
This work tackles GPS-denied UAV self-localization by reframing it as end-to-end heatmap localization. The proposed framework, DRL, enables learnable interaction between heterogeneous UAV and satellite features. It introduces two fusion architectures, Post-Fusion and Mix-Fusion; Mix-Fusion performs better, which the authors attribute to deeper cross-domain feature integration. Two training techniques, Random Scale Crop (RSC) and Weighted Balance Loss (WBL), together with the UL14 paired dataset, improve robustness and training dynamics. The result is higher Meter-level Accuracy and Relative Distance Score with lower inference time and storage than traditional image retrieval pipelines. The approach generalizes across satellite scales, query distributions, and flight altitudes, making it a practical option for real-time UAV self-localization in GPS-denied environments.
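To make the Meter-level Accuracy metric concrete, the following is a minimal sketch under stated assumptions: MA@K is taken to be the fraction of queries whose predicted position lies within K meters of the ground truth, and pixel errors are converted to meters with a single ground-resolution factor. The function name `meter_level_accuracy`, the parameter `meters_per_pixel`, and the inclusive threshold are illustrative choices, not taken from the paper or its code.

```python
import math


def meter_level_accuracy(pred_xy, true_xy, meters_per_pixel, k_meters):
    """Fraction of queries localized within k_meters of the ground truth.

    Assumption: positions are pixel coordinates on the satellite image and
    one pixel covers `meters_per_pixel` meters in both directions.
    """
    if len(pred_xy) != len(true_xy):
        raise ValueError("pred_xy and true_xy must have the same length")
    if not pred_xy:
        raise ValueError("at least one query is required")

    hits = 0
    for (px, py), (tx, ty) in zip(pred_xy, true_xy):
        # Euclidean pixel error converted to meters.
        error_m = math.hypot(px - tx, py - ty) * meters_per_pixel
        if error_m <= k_meters:  # inclusive threshold is an assumption
            hits += 1
    return hits / len(pred_xy)


# Three hypothetical queries with errors of 10 m, 0 m, and 100 m.
preds = [(10, 10), (50, 50), (0, 0)]
truths = [(10, 20), (50, 50), (100, 0)]
ma20 = meter_level_accuracy(preds, truths, meters_per_pixel=1.0, k_meters=20)
```

Under this definition, two of the three hypothetical queries fall within 20 m, so `ma20` is 2/3.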
Abstract
Image retrieval (IR) has emerged as a promising approach for self-localization in unmanned aerial vehicles (UAVs). However, IR-based methods face several challenges: 1) pre- and post-processing incur significant computational and storage overhead; 2) the lack of interaction between dual-source features impairs precise spatial perception. In this paper, we propose an efficient heterogeneous spatial feature interaction method, termed Drone Referring Localization (DRL), which aims to localize UAV-view images within satellite imagery. Unlike conventional methods, which process each data source in isolation and then compare the resulting features by cosine similarity, DRL facilitates the learnable interaction of heterogeneous features. To implement the proposed DRL, we design two transformer-based frameworks, Post-Fusion and Mix-Fusion, enabling end-to-end training and inference. Furthermore, we introduce random scale cropping and weight balance loss techniques to augment paired data and optimize the balance between positive and negative sample weights. Additionally, we construct a new dataset, UL14, and establish a benchmark tailored to the DRL framework. Compared to traditional IR methods, DRL achieves superior localization accuracy (MA@20 +9.4\%) while significantly reducing computational time (1/7) and storage overhead (1/3). The dataset and code are available at \url{https://github.com/Dmmm1997/DRL}.
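The random scale cropping idea can be illustrated with a small sketch. It assumes that each training pair consists of a satellite image and the pixel position of the UAV within it, and that augmentation picks a crop window of random size that still contains that position, then re-expresses the target in normalized crop coordinates. The function `random_scale_crop`, its parameters, and the scale range are hypothetical; the paper's actual sampling scheme may differ.

```python
import random


def random_scale_crop(img_w, img_h, target_xy, min_frac=0.5, max_frac=1.0, rng=None):
    """Pick a random-scale crop window that contains the target pixel.

    Returns ((left, top, crop_w, crop_h), (u, v)), where (u, v) is the target
    position normalized to [0, 1) inside the crop. Hypothetical helper: the
    scale range and uniform sampling are assumptions.
    """
    rng = rng or random.Random()
    tx, ty = target_xy
    if not (0 <= tx < img_w and 0 <= ty < img_h):
        raise ValueError("target must lie inside the image")
    if not (0 < min_frac <= max_frac <= 1.0):
        raise ValueError("require 0 < min_frac <= max_frac <= 1")

    # Random scale: the crop covers between min_frac and max_frac of each side.
    frac = rng.uniform(min_frac, max_frac)
    crop_w = max(1, round(img_w * frac))
    crop_h = max(1, round(img_h * frac))

    # Random offset, constrained so the window stays inside the image and
    # still contains the target pixel.
    left = rng.randint(max(0, tx - crop_w + 1), min(tx, img_w - crop_w))
    top = rng.randint(max(0, ty - crop_h + 1), min(ty, img_h - crop_h))

    # Target position relative to the crop, normalized by the crop size.
    u = (tx - left) / crop_w
    v = (ty - top) / crop_h
    return (left, top, crop_w, crop_h), (u, v)


# Example: a 400x400 satellite tile with the UAV at pixel (120, 300).
window, target = random_scale_crop(400, 400, (120, 300), rng=random.Random(0))
```

Because the window is sampled around the target rather than centered on it, the target lands at varying positions and apparent scales across training samples, which is the kind of variation this augmentation is meant to provide.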
