Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization

Ming Dai, Enhui Zheng, Jiahao Chen, Lei Qi, Zhenhua Feng, Wankou Yang

TL;DR

This work tackles GPS-denied UAV self-localization by reframing it as end-to-end heatmap localization through DRL, a framework that enables learnable interaction between heterogeneous UAV and satellite features. It introduces two fusion architectures, Post-Fusion and Mix-Fusion, with Mix-Fusion showing superior performance due to deeper cross-domain feature integration. Two data-centric contributions, Random Scale Crop (RSC) and Weighted Balance Loss (WBL), along with the UL14 paired dataset, enhance robustness and training dynamics, achieving notable gains in Meter-level Accuracy and Relative Distance Score while reducing inference time and storage compared to traditional image retrieval pipelines. The approach demonstrates strong generalization across satellite scales, query distributions, and flight altitudes, offering a practical, scalable solution for real-time UAV self-localization in challenging, GPS-denied environments.

Abstract

Image retrieval (IR) has emerged as a promising approach for self-localization in unmanned aerial vehicles (UAVs). However, IR-based methods face two challenges: 1) pre- and post-processing incur significant computational and storage overhead; 2) the lack of interaction between dual-source features impairs precise spatial perception. In this paper, we propose an efficient heterogeneous spatial feature interaction method, termed Drone Referring Localization (DRL), which aims to localize UAV-view images within satellite imagery. Unlike conventional methods that process each data source in isolation and then compare them by cosine similarity, DRL facilitates the learnable interaction of heterogeneous features. To implement the proposed DRL, we design two transformer-based frameworks, Post-Fusion and Mix-Fusion, enabling end-to-end training and inference. Furthermore, we introduce random scale cropping and weight balance loss techniques to augment paired data and optimize the balance between positive and negative sample weights. Additionally, we construct a new dataset, UL14, and establish a benchmark tailored to the DRL framework. Compared to traditional IR methods, DRL achieves superior localization accuracy (MA@20 +9.4%) while significantly reducing computational time (to 1/7) and storage overhead (to 1/3). The dataset and code are available at https://github.com/Dmmm1997/DRL.
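The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the IR baseline is reduced to a nearest-neighbour search by cosine similarity, the heatmap is decoded by taking its peak, and MA@K is assumed to be the fraction of queries whose localization error is below K meters.

```python
import numpy as np


def retrieve_by_cosine(query_feat, gallery_feats):
    """IR-style baseline: index of the gallery feature most similar to the query."""
    q = np.asarray(query_feat, dtype=float)
    g = np.asarray(gallery_feats, dtype=float)
    q = q / np.linalg.norm(q)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    # Cosine similarity of the query against every gallery entry.
    return int(np.argmax(g @ q))


def decode_heatmap(heatmap):
    """Heatmap-style localization: (row, col) of the strongest response."""
    row, col = np.unravel_index(np.argmax(heatmap), np.shape(heatmap))
    return int(row), int(col)


def meter_accuracy(pred_xy_m, true_xy_m, k=20.0):
    """Assumed MA@K: share of samples with Euclidean error below k meters."""
    pred = np.asarray(pred_xy_m, dtype=float)
    true = np.asarray(true_xy_m, dtype=float)
    err = np.linalg.norm(pred - true, axis=1)
    return float(np.mean(err < k))
```

In the retrieval setting the answer is limited to positions sampled into the gallery, whereas the heatmap peak can fall on any cell of the satellite image, which is the sampling error the paper attributes to IR-based pipelines.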

Paper Structure

This paper contains 54 sections, 5 equations, 14 figures, 10 tables, and 1 algorithm.

Figures (14)

  • Figure 1: A comparison between (a) image retrieval and (b) the proposed DRL framework for UAV self-localization. In IR-based methods, the features of different satellite images are spatially isolated and do not interact with UAV image features. Additionally, this approach requires complex pre- and post-processing operations. The green dot represents the sampling position in the gallery, so the IR-based method inevitably introduces inherent errors caused by sampling. In contrast, the proposed DRL method adopts an end-to-end heterogeneous spatial feature fusion architecture, which simplifies the entire localization process and overcomes the inherent errors.
  • Figure 2: Post-Fusion is a dual-stream architecture in which UAV- and satellite-view features interact through a Feature Fusion Block. Mix-Fusion is a single-stream network in which the features interact during feature extraction.
  • Figure 3: Detailed modules of the DRL architecture (Fig. 2). (a) The Attention Fusion Block of the Mix-Fusion structure. (b) The Multi-Scale Module, which can be applied to both architectures; two types of structure are used to implement it. (c) The Feature Fusion Block of the Post-Fusion structure, which comes in 3 different types for fusing heterogeneous features.
  • Figure 4: The process of the RSC augmentation method. The red pentagon is the position of the UAV. The light blue dot is the randomly generated center position of the image. The orange and blue dotted boxes correspond to 2 cropped images of different scales.
  • Figure 5: Schematic diagram of selecting positive samples according to R.
  • ...and 9 more figures
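
The RSC process described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: the function name `random_scale_crop`, the `scale_range` parameter, and the rule that the crop must always contain the UAV position are hypothetical choices made for the example. Sampling the crop's top-left corner here is equivalent to sampling its center.

```python
import numpy as np


def random_scale_crop(sat, uav_rc, scale_range=(0.5, 1.0), rng=None):
    """Crop a randomly scaled, randomly placed window that contains the UAV.

    sat:     satellite image as an array of shape (H, W) or (H, W, C)
    uav_rc:  (row, col) pixel position of the UAV inside `sat`
    Returns the crop and the UAV position in crop coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = sat.shape[:2]
    ur, uc = uav_rc

    # Assumed: one scale factor is drawn per call and applied to both sides.
    s = rng.uniform(*scale_range)
    ch = min(h, max(1, int(round(h * s))))
    cw = min(w, max(1, int(round(w * s))))

    # Valid top-left corners keep the window inside the image
    # and keep the UAV position inside the window.
    top = int(rng.integers(max(0, ur - ch + 1), min(ur, h - ch) + 1))
    left = int(rng.integers(max(0, uc - cw + 1), min(uc, w - cw) + 1))

    crop = sat[top:top + ch, left:left + cw]
    return crop, (ur - top, uc - left)
```

Calling the function repeatedly on the same satellite image yields crops of different scales with the UAV at different relative positions, which is the kind of paired-data variation the augmentation is meant to provide.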