Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization

Tao Liu, Kan Ren, Qian Chen

TL;DR

This work tackles cross-view UAV localization in GNSS-denied environments by reframing matching as graph-based relational reasoning over object-detected regions. It combines a dual-graph representation (spatial and semantic) with a Graph Attention Network to learn UAV-to-satellite correspondences, optimized with multi-task losses covering graph-node matching, embedding similarity, and scene classification. The approach demonstrates strong cross-view and cross-modal performance on public and infrared-visible datasets, with ablations confirming the value of semantic cues, global features, and dynamic loss weighting. The practical impact lies in robust, efficient localization across time, viewpoint, and modality gaps; the authors also plan to release their infrared dataset publicly to support future research and evaluation.

Abstract

With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
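The image-retrieval formulation mentioned in the abstract can be illustrated with a minimal sketch: a UAV query embedding is ranked against a gallery of satellite embeddings with known locations. The function names (`cosine`, `retrieve`), the tile labels, and the toy vectors are all hypothetical and chosen for illustration; in the paper the embeddings would come from the learned graph network, which is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, gallery, k=1):
    """Return the k gallery entries most similar to the query embedding.

    gallery is a list of (location, embedding) pairs; `location` stands in
    for whatever georeference is attached to a satellite tile (assumption).
    """
    scored = [(loc, cosine(query, emb)) for loc, emb in gallery]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: three satellite tiles and one UAV query (illustrative values only).
gallery = [
    ("tile_A", [1.0, 0.0, 0.0]),
    ("tile_B", [0.0, 1.0, 0.0]),
    ("tile_C", [0.7, 0.7, 0.0]),
]
uav_query = [0.9, 0.1, 0.0]

top = retrieve(uav_query, gallery, k=2)
for location, score in top:
    print(location, round(score, 3))
```

The location of the best-ranked tile is then taken as the estimate for the UAV image. Retrieval quality is typically reported by whether the true tile appears among the top k results.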

Paper Structure

This paper contains 44 sections, 30 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Typical examples, from top to bottom: the publicly available virtual dataset University-1652 [r16], the publicly available real-world datasets SUES-200 [r17] and DenseUAV [r18], and our own collected real infrared and visible-light drone images (the three left columns) and satellite images (the rightmost column). Notable differences include seasonal changes in vegetation, shadow angles, building perspectives, the presence of vehicles, and viewpoint and radiometric differences caused by different imaging hardware platforms. Our objective is to overcome these discrepancies and achieve accurate matching and retrieval of drone and satellite images for precise localization.
  • Figure 2: For cross-view image matching, we apply the advanced matcher LightGlue [r44] to pairs of UAV and satellite images. The top row shows matching results between visible-light UAV and satellite images, where large differences in viewpoint and scale lead to numerous mismatches. The bottom row shows matching results between infrared and visible-light images, where matching fails completely because of substantial viewpoint differences and marked appearance disparities caused by different radiation sources.
  • Figure 3: The matching process of the proposed method consists of four main steps. (1) The input images (drone and satellite views) are processed by a Faster R-CNN or YOLOv8 detector with top-down attention to extract salient region features. (2) Spatial and semantic graphs are used to construct the drone and satellite visual graphs. (3) A Graph Neural Network (GNN) infers the latent relationships within and between graph nodes, followed by aggregation to obtain embedding representations. (4) The graph node similarity and the embedding similarity serve as objective functions to train the model. To keep inference efficient, the graph node similarity is used only during training to optimize the network and is not computed at test time.
  • Figure 4: When matching UAV and satellite images, recurring similar patterns can easily cause matching ambiguities. The target region for UAV localization is annotated with a yellow bounding box, the correctly matched satellite region with a green bounding box, and the incorrectly matched region with a red bounding box. Although the red (incorrect) and green (correct) regions may both closely resemble the target region in category (e.g., buildings, vegetation) and visual attributes (e.g., color, texture), the spatial distribution (e.g., geographic location) of the incorrect regions differs significantly from that of the true matching region. Such recurring patterns can cause the network to rely too heavily on local features while neglecting global context, so that regions with similar appearance but different spatial distributions are wrongly treated as semantically corresponding. To address this, we aim for the network to learn both semantic similarity (e.g., local feature matching) and spatial distribution differences (e.g., global context awareness) between images. This is expected to improve robustness to recurring patterns and reduce the mismatch rate.
  • Figure 5: Qualitative results of image retrieval. We present the top two retrieval results for drone-view object localization (left) and drone navigation (right), ordered from left to right by confidence score. Blue boxes indicate correct matches and red boxes indicate incorrect matches.
  • ...and 1 more figure
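Steps (2) to (4) in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the edge rules (center distance for the spatial graph, feature cosine for the semantic graph), the thresholds, the best-match aggregation in `node_similarity`, and mean pooling in place of a trained GNN are all assumptions made here, and the boxes and features are toy values rather than detector outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def center(box):
    """Center of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def spatial_edges(boxes, max_dist):
    """Assumed rule: link detections whose box centers are within max_dist."""
    edges = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xi, yi), (xj, yj) = center(boxes[i]), center(boxes[j])
            if math.hypot(xi - xj, yi - yj) <= max_dist:
                edges.add((i, j))
    return edges


def semantic_edges(feats, min_sim):
    """Assumed rule: link detections whose features have cosine >= min_sim."""
    edges = set()
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            if cosine(feats[i], feats[j]) >= min_sim:
                edges.add((i, j))
    return edges


def node_similarity(feats_a, feats_b):
    """Training-time signal (assumed form): mean best-match cosine from A to B."""
    best = [max(cosine(a, b) for b in feats_b) for a in feats_a]
    return sum(best) / len(best)


def mean_pool(feats):
    """Stand-in for GNN aggregation: average the node features."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


# Toy detections for one UAV image and one satellite image.
uav_boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (80, 80, 100, 100)]
uav_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
sat_feats = [[1.0, 0.0], [0.0, 1.0]]

spatial = spatial_edges(uav_boxes, max_dist=20.0)   # nearby objects
semantic = semantic_edges(uav_feats, min_sim=0.9)   # similar-looking objects
node_sim = node_similarity(uav_feats, sat_feats)    # used only in training
emb_sim = cosine(mean_pool(uav_feats), mean_pool(sat_feats))  # used at test time
```

In this toy example the spatial graph links the two adjacent detections while the semantic graph links the two similar-looking ones, which shows how the two graphs can encode different relations over the same nodes. Only the pooled embedding comparison would be needed at inference, consistent with the caption's note that node similarity is a training-only objective.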