A Transformer-Based Adaptive Semantic Aggregation Method for UAV Visual Geo-Localization

Shishen Li, Cuiwei Liu, Huaijun Qiu, Zhaokui Li

TL;DR

The paper tackles cross-view visual geo-localization between UAV and satellite images, where large viewpoint and scale variations make robust matching difficult. It introduces a transformer-based siamese framework augmented with Adaptive Semantic Aggregation (ASA), a soft-partition mechanism that clusters patch-level features into K semantic parts via learned anchors and patch-to-part attentions, producing global and part-level descriptors. ASA is integrated with a Vision Transformer backbone and a classifier trained with cross-entropy and triplet losses, yielding a final descriptor for cross-view retrieval. Empirical results on University-1652 show state-of-the-art performance, with ablations confirming the effectiveness of soft partitioning, the two-part setting, and the use of 256×256 inputs, highlighting ASA's potential to enhance UAV geo-localization systems.

Abstract

This paper addresses the task of Unmanned Aerial Vehicle (UAV) visual geo-localization, which aims to match images of the same geographic target taken by different platforms, i.e., UAVs and satellites. In general, the key to accurate UAV-satellite image matching lies in extracting visual features that are robust to viewpoint changes, scale variations, and rotations. Existing work has shown that part matching is crucial for UAV visual geo-localization, since part-level representations capture image details and help in understanding the semantic information of scenes. However, the importance of preserving semantic characteristics in part-level representations has not been well discussed. In this paper, we introduce a transformer-based adaptive semantic aggregation method that regards parts as the most representative semantics in an image. Correlations between image patches and different parts are learned from the transformer's feature map. Our method then expresses each part-level feature as an adaptive sum of all patch features, which encourages the learned parts to focus on patches with typical semantics. Extensive experiments on the University-1652 dataset demonstrate the superiority of our method over existing approaches.
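
The idea of writing each part-level feature as an adaptive sum of all patch features can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dot-product similarity, the softmax over patches, the `temperature` parameter, the function name `soft_aggregate`, and the toy tensor sizes are all assumptions made for illustration, and the anchors here are random rather than learned.

```python
import numpy as np

def soft_aggregate(patches, anchors, temperature=1.0):
    """Soft-partition sketch (assumed formulation, not the paper's exact one).

    patches: (N, D) patch-level features from the backbone.
    anchors: (K, D) one anchor vector per semantic part.
    Returns part features (K, D) and patch-to-part attention (K, N).
    """
    # Similarity of every anchor to every patch (assumption: dot product).
    logits = anchors @ patches.T / temperature            # (K, N)
    # Numerically stable softmax over patches, so each part's weights sum to 1.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    attn = weights / weights.sum(axis=1, keepdims=True)   # (K, N)
    # Each part feature is a weighted sum of ALL patch features.
    parts = attn @ patches                                # (K, D)
    return parts, attn

# Toy example: 256 patches, 8-dimensional features, K = 2 parts.
rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 8))
anchors = rng.normal(size=(2, 8))
parts, attn = soft_aggregate(patches, anchors)
```

Because every patch contributes to every part with a non-negative weight, no patch is forced into a single region, in contrast to a hard partition that assigns each patch to exactly one part.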

Paper Structure (14 sections, 13 equations, 4 figures, 4 tables)

This paper contains 14 sections, 13 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: A UAV-satellite image pair is shown in column (a). The square-ring partition strategy of LPN [wang2021each] is depicted in column (b). Column (c) illustrates the heat map and the part partition of image patches generated by FSRA [dai2021transformer]. Red ellipses mark patches that have similar features but are assigned to different parts. Attention maps corresponding to the two parts produced by the proposed ASA module are given in column (d).
  • Figure 2: Overall framework of our method.
  • Figure 3: Architecture of Vision Transformer (ViT).
  • Figure 4: Architecture of the classification module. In training, global and part-level features are fed into additive layers followed by classification layers. The training data are assumed to come from 701 locations, so each classification layer predicts a 701-dimensional vector. The model is optimized with the CE loss and the triplet loss. Green and purple lines point to the positive and negative samples, respectively, used for calculating the triplet loss.
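
The training objective described in the Figure 4 caption (a 701-way classification layer optimized with CE loss plus triplet loss) can be sketched roughly as follows. Only the 701-location output size comes from the caption. The Euclidean distance, the margin of 0.3, the unweighted sum of the two losses, the feature dimension, and all variable names are assumptions for illustration, and the additive layers are omitted.

```python
import numpy as np

NUM_LOCATIONS = 701  # number of training locations, from the caption

def cross_entropy(logits, label):
    """CE loss for a single sample, computed via a stable log-softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss (assumed: Euclidean distance, margin 0.3)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
feat_dim = 16  # hypothetical descriptor size for this toy example

# A single linear classification layer: feature -> 701 location scores.
W = rng.normal(scale=0.01, size=(feat_dim, NUM_LOCATIONS))
b = np.zeros(NUM_LOCATIONS)

# Toy descriptors: a UAV view, a satellite view of the same location
# (positive), and a satellite view of another location (negative).
uav_feat = rng.normal(size=feat_dim)
sat_same = uav_feat + 0.05 * rng.normal(size=feat_dim)
sat_other = rng.normal(size=feat_dim)

logits = uav_feat @ W + b
ce = cross_entropy(logits, label=42)             # 42 is an arbitrary label
tri = triplet_loss(uav_feat, sat_same, sat_other)
total = ce + tri                                 # assumed unweighted sum
```

In this sketch the classification term pulls features of the same location toward a shared class, while the triplet term directly constrains distances between UAV and satellite descriptors.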