Style Alignment based Dynamic Observation Method for UAV-View Geo-localization

Jie Shao, LingHao Jiang

TL;DR

The paper tackles UAV-view geo-localization by narrowing the drone-to-satellite gap with a style alignment strategy (SAS) applied as preprocessing and a dynamic observation module built on a hierarchical attention block (HAB) with dual square-ring partitioning. A deconstruction (Dc) loss complements the standard classification losses, suppressing correlations between features of different geo-tags and tightening intra-class clusters. Results on University-1652 and SUES-200 show state-of-the-art performance with a compact model, and ablations confirm the contribution of SAS, HAB, GeM pooling, and the Dc loss. The approach is lightweight and shows robust cross-domain transfer, which makes it practical for real-world UAV localization and retrieval tasks.
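
For readers unfamiliar with the GeM pooling mentioned above, the following is a minimal NumPy sketch of generalized-mean pooling over a feature map. The function name `gem_pool`, the default exponent `p = 3`, and the `(C, H, W)` layout are assumptions made for illustration; this summary does not state the settings used in the paper.

```python
import numpy as np

def gem_pool(feature_map: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean (GeM) pooling over the spatial axes of a (C, H, W) array.

    p = 1 gives average pooling; larger p moves the result towards max pooling.
    The default p = 3 is an assumed, commonly used value, not taken from the paper.
    """
    # Clamp to a small positive value so the fractional power is well defined.
    clamped = np.clip(feature_map, eps, None)
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)

# Toy single-channel feature map with values 1..4.
features = np.array([[[1.0, 2.0], [3.0, 4.0]]])
avg_like = gem_pool(features, p=1.0)   # identical to average pooling
max_like = gem_pool(features, p=50.0)  # close to, but below, the maximum
```

The single exponent `p` lets one pooling layer interpolate between average and max pooling, which is why it is often made a learnable parameter in retrieval models.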

Abstract

The task of UAV-view geo-localization is to estimate the location of a query satellite/drone image by matching it against a reference dataset consisting of drone/satellite images. Though tremendous strides have been made in feature alignment between satellite and drone views, large inter-class and intra-class differences caused by changes in viewpoint, altitude, and lighting remain a major challenge. In this paper, a style alignment based dynamic observation method for UAV-view geo-localization is proposed to meet these challenges from two perspectives: visual style transformation and surrounding noise control. Specifically, we introduce a style alignment strategy to transform the diverse visual styles of drone-view images into a unified satellite-image visual style. Then a dynamic observation module is designed to evaluate the spatial distribution of images by mimicking human observation habits. It features a hierarchical attention block (HAB) with a dual-square-ring stream structure, which reduces surrounding noise and geographical deformation. In addition, we propose a deconstruction loss that pushes apart features of different geo-tags and squeezes knowledge from unmatched images through correlation calculation. The experimental results demonstrate the state-of-the-art performance of our model on benchmark datasets. In particular, on University-1652 our results surpass the best prior method (FSRA) while requiring only half as many parameters. Code will be released at https://github.com/Xcco1/SA\_DOM
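
To make the idea of mapping drone-view colors onto a satellite-like style concrete, here is a small sketch based on classical per-channel histogram matching. This is a stand-in technique chosen for illustration, not a description of the paper's SAS, whose details are not given in this summary; the names `channel_lut` and `match_style` and the random test images are hypothetical.

```python
import numpy as np

def channel_lut(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Build a 256-entry lookup table mapping source levels to reference levels.

    Both inputs are single-channel uint8 arrays. Histogram matching is an
    assumed stand-in here, not the paper's confirmed method.
    """
    src_cdf = np.cumsum(np.bincount(source.ravel(), minlength=256)) / source.size
    ref_cdf = np.cumsum(np.bincount(reference.ravel(), minlength=256)) / reference.size
    # For each source level, take the first reference level whose CDF reaches it.
    lut = np.searchsorted(ref_cdf, src_cdf, side="left")
    return np.clip(lut, 0, 255).astype(np.uint8)

def match_style(drone: np.ndarray, satellite: np.ndarray) -> np.ndarray:
    """Apply one lookup table per RGB channel to an (H, W, 3) uint8 drone image."""
    out = np.empty_like(drone)
    for c in range(3):
        lut = channel_lut(drone[..., c], satellite[..., c])
        out[..., c] = lut[drone[..., c]]
    return out

# Synthetic stand-ins: a dark "drone" image and a bright "satellite" image.
rng = np.random.default_rng(0)
drone = rng.integers(0, 100, size=(8, 8, 3), dtype=np.uint8)
satellite = rng.integers(100, 256, size=(8, 8, 3), dtype=np.uint8)
aligned = match_style(drone, satellite)
```

Each lookup table is a monotone curve over the 256 intensity levels, so the three tables together can be plotted as one curve per RGB channel.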

Paper Structure (19 sections, 8 equations, 13 figures, 10 tables)

This paper contains 19 sections, 8 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: The framework of the proposed method. It is composed of part A, part B, and the classifier. Part A is the preprocessing stage, which aligns the visual styles of the two views using SAS. Satellite-view images and preprocessed drone-view images are then fed into part B through two separate network branches. The two branches have the same structure and share weights.
  • Figure 2: Examples of SAS results. Results on University-1652 are on the left and results on SUES-200 are on the right. (a) and (d) are the original drone-view images; (b) and (e) are the SAS outputs, whose visual style is similar to that of the satellite images; (c) and (f) are the satellite-view images.
  • Figure 3: The transformation map for the whole training datasets. The three curves represent the R, G, and B channels respectively.
  • Figure 4: Illustration of the four square-ring partition (top) and our dual square-ring partition (bottom) on an example pair of images. In both cases, the satellite-view image is shown in the first column and the drone-view image in the second. In the top row, the building in the red box is split into two parts in the satellite view but falls within a single part in the drone view, which would cause the model to mismatch them.
  • Figure 5: The architecture of the hierarchical attention block (HAB), which is embedded in a dual-square-ring stream network.
  • ...and 8 more figures
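
The square-ring partitions contrasted in the Figure 4 caption can be sketched as follows. This is a hedged illustration only: the helper names `square_ring_masks` and `pool_parts`, the equal-width ring boundaries, and the use of plain average pooling per part are assumptions, since the captions do not specify how the paper places the boundaries or pools the parts.

```python
import numpy as np

def square_ring_masks(height: int, width: int, num_rings: int) -> np.ndarray:
    """Return (num_rings, H, W) boolean masks; index 0 is the innermost square.

    Ring boundaries are equally spaced, which is an assumption for illustration.
    """
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    # Normalized Chebyshev distance from the image center (0 center, ~1 border).
    dist = np.maximum(np.abs(rows - (height - 1) / 2) / (height / 2),
                      np.abs(cols - (width - 1) / 2) / (width / 2))
    ring_index = np.minimum((dist * num_rings).astype(int), num_rings - 1)
    return np.stack([ring_index == k for k in range(num_rings)])

def pool_parts(feature_map: np.ndarray, num_rings: int = 2) -> np.ndarray:
    """Average-pool a (C, H, W) feature map inside each ring, giving (num_rings, C)."""
    _, h, w = feature_map.shape
    masks = square_ring_masks(h, w, num_rings)
    return np.stack([feature_map[:, m].mean(axis=1) for m in masks])

feature_map = np.arange(64, dtype=float).reshape(1, 8, 8)
dual_masks = square_ring_masks(8, 8, num_rings=2)  # center square + outer ring
four_masks = square_ring_masks(8, 8, num_rings=4)  # four nested rings
parts = pool_parts(feature_map, num_rings=2)
```

With only two parts, each region covers a wider band of the image, so an object near a ring boundary is less likely to be split differently in the two views than with four thinner rings.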