VAGeo: View-specific Attention for Cross-View Object Geo-Localization

Zhongyang Li, Xin Yuan, Wei Liu, Xin Xu

TL;DR

This work tackles cross-view object geo-localization (CVOGL), where sharp viewpoint differences between ground/drone queries and satellite references hinder precise object localization. It introduces VAGeo, a two-branch system incorporating view-specific positional encoding (VSPE) for object-level cueing and a channel-spatial hybrid attention (CSHA) module for discriminative feature learning. On the CVOGL benchmark, ground-view acc@0.25/acc@0.5 rises from 45.43%/42.24% to 48.21%/45.22% (+2.78/+2.98 points) and drone-view acc@0.25/acc@0.5 rises from 61.97%/57.66% to 66.19%/61.87% (+4.22/+4.21 points). The results underscore the value of viewpoint-aware encoding and combined channel and spatial attention for precise cross-view, object-level geo-localization.
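The acc@0.25 and acc@0.5 figures are threshold accuracies. As a minimal sketch, assuming they follow the usual detection convention (a prediction counts as correct when its intersection-over-union with the ground-truth box meets the threshold), they could be computed as follows. The helper names `iou` and `acc_at`, the `(x1, y1, x2, y2)` box format, the `>=` comparison, and the toy boxes are illustrative assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, gts, threshold):
    """Fraction of predictions whose IoU with the ground truth is >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy boxes, made up for illustration (not from the CVOGL dataset).
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 30), (20, 20, 30, 30)]

print(f"acc@0.25 = {acc_at(preds, gts, 0.25):.3f}")
print(f"acc@0.5  = {acc_at(preds, gts, 0.5):.3f}")
```

On this toy data the three IoU values are 1, 1/3, and 0, so two of the three predictions clear the 0.25 threshold and only one clears 0.5.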

Abstract

Cross-view object geo-localization (CVOGL) aims to locate, within a satellite image, an object of interest captured in a ground- or drone-view image. However, existing works treat ground-view and drone-view query images equivalently, overlooking their inherent viewpoint discrepancies and the spatial correlation between the query image and the satellite-view reference image. To this end, this paper proposes a novel View-specific Attention Geo-localization method (VAGeo) for accurate CVOGL. Specifically, VAGeo contains two key modules: a view-specific positional encoding (VSPE) module and a channel-spatial hybrid attention (CSHA) module. At the object level, the VSPE module applies positional encodings designed around the viewpoint characteristics of ground and drone query images to identify the click-point object in the query image more accurately. At the feature level, the CSHA module introduces a hybrid attention that combines channel attention and spatial attention simultaneously to learn discriminative features. Extensive experimental results demonstrate that VAGeo achieves a significant performance improvement, raising acc@0.25/acc@0.5 on the CVOGL dataset from 45.43%/42.24% to 48.21%/45.22% for ground-view queries, and from 61.97%/57.66% to 66.19%/61.87% for drone-view queries.
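To make the idea of combining channel and spatial attention concrete, here is a minimal NumPy sketch. It is not the paper's CSHA module, whose exact layers are not given in this summary. It assumes a squeeze-and-excitation style channel gate and a pooling-based spatial gate, both computed from the same input and multiplied together. The function names, the weight shapes `w1` and `w2`, and the reduction to 2 hidden units are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel gate for feat of shape (C, H, W), returned as (C, 1, 1)."""
    squeezed = feat.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)    # ReLU bottleneck -> (C // r,)
    return sigmoid(w2 @ hidden)[:, None, None]


def spatial_attention(feat):
    """Per-location gate of shape (1, H, W); a learned conv is omitted here."""
    pooled = feat.mean(axis=0) + feat.max(axis=0)  # pool over channels -> (H, W)
    return sigmoid(pooled)[None, :, :]


def hybrid_attention(feat, w1, w2):
    """Apply both gates to the same input (an assumed parallel combination)."""
    return feat * channel_attention(feat, w1, w2) * spatial_attention(feat)


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))   # toy feature map: 8 channels, 4x4
w1 = rng.standard_normal((2, 8))        # hypothetical bottleneck weights
w2 = rng.standard_normal((8, 2))
out = hybrid_attention(feat, w1, w2)
print(out.shape)
```

Because both gates lie between 0 and 1, the output keeps the input's shape and can only attenuate each activation. A trained module would learn which channels and locations to preserve.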

Paper Structure (9 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: (a) Difference between the activation maps generated by DetGeo (Sun et al., 2023) and our method for ground- and drone-views. (b) The spatial correlation of ground- and drone-views resembles that of satellite images, allowing the surroundings to serve as discriminative knowledge. Red click points denote target objects in the query image, red boxes indicate target objects in the reference image, and colored triangles represent potential positive contextual information. Best viewed in color.
  • Figure 2: Overall architecture of our proposed VAGeo.
  • Figure 3: The details of our method: (a) VSPE for ground-view, (b) VSPE for drone-view, and (c) CSHA.
  • Figure 4: Ablation study of $\sigma$ in VSPE for ground-view.
  • Figure 5: Visualization of heatmaps for ground- and drone-views. (a) Baseline heatmap. (b) Ours with VSPE heatmap. (c) Ours with VSPE and CSHA heatmap.
  • ...and 1 more figures
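
Figure 4 studies a parameter $\sigma$ in the ground-view VSPE, but this summary does not define it. As a hedged illustration only, the sketch below assumes $\sigma$ is the standard deviation of a Gaussian centered on the click point, which is one common way to encode a clicked location as a map. The function name `gaussian_click_map` and its arguments are hypothetical, and the paper's actual encoding may differ.

```python
import math


def gaussian_click_map(height, width, click_xy, sigma):
    """Return an H x W map that peaks at the click point and decays with distance.

    Assumption: sigma is the Gaussian standard deviation in pixels.
    """
    cx, cy = click_xy
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


narrow = gaussian_click_map(5, 5, click_xy=(2, 2), sigma=1.0)
wide = gaussian_click_map(5, 5, click_xy=(2, 2), sigma=3.0)

# A larger sigma spreads the encoding over more of the click's surroundings.
print(round(narrow[2][3], 3), round(wide[2][3], 3))
```

Under this assumption, a small $\sigma$ focuses the encoding tightly on the clicked pixel, while a large $\sigma$ also weights nearby context. An ablation over $\sigma$ would probe that trade-off.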