Simple, Effective and General: A New Backbone for Cross-view Image Geo-localization

Yingying Zhu; Hongji Yang; Yuxin Lu; Qiang Huang

TL;DR

SAIG is a simple, general backbone for cross-view image geo-localization that avoids heavy feature-aggregation modules. It combines an overlapping convolutional stem with multi-head self-attention to model global relations among patches, and it is paired with a lightweight Spatial-Mixed feature aggregation moDule (SMD) that fuses spatial cues. It achieves state-of-the-art or competitive results across multiple benchmarks with far fewer parameters, and it generalizes to image retrieval tasks, as supported by ablations and visualizations. Together, SAIG and SMD offer a practical, scalable solution for cross-view localization in real-world settings.

Abstract

In this work, we address an important but less explored problem: designing a simple yet effective backbone specifically for the cross-view geo-localization task. Existing methods for cross-view geo-localization are frequently characterized by 1) complicated methodologies, 2) GPU-consuming computations, and 3) a stringent assumption that aerial and ground images are center-aligned or orientation-aligned. To address these three challenges for cross-view image matching, we propose a new backbone network, named Simple Attention-based Image Geo-localization network (SAIG). The proposed SAIG effectively represents long-range interactions among patches, as well as cross-view correspondence, with multi-head self-attention layers. The "narrow-deep" architecture of SAIG improves feature richness without degrading performance, while its shallow and effective convolutional stem preserves locality, eliminating the loss of boundary information caused by patchifying. SAIG achieves state-of-the-art results on cross-view geo-localization while being far simpler than previous works. Furthermore, with only 15.9% of the model parameters and half of the output dimension compared to the state-of-the-art, SAIG adapts well across multiple cross-view datasets without employing any well-designed feature aggregation modules or feature alignment algorithms. In addition, SAIG attains competitive scores on image retrieval benchmarks, further demonstrating its generalizability. As a backbone network, SAIG is both easy to follow and computationally lightweight, which is meaningful in practical scenarios. Moreover, we propose a simple Spatial-Mixed feature aggregation moDule (SMD) that can mix and project spatial information into a low-dimensional space to generate feature descriptors... (The code is available at https://github.com/yanghongji2007/SAIG)
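The self-attention layer described in the abstract and in the Figure 2 caption (layer norm, a self-attention module, and a linear projection over patch tokens, followed by global average pooling) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the residual connection, the random weights, the omission of learned layer-norm parameters, and all names and sizes (`sa_layer`, `w_qkv`, `w_proj`, 16 patches, 32 channels, 4 heads) are hypothetical. The convolutional stem is omitted, and random patch tokens stand in for its output.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each patch token over its channel dimension
    # (learned scale and shift are left out of this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sa_layer(tokens, w_qkv, w_proj, num_heads):
    """One self-attention layer over (num_patches, channels) tokens."""
    n, c = tokens.shape
    head_dim = c // num_heads
    x = layer_norm(tokens)
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (n, c)

    def to_heads(t):
        # (n, c) -> (num_heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    # Every patch attends to every other patch: (num_heads, n, n).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    mixed = (attn @ v).transpose(1, 0, 2).reshape(n, c)
    # Linear projection; the residual connection is an assumption.
    return tokens + mixed @ w_proj


rng = np.random.default_rng(0)
num_patches, channels, heads = 16, 32, 4
tokens = rng.standard_normal((num_patches, channels))  # stand-in for stem output
w_qkv = rng.standard_normal((channels, 3 * channels)) * 0.02
w_proj = rng.standard_normal((channels, channels)) * 0.02

out = sa_layer(tokens, w_qkv, w_proj, heads)
descriptor = out.mean(axis=0)  # global average pooling over patches
```

Because the attention matrix is computed over all patch pairs, each output token can draw on information from the whole image, which is the long-range interaction the abstract refers to.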

Paper Structure (26 sections, 10 equations, 9 figures, 11 tables)

Introduction
Related Work
CNN-based method
Transformer-based method
Feature Aggregation
Retrieval loss
Methodology
Backbone Overview
Conv Stem
Multi-head Self-attention
Spatial-mixed Feature Aggregation Module
SAIG Variants
Loss Function
Experiment
Dataset and Metric
...and 11 more sections

Figures (9)

Figure 1: Illustration of center-aligned image pairs. Existing studies mainly rely on a strong assumption that the query ground image must be exactly centered at the location of the aerial image. Thus, for one-to-one matching, the red border is considered as a correct match, while the green border is an incorrect match. In contrast, for one-to-many matching, both the red and green borders are correct matches.
Figure 2: Overall structure of SAIG. The network uses a Siamese-like architecture (without weight sharing) to extract features from the two views. The convolutional stem captures low-level features of each input and then projects each pixel to obtain the "$Patch\times Channel$" patch-based representation. These patches are then fed into the stacked SA layers and finally processed by global average pooling. Bottom left: A convolutional stem contains six layers of $3\times3$ convolution with batch normalization and ReLU non-linearity. Bottom right: An SA layer contains layer norm, a self-attention module, and a linear projection, building the global relationship among patches.
Figure 3: Spatial-mixed feature aggregation module
Figure 4: Visualization of the training curves (r@1) on (a) CVUSA, (b) CVACT and (c) VIGOR. The blue lines show the curves of the SAIG-D backbone trained with GAP, and the red lines show the curves of the SAIG-D network trained with our SMD. Compared to GAP, our SMD significantly improves the performance and saturation rate of SAIG.
Figure 5: Cross-view geo-localization accuracy versus the value of K in SMD, trained with SAIG-D. Note the log scale of the x-axis.
...and 4 more figures
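The r@1 metric plotted in Figure 4 is a recall@K score. A minimal sketch of how such a score is commonly computed for one-to-one retrieval is given below. It assumes the standard definition (a query counts as a hit if its single ground-truth reference appears among the K highest-scoring references); the function name `recall_at_k` and the toy similarity matrix are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose true reference is in the top-k results.

    similarity[i][j] is the score between query i and reference j;
    ground_truth[i] is the index of the correct reference for query i.
    """
    hits = 0
    for scores, gt in zip(similarity, ground_truth):
        # Rank reference indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy example: 3 ground-view queries against 3 aerial references.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct reference 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct reference 1 is ranked second
    [0.5, 0.6, 0.7],  # query 2: correct reference 2 is ranked first
]
ground_truth = [0, 1, 2]

r1 = recall_at_k(similarity, ground_truth, 1)  # 2 of 3 queries hit
r2 = recall_at_k(similarity, ground_truth, 2)  # all 3 queries hit
```

For the one-to-many setting described in the Figure 1 caption, `ground_truth[i]` would need to hold a set of acceptable references rather than a single index.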
