ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Li Mi, Chang Xu, Javiera Castillo-Navarro, Syrielle Montariol, Wen Yang, Antoine Bosselut, Devis Tuia

TL;DR

ConGeo tackles cross-view geo-localization (CVGL) under unknown ground-view orientation and limited FoV by introducing a model-agnostic contrastive learning framework. It uses complementary single-view and cross-view losses to align ground-view variations with their original representations and with aerial references, enabling a single model to handle diverse ground-view configurations. Across four CVGL benchmarks and multiple base architectures, ConGeo delivers substantial improvements over orientation- or FoV-specific methods and demonstrates robustness to unseen variations. Analyses show that ConGeo reduces reliance on geometric shortcuts and emphasizes semantically consistent features, boosting practical applicability in real-world navigation and localization tasks.

Abstract

Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task requires accommodating diverse ground images captured by users with varying orientations and reduced fields of view (FoVs). However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. Such models heavily depend on the North-aligned spatial correspondence and predefined FoVs in the training data, compromising their robustness across different settings. To tackle this challenge, we propose ConGeo, a single- and cross-view Contrastive method for Geo-localization: it enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location. As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations and outperforms competing methods that train separate models for each ground view variation.
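
To make the idea of "enforcing proximity between ground view variations of the same location" more concrete, here is a minimal NumPy sketch of a single-view plus cross-view contrastive objective. It is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.1, the choice of pairs, the weights `w_single` and `w_cross`, and all function names are hypothetical, since this page does not give the exact loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair;
    all other rows in the batch act as negatives."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def congeo_style_loss(f_q, f_q_star, f_r, w_single=1.0, w_cross=1.0):
    """f_q: features of North-aligned ground images,
    f_q_star: features of transformed ground images (shifted / cropped),
    f_r: features of aerial reference images. All have shape (batch, dim).
    The pairing and weights below are assumptions made for illustration."""
    single_view = info_nce(f_q_star, f_q)                      # variation vs. original
    cross_view = info_nce(f_q, f_r) + info_nce(f_q_star, f_r)  # ground vs. aerial
    return w_single * single_view + w_cross * cross_view
```

With matched rows the loss is close to zero, and it grows when the ground and aerial features of different locations are paired, which is the behavior a retrieval model is trained to exploit.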

Paper Structure (41 sections, 6 equations, 11 figures, 19 tables)

Figures (11)

  • Figure 1: ConGeo boosts the robustness across ground view variations: North-aligned, unknown orientation (FoV=360$^{\circ}$) and limited fields of view (FoV=70$^{\circ}$, 90$^{\circ}$, and 180$^{\circ}$). We compare with SEH [guo2022softSEH], DSM [jointloc_2020_cvpr] and SAIG-D [zhu2023simplesaigd] and report Top-1 Recall on the CVUSA [cvusa_cvpr_2015] dataset, one of the geo-localization benchmarks.
  • Figure 2: ConGeo's learning pipeline. For feature representation in the left and right boxes, the North-aligned ground image ($I_q$), the transformed ground image ($I^{*}_q$), and the aerial view ($I_r$) are sent to their respective encoders. Then in the feature space, the single- and cross-view contrastive learning losses are applied to enforce the proximity of the paired images.
  • Figure 2: Comparison of the North-aligned setting on CVUSA and CVACT datasets. The second-best performance is underlined. "-" means the score is not provided in the original paper.
  • Figure 3: Examples of the top-4 retrieved images from ConGeo and the baseline when FoV=90$^{\circ}$. Images in the orange box denote the correct results.
  • Figure 4: ConGeo shows better orientation invariance. We cyclically shift the ground view by an angle (x-axis) and feed it to the model to test its retrieval performance. Note that "N-A" denotes the North-aligned setting and "DA" means data augmentation.
  • ...and 6 more figures
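
The ground view variations named in the captions above (a cyclic shift for unknown orientation, and a limited FoV such as 70$^{\circ}$, 90$^{\circ}$ or 180$^{\circ}$) can be sketched on a panorama array as follows. This is a hedged illustration: it assumes the ground image is a 360$^{\circ}$ panorama stored as an `(H, W, C)` array whose width spans the full horizon, and the function names, shift direction and `start_deg` parameter are hypothetical choices, not taken from the paper.

```python
import numpy as np

def shift_panorama(pano, angle_deg):
    """Cyclically shift a 360-degree panorama (H, W, C) along its width.
    The sign convention (positive angle moves content to the right) is an assumption."""
    width = pano.shape[1]
    shift_px = int(round(angle_deg / 360.0 * width)) % width
    return np.roll(pano, shift_px, axis=1)

def crop_fov(pano, fov_deg, start_deg=0.0):
    """Keep a horizontal window of `fov_deg` degrees starting at `start_deg`,
    wrapping around the panorama seam when needed."""
    width = pano.shape[1]
    crop_w = int(round(fov_deg / 360.0 * width))
    start_px = int(round(start_deg / 360.0 * width)) % width
    cols = (start_px + np.arange(crop_w)) % width
    return pano[:, cols]

# Toy panorama whose pixel value equals its column index (one column per degree).
pano = np.tile(np.arange(360)[None, :, None], (4, 1, 3))
unknown_orientation = shift_panorama(pano, 90)       # FoV stays 360 degrees
limited_fov = crop_fov(pano, 90, start_deg=315)      # FoV = 90 degrees, crosses the seam
```

Sampling `angle_deg` and `start_deg` at random during training is one plausible way to produce the transformed ground image $I^{*}_q$ from the North-aligned image $I_q$; the actual sampling scheme is not specified on this page.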