GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Peifeng Ma, Xue Yang, Hongsheng Li

TL;DR

This work tackles cross-view geo-localization by unifying multiple views and modalities through satellite-aligned representations (GLEAM-C) and introducing an explainable cross-view reasoning benchmark (GLEAM-X). A two-phase training regime and distributed training yield competitive accuracy and substantial efficiency gains, while a bilingual, explainable dataset enables training and evaluation of interpretable correspondences via multimodal language models. The integrated GLEAM pipeline combines robust retrieval with human-friendly explanations, boosting robustness and transparency for real-world geo-localization on platforms from drones to vehicles. By marrying cross-view matching with explainable reasoning, the paper advances practical, interpretable CVGL with potential impact on autonomous navigation and disaster response.

Abstract

Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities by aligning them exclusively with satellite imagery. Our framework improves training efficiency through optimized implementation and achieves accuracy comparable to prior modality-specific CVGL models via a novel two-phase training strategy. To address interpretability, we further propose GLEAM-X, a novel task that combines cross-view correspondence prediction with explainable reasoning enabled by multimodal large language models (MLLMs). We construct a bilingual benchmark using commercial MLLMs to generate training and testing data, and refine the test set through rigorous human revision for systematic evaluation of explainable cross-view reasoning. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.

Paper Structure

This paper contains 40 sections, 5 equations, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of GLEAM-Core and GLEAM-eXplain. (A) GLEAM-C: a novel foundation CVGL model trained to align multiple views and modalities with satellite imagery, including UAV imagery, street maps, panoramic images, and ground photos. (B) GLEAM-X: a novel benchmark combining cross-view correspondence prediction with explainable reasoning. We illustrate a representative example between query (street map) images and satellite images.
  • Figure 2: Method overview of GLEAM-C and GLEAM-X. (A) GLEAM-C: We design a novel two-phase contrastive learning paradigm to train the CVGL model across UAV, street map, panoramic, and ground photographs. (B) GLEAM-X: This component formulates a novel multi-image reasoning task in CVGL. The MLLM receives a query image, a reference image, and a natural language instruction. Through fine-tuning, it delivers both a matching prediction and an interpretable textual explanation.
  • Figure A.1: Street map and ground photograph samples.
  • Figure A.2: Panoramic view and UAV imagery samples.
  • Figure A.3: Sample on the VIGOR test set (Chinese scenario). The gray English text is a direct translation of the Chinese response.
  • ...and 1 more figure
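
The Figure 2 caption describes GLEAM-C as contrastive learning that aligns query views (UAV, street map, panoramic, ground) with satellite imagery. As a rough illustration of what such an alignment objective can look like, the sketch below computes a symmetric InfoNCE loss over a batch of matched query/satellite embeddings. It is a generic formulation, not the paper's actual loss or its two-phase schedule; the function name `symmetric_info_nce`, the temperature value, and the random embeddings are all assumptions made for the example.

```python
import numpy as np


def symmetric_info_nce(query_emb, sat_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's code).

    Row i of `query_emb` and row i of `sat_emb` are assumed to depict the
    same location; every other row in the batch acts as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    s = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = (q @ s.T) / temperature

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The target class for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the query->satellite and satellite->query directions.
    loss_q2s = cross_entropy_on_diagonal(logits)
    loss_s2q = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (loss_q2s + loss_s2q)


# Hypothetical data: 8 locations with 16-dimensional embeddings.
rng = np.random.default_rng(0)
sat = rng.normal(size=(8, 16))
query = sat + 0.01 * rng.normal(size=(8, 16))  # queries close to their match

loss_matched = symmetric_info_nce(query, sat)
loss_mismatched = symmetric_info_nce(query, sat[::-1])  # pairs deliberately broken
```

In this toy setup the correctly paired batch yields a lower loss than the deliberately mismatched one, which is the behaviour a contrastive objective relies on to pull each query toward its satellite reference.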