GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu; Zhi Zheng; Yi Wan; Yongxiang Yao; Annan Wang; Renrui Zhang; Panwang Xia; Qiong Wu; Qingyun Li; Weifeng Lin; Xiangyu Zhao; Peifeng Ma; Xue Yang; Hongsheng Li

TL;DR

This work tackles cross-view geo-localization (CVGL) by unifying multiple views and modalities through satellite-aligned representations (GLEAM-C) and introducing an explainable cross-view reasoning benchmark (GLEAM-X). A two-phase training regime and distributed training yield competitive accuracy and substantial efficiency gains, while a bilingual dataset with explanations supports training and evaluating multimodal language models on interpretable correspondences. The integrated GLEAM pipeline combines robust retrieval with human-readable explanations, improving robustness and transparency for real-world geo-localization on platforms ranging from drones to vehicles. By pairing cross-view matching with explainable reasoning, the paper advances practical, interpretable CVGL with potential impact on autonomous navigation and disaster response.

Abstract

Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities by aligning them exclusively with satellite imagery. Our framework improves training efficiency through optimized implementation and achieves accuracy comparable to prior modality-specific CVGL models via a novel two-phase training strategy. To address interpretability, we further propose GLEAM-X, a novel task that combines cross-view correspondence prediction with explainable reasoning enabled by multimodal large language models (MLLMs). We construct a bilingual benchmark using commercial MLLMs to generate training and testing data, and refine the test set through rigorous human revision for systematic evaluation of explainable cross-view reasoning. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.
