GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Ling Li, Yu Ye, Yao Zhou, Bingchuan Jiang, Wei Zeng

TL;DR

This paper tackles street-view geo-localization by addressing the data-quality and reasoning gaps in LVLMs. It introduces locatability as a quantitative filter to curate a high-quality training set (over 70,000 Google Street View images) and leverages external human-inference knowledge from geo-localization games to train GeoReasoner, an LVLM fine-tuned with LoRA in two stages: reasoning tuning and location tuning. GeoReasoner achieves substantial gains over LVLM baselines (more than 25% at country level and 38% at city level), surpasses StreetCLIP while requiring fewer training resources, and generalizes well to open datasets. The work advances geo-localization by fusing LVLMs with human reasoning and data curation, offering improved interpretability and practical value for navigation and urban analysis.

Abstract

This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree to which street-view images are locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
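As a rough illustration of the CLIP-based visual-text pairing idea behind the locatability metric, the sketch below scores an image by how much probability mass CLIP assigns to prompts describing strong visual clues (signboards, landmarks, architectural styles) versus prompts describing featureless scenes. This is a hypothetical simplification for intuition only: the prompts, checkpoint, and threshold are assumptions, and the paper's actual locatability network is a trained component rather than a zero-shot score.

```python
# Hypothetical sketch: estimate how "locatable" a street-view image is by
# pairing it with text prompts describing strong vs. weak visual clues.
# This is NOT the paper's trained network; it only illustrates the idea of
# CLIP-based visual-text pairing for a locatability score.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed example prompts, not the paper's exact text entities.
positive_prompts = [
    "a street view with readable signboards and shop names",
    "a street view with a distinctive landmark",
    "a street view with a characteristic architectural style",
]
negative_prompts = [
    "an empty highway with no distinctive features",
    "a plain road surrounded by generic vegetation",
]

def locatability_score(image_path: str) -> float:
    """Return a rough locatability score in [0, 1] for one street-view image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=positive_prompts + negative_prompts,
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarity to each prompt
    probs = logits.softmax(dim=-1)
    # Probability mass on the "locatable" prompts serves as the score.
    return probs[: len(positive_prompts)].sum().item()

# Example: keep only highly locatable images when curating a training set,
# with 0.7 as an assumed, illustrative threshold.
# if locatability_score("street_view.jpg") > 0.7: ...
```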

Paper Structure

This paper contains 21 sections, 1 equation, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Different geo-localization paradigms: existing retrieval-based (top left) and classification-based (bottom left) approaches, and the proposed LVLM-based approach (right).
  • Figure 2: The locatability quantification network uses a CLIP-based visual-text pairing approach to predict the locatability metric.
  • Figure 3: The architecture of GeoReasoner consists of three modules: Vision Encoder, VL Adapter and Pre-trained LLM. The model undergoes a two-fold supervised fine-tuning process: reasoning tuning and location tuning, to enable geo-localization with reasoning.
  • Figure 4: Locatability examples. Top row: the street views are highly locatable by signboards, architectural styles, and landmarks. Bottom row: no visual clues for locating the street views.
  • Figure 5: The relationship between building proportion and the degree of locatability in street views. The locatability metric peaks when the building proportion is approximately 0.2.
  • ...and 4 more figures
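To make the two-fold supervised fine-tuning in Figure 3 (reasoning tuning followed by location tuning) more concrete, here is a minimal, hypothetical sketch of attaching LoRA adapters to a vision-language model with the PEFT library. The base checkpoint, target modules, rank, and the pseudo training calls are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of the two-stage LoRA fine-tuning idea (reasoning tuning,
# then location tuning) using HuggingFace PEFT. Hyperparameters and the base
# model are assumed for illustration only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",        # assumed vision-language base model
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16,                        # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are updated

# Stage 1: reasoning tuning on (street view, textual reasoning) pairs distilled
# from geo-localization game knowledge.
# train(model, reasoning_dataset)   # pseudo-call: any standard SFT loop/Trainer

# Stage 2: location tuning on (street view, country/city label) pairs from the
# curated highly locatable dataset.
# train(model, location_dataset)    # pseudo-call
```

Keeping the vision encoder and pre-trained LLM frozen and updating only the adapters is what makes this kind of two-stage tuning feasible with modest training resources.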