GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Ling Li, Yu Ye, Yao Zhou, Bingchuan Jiang, Wei Zeng

TL;DR

This paper tackles street-view geo-localization by addressing the data-quality and reasoning gaps in LVLMs. It introduces locatability as a quantitative filter to curate a high-quality training set (over 70,000 Google Street View images) and leverages external human-inference knowledge from geo-localization games to train GeoReasoner, an LVLM fine-tuned with LoRA in two stages: reasoning tuning and location tuning. GeoReasoner achieves substantial gains over LVLM baselines (more than 25% at country level and 38% at city level), surpasses StreetCLIP while requiring fewer training resources, and generalizes well to open datasets. The work advances geo-localization by fusing LVLMs with human reasoning and data curation, offering improved interpretability and practical value for navigation and urban analysis.

Abstract

This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree to which street-view images are locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
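As a rough illustration of the CLIP-based visual-text pairing idea behind the locatability metric, the sketch below scores an image by how much probability mass CLIP assigns to prompts describing strong visual clues (signboards, landmarks, architectural styles) versus prompts describing featureless scenes. This is a hypothetical simplification for intuition only: the prompts, checkpoint, and threshold are assumptions, and the paper's actual locatability network is a trained component rather than a zero-shot score.

```python
# Hypothetical sketch: estimate how "locatable" a street-view image is by
# pairing it with text prompts describing strong vs. weak visual clues.
# This is NOT the paper's trained network; it only illustrates the idea of
# CLIP-based visual-text pairing for a locatability score.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed example prompts, not the paper's exact text entities.
positive_prompts = [
    "a street view with readable signboards and shop names",
    "a street view with a distinctive landmark",
    "a street view with a characteristic architectural style",
]
negative_prompts = [
    "an empty highway with no distinctive features",
    "a plain road surrounded by generic vegetation",
]

def locatability_score(image_path: str) -> float:
    """Return a rough locatability score in [0, 1] for one street-view image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=positive_prompts + negative_prompts,
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarity to each prompt
    probs = logits.softmax(dim=-1)
    # Probability mass on the "locatable" prompts serves as the score.
    return probs[: len(positive_prompts)].sum().item()

# Example: keep only highly locatable images when curating a training set,
# with 0.7 as an assumed, illustrative threshold.
# if locatability_score("street_view.jpg") > 0.7: ...
```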

Paper Structure

This paper contains 21 sections, 1 equation, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Different geo-localization paradigms: existing retrieval-based (top left) and classification-based (bottom left) approaches, and the proposed LVLM-based approach (right).
  • Figure 2: The locatability quantification network uses a CLIP-based visual-text pairing approach to predict the locatability metric.
  • Figure 3: The architecture of GeoReasoner consists of three modules: Vision Encoder, VL Adapter and Pre-trained LLM. The model undergoes a two-fold supervised fine-tuning process: reasoning tuning and location tuning, to enable geo-localization with reasoning.
  • Figure 4: Locatability examples. Top row: the street views are highly locatable by signboards, architectural styles, and landmarks. Bottom row: no visual clues for locating the street views.
  • Figure 5: The relationship between building proportion and the degree of locatability in street views. The locatability metric peaks when the building proportion is approximately 0.2.
  • ...and 4 more figures
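To make the two-fold supervised fine-tuning in Figure 3 (reasoning tuning followed by location tuning) more concrete, here is a minimal, hypothetical sketch of attaching LoRA adapters to a vision-language model with the PEFT library. The base checkpoint, target modules, rank, and the pseudo training calls are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of the two-stage LoRA fine-tuning idea (reasoning tuning,
# then location tuning) using HuggingFace PEFT. Hyperparameters and the base
# model are assumed for illustration only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",        # assumed vision-language base model
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16,                        # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are updated

# Stage 1: reasoning tuning on (street view, textual reasoning) pairs distilled
# from geo-localization game knowledge.
# train(model, reasoning_dataset)   # pseudo-call: any standard SFT loop/Trainer

# Stage 2: location tuning on (street view, country/city label) pairs from the
# curated highly locatable dataset.
# train(model, location_dataset)    # pseudo-call
```

Keeping the vision encoder and pre-trained LLM frozen and updating only the adapters is what makes this kind of two-stage tuning feasible with modest training resources.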