Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei

TL;DR

The paper tackles image geo-localization with LVLMs by addressing data diversity and reasoning-driven training. It introduces MP16-Reason, a diverse, reasoning-annotated dataset, and GLOBE, a GRPO-based LVLM fine-tuning framework that jointly enhances localizability, visual grounding, and geolocation accuracy. Through extensive experiments, GLOBE demonstrates data-efficient, interpretable reasoning and strong performance against open-source baselines, with notable generalization to unseen domains. The approach presents a practical, open-source path toward more reliable and explainable multimodal geo-localization, while outlining future directions for coordinate-level precision and broader reasoning tasks.

Abstract

Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Localizability assessment and Optimized visual-cue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories. The data and code are available at https://github.com/lingli1996/GLOBE.

Paper Structure

This paper contains 27 sections, 6 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of data and modeling limitations in LVLM-based image geo-localization.
  • Figure 2: Example reasoning trajectories generated by GLOBE, illustrating interpretable and visually grounded geolocation predictions.
  • Figure 3: The pipeline of data synthesis and curation via multi-model distillation and verification.
  • Figure 4: GRPO optimization framework with multi-dimensional reward design. For each prompt, candidate outputs are scored using three task-specific reward models: $R_{\text{loc}}$, $R_{\text{vis}}$, and $R_{\text{geo}}$, which reflect different aspects of geo-localization reasoning. Group-wise advantage values guide policy updates, while a $\mathcal{D}_{\text{KL}}$ penalty constrains divergence from the reference model.
  • Figure 5: Reasoning comparison of four different models (GPT-4.1, GLOBE, Qwen2.5-VL-7B with SFT, and InternVL3-78B) on the same input image. Reliable visual cues identified by the models are marked in text.
  • ...and 1 more figure
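
The Figure 4 caption describes candidate outputs for one prompt being scored by three rewards and then compared within their group. The sketch below illustrates that group-relative step only, under stated assumptions: the weighted sum in `combine_rewards`, the unit weights, and the example scores are hypothetical and not taken from the paper, and the normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form rather than a confirmed detail of GLOBE. The KL penalty mentioned in the caption is not modeled here.

```python
from statistics import mean, pstdev


def combine_rewards(r_loc, r_vis, r_geo, weights=(1.0, 1.0, 1.0)):
    """Combine the three task rewards into one scalar.

    Assumption: a weighted sum with unit weights. The source does not
    state how R_loc, R_vis, and R_geo are aggregated.
    """
    w_loc, w_vis, w_geo = weights
    return w_loc * r_loc + w_vis * r_vis + w_geo * r_geo


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (one prompt's candidates).

    Outputs that score above the group mean get a positive advantage,
    outputs below it get a negative one. eps avoids division by zero
    when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (R_loc, R_vis, R_geo) scores for four candidate outputs
# sampled for the same prompt.
candidates = [
    (1.0, 0.5, 1.0),
    (1.0, 0.0, 0.0),
    (0.0, 0.5, 0.0),
    (1.0, 1.0, 1.0),
]

totals = [combine_rewards(*scores) for scores in candidates]
advantages = group_relative_advantages(totals)

for total, adv in zip(totals, advantages):
    print(f"reward={total:.2f}  advantage={adv:+.3f}")
```

Because the advantages are centered on the group mean, they sum to roughly zero, so the policy update favors the better candidates in a group relative to the weaker ones without needing a separate value model.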