Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models

Bo Yu, Fengze Yang, Yiming Liu, Chao Wang, Xuewen Luo, Taozhe Li, Ruimin Ke, Xiaofan Zhou, Chenxi Liu

Abstract

The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.

Paper Structure (25 sections, 9 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of locatability-guided adaptive reasoning. Current reasoning VLMs for image geo-localization are trained on fixed chains of thought and generate predetermined reasoning trajectories step by step, without considering the actual image locatability during inference. We propose Geo-ADAPT, an adaptive reasoning framework that dynamically adjusts reasoning depth based on the image's Optimized Locatability Score $L_{opt}$.
  • Figure 2: Overview of the Geo-ADAPT framework. Our approach comprises three components: (1) an Optimized Locatability Score $L_{opt}$ that quantifies reasoning feasibility, (2) curation of the locatability-stratified dataset Geo-ADAPT-51K with enriched implicit reasoning trajectories, and (3) two-stage GRPO training with adaptive-depth, grounding, and hierarchical-accuracy rewards for dynamic reasoning allocation.
  • Figure 3: Motivation for the Optimized Locatability Score $L_{opt}$ and locatability-guided adaptive reasoning strategies. (a) RAG retrieves visually similar candidates but fails because it cannot convert implicit visual cues into rich semantic priors. (b) Standard reasoning initially fails in the same way, but integrating implicit cues from the retrieved image candidates into deep reasoning enables correct localization. Images where $d_{Reason} > d_{RAG} + \tau_{margin}$ receive a lower $L_{opt}$, triggering adaptive deep reasoning.
  • Figure 4: Qualitative comparison on a weakly localizable sample ($L_{opt}=0.42$). Facing visual ambiguity, Geo-ADAPT adaptively expands its reasoning depth (202 tokens), conducting a detailed forensic scan to extract the address "17510" and localize the exact shopping center. In contrast, the baseline generates a longer but less grounded response (318 tokens) and fails to identify the city.
  • Figure 5: Qualitative comparison on a highly localizable sample ($L_{opt}=0.67$). Recognizing the strong explicit cues (Olympic Rings), Geo-ADAPT adaptively shortens its reasoning path for efficiency (156 tokens), rapidly converging on the correct location. This contrasts with the baseline, which engages in verbose description (329 tokens) without improving precision.
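
The margin condition in the Figure 3 caption, $d_{Reason} > d_{RAG} + \tau_{margin}$, can be illustrated with a minimal Python sketch. It assumes that $d_{Reason}$ and $d_{RAG}$ are great-circle errors (in km) between the ground-truth location and the predictions of the reasoning and RAG pipelines; the captions do not state this, and they do not give a value for $\tau_{margin}$ or the rule by which $L_{opt}$ is lowered. The function names `haversine_km` and `reasoning_underperforms_rag` and the margin value in the usage note are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def reasoning_underperforms_rag(truth, reason_pred, rag_pred, tau_margin_km):
    """Return True when d_Reason > d_RAG + tau_margin.

    Assumption: both distances are geodesic errors against the ground truth.
    Each argument except the margin is a (lat, lon) tuple in degrees.
    """
    d_reason = haversine_km(*truth, *reason_pred)
    d_rag = haversine_km(*truth, *rag_pred)
    return d_reason > d_rag + tau_margin_km
```

For example, `reasoning_underperforms_rag((0, 0), (0, 5), (0, 1), 100.0)` flags an image whose reasoning prediction is about 556 km off while the RAG prediction is about 111 km off. Under the caption's description, such an image would receive a lower $L_{opt}$.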