SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

TL;DR

SpotAgent reframes image geo-localization as an agentic decision process that actively leverages external tools within a ReAct loop to verify visual cues. The method introduces a three-stage post-training pipeline (SFT, Agentic Cold Start, RL) and a Spatially-Aware Dynamic Filtering strategy to efficiently train the agent on challenging, long-tail data. A Multi-Agent ReAct data-generation framework produces high-quality, tool-enabled trajectories, enabling grounded predictions with verifiable coordinates. Empirical results on standard benchmarks show state-of-the-art performance among retrieval-free approaches and substantial reductions in hallucinations, highlighting the practical impact of tool-assisted, agentic reasoning for robust geo-localization in the wild.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization as an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct loop. We introduce a three-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase that uses high-quality trajectories synthesized via a Multi-Agent framework to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning (RL). We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
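
The ReAct-style interaction described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Step` record, the `react_loop` function, the tool names `web_search` and `map_lookup`, the scripted `toy_policy`, and the placeholder coordinates are all hypothetical stand-ins. In SpotAgent the policy would be the trained LVLM and the tools would be real search and map services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str   # free-text reasoning produced before acting
    action: str    # a tool name, or "answer" to stop (hypothetical convention)
    argument: str  # tool query, or "lat,lon" when action == "answer"


History = List[Tuple[Step, str]]  # (step, observation) pairs


def react_loop(
    policy: Callable[[str, History], Step],
    tools: Dict[str, Callable[[str], str]],
    image_description: str,
    max_steps: int = 4,
) -> Tuple[Optional[Tuple[float, float]], History]:
    """Alternate reasoning and tool calls until the policy answers."""
    history: History = []
    for _ in range(max_steps):
        step = policy(image_description, history)
        if step.action == "answer":
            lat, lon = (float(x) for x in step.argument.split(","))
            return (lat, lon), history
        tool = tools.get(step.action)
        observation = tool(step.argument) if tool else f"unknown tool: {step.action}"
        history.append((step, observation))
    return None, history  # step budget exhausted without an answer


# Toy stand-ins (assumptions) so the sketch runs offline.
def toy_policy(image_description: str, history: History) -> Step:
    if not history:
        return Step("A shop sign is legible; look it up.", "web_search", "Cafe Example sign")
    if len(history) == 1:
        return Step("The search names a place; check the map.", "map_lookup", history[-1][1])
    return Step("The map result confirms the cue.", "answer", history[-1][1])


toy_tools = {
    "web_search": lambda query: "Exampleville",
    "map_lookup": lambda place: "12.5,-45.25",  # placeholder coordinates
}

coords, trace = react_loop(toy_policy, toy_tools, "street scene with a cafe sign")
```

Returning the full trace alongside the coordinates keeps every tool observation available for inspection, which is one plausible way to make a prediction verifiable in the sense the abstract describes.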

Paper Structure (55 sections, 11 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Overview of the proposed framework. (a) Inference Pipeline: SpotAgent operates as a ReAct loop, formulating geo-localization as a sequential decision-making process to interact with external tools. (b) Data Generation Pipeline: A Multi-Agent framework is employed for synthesizing high-quality training trajectories.
  • Figure 2: Post-training framework of SpotAgent. (a) The base model is first aligned to the geo-localization task through supervised fine-tuning. (b) Agentic capabilities are then activated via an agentic cold start stage, where the model learns to perform agentic reasoning with tool interactions. (c) A dynamic data filtering strategy selects learnable samples to improve RL optimization efficiency. (d) The model's reasoning capabilities are further refined through RL using the filtered dataset.
  • Figure 3: Reasoning comparison: CoT Mode vs. Agentic Mode on an Im2GPS3k benchmark image.
  • Figure 4: Reasoning comparison: CoT Mode vs. Agentic Mode on an image from Street View Text Dataset.
  • Figure 5: Analysis of Agent Behaviors on Im2GPS3k. (a) Distribution of tool usage combinations. (b) Visualization of the step-wise reasoning trajectories.
  • ...and 6 more figures
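
The dynamic filtering step in Figure 2(c) selects "learnable" samples, but this page does not give the criterion. The sketch below shows one possible reading, purely as an assumption: a sample counts as learnable when some, but not all, of the model's rollouts land within a distance threshold of the ground truth, so it is neither trivially solved nor hopeless. The names `haversine_km` and `is_learnable`, the 25 km threshold, and the selection rule are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def is_learnable(truth: Coord, rollouts: List[Coord], threshold_km: float = 25.0) -> bool:
    """Assumed rule: keep samples with mixed outcomes across rollouts."""
    hits = sum(haversine_km(truth, p) <= threshold_km for p in rollouts)
    return 0 < hits < len(rollouts)


# Example: one rollout is on target and one is far away, so the sample is kept.
keep = is_learnable((0.0, 0.0), [(0.0, 0.0), (10.0, 10.0)])
```

Under this assumed rule, samples where every rollout succeeds or every rollout fails are skipped, since identical outcomes across rollouts give a group-relative RL objective little signal to learn from.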