GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, MingMing Cheng, Qibin Hou

TL;DR

GeoAgent tackles image geolocation by aligning model reasoning with human geographic cognition. It introduces GeoSeek, a three-part dataset with human-annotated CoT, bias-aware sampling, and fine-grained location labels, and pairs it with a geo-similarity reward and a consistency reward in a two-stage training pipeline (SFT followed by GRPO). The approach outperforms existing methods at the city, region, country, and continent scales on IM2GPS3K and GeoSeek-Val, while generating human-aligned, hierarchical reasoning. The work advances open-world geolocation through interpretable, geography-aware RL objectives and careful data design, pointing toward more reliable and explainable fine-grained geolocation systems.
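
To make the GRPO stage more concrete, the following is a minimal sketch of how two reward terms could feed a group-relative advantage. It assumes the standard GRPO normalization (subtract the group mean, divide by the group standard deviation). The weighted sum in `combined_reward`, its weights `w_geo` and `w_cons`, and all function names are hypothetical; this summary does not state how GeoAgent actually combines the two rewards.

```python
from statistics import mean, pstdev


def combined_reward(geo_sim, consistency, w_geo=1.0, w_cons=1.0):
    """Hypothetical weighted sum of the two reward terms (weights are assumptions)."""
    return w_geo * geo_sim + w_cons * consistency


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here; implementations differ
    on this detail, so treat it as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers for one image, each with made-up reward components.
    samples = [(0.9, 1.0), (0.4, 1.0), (0.9, 0.0), (0.1, 0.0)]
    rewards = [combined_reward(g, c) for g, c in samples]
    print(group_relative_advantages(rewards))
```

Responses that score above the group average receive a positive advantage and the rest a negative one, so the policy is pushed toward answers that are both geographically close and internally consistent relative to its own samples.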

Abstract

This paper presents GeoAgent, a model that reasons in close alignment with humans and derives fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because they rely on AI-generated chain-of-thought (CoT) data and on training strategies that conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We then explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward, the latter assessed by a consistency agent, to assist training. These rewards encourage the model to converge towards correct answers from a geographic perspective while preserving the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with that of humans.

Paper Structure

This paper contains 35 sections, 12 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: GeoSeek Dataset. We train GeoAgent with GeoSeek, a geolocation dataset with bias-reducing sampling and a validation benchmark annotated with locatability and geographic elements. Note that a single image may contain multiple geographic elements.
  • Figure 2: Data construction and training pipeline of GeoAgent. GeoSeek-CoT contains 10k high-quality reasoning processes labeled by geography experts and geolocation game players. GeoSeek-Loc includes 20k images for the cold start of GeoAgent-SFT. During GRPO-based training, which starts from GeoAgent-SFT, the geo-similarity reward encourages the model to converge towards correct answers both physically and semantically, and the consistency reward preserves the integrity and consistency of the CoT.
  • Figure 3: Inconsistent CoT and different descriptions of the same location. Left: incomplete and inconsistent CoT. Right: consistent CoT after training with the consistency agent. Different final answers may also refer to the same location (e.g., Hefei and Hefei City); Geo-Similarity is introduced to handle this case.
  • Figure 4: Discussion of reward functions. The two scatter plots show, respectively, why a directly-judged reward is unreasonable and how the semantic similarity reward helps. The curve of reward values over training steps shows the importance of the consistency reward in enabling the model to establish a complete reasoning framework.
  • Figure 5: GeoScore on GeoSeek-Val. We compare different models across multiple locatability levels and geographic elements.
  • ...and 9 more figures
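
The captions for Figures 2 and 3 describe a geo-similarity reward that credits answers that are close "both physically and semantically", so that "Hefei" and "Hefei City" are not treated as different places. The sketch below illustrates that idea only; it is not the paper's formula. The physical term is the haversine great-circle distance passed through an exponential decay, and the semantic term is replaced by plain string similarity from Python's `difflib`, a deliberately simple stand-in for whatever semantic measure the authors use. The names `normalize_name`, `distance_score`, `name_score`, and `geo_similarity`, the decay scale `scale_km`, and the mixing weight `alpha` are all assumptions.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def normalize_name(name):
    """Hypothetical normalization: lowercase and drop a trailing ' city'."""
    name = name.strip().lower()
    if name.endswith(" city"):
        name = name[: -len(" city")]
    return name


def distance_score(pred_coord, true_coord, scale_km=750.0):
    """Physical term: 1.0 at zero distance, decaying exponentially (scale is an assumption)."""
    return math.exp(-haversine_km(*pred_coord, *true_coord) / scale_km)


def name_score(pred_name, true_name):
    """Stand-in for the semantic term: string similarity of normalized names."""
    return SequenceMatcher(None, normalize_name(pred_name), normalize_name(true_name)).ratio()


def geo_similarity(pred_name, pred_coord, true_name, true_coord, alpha=0.5):
    """Hypothetical blend of the physical and name-based terms, in [0, 1]."""
    return alpha * distance_score(pred_coord, true_coord) + (1 - alpha) * name_score(pred_name, true_name)


if __name__ == "__main__":
    hefei = (31.82, 117.23)    # approximate coordinates
    nanjing = (32.06, 118.80)  # approximate coordinates
    print(geo_similarity("Hefei City", hefei, "Hefei", hefei))
    print(geo_similarity("Nanjing", nanjing, "Hefei", hefei))
```

Under these assumptions, a prediction of "Hefei City" at the true coordinates scores the same as "Hefei", while a nearby but different city receives partial credit from the distance term rather than a flat zero.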