GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, MingMing Cheng, Qibin Hou
TL;DR
GeoAgent tackles image geolocation by aligning model reasoning with human geographic cognition. It introduces GeoSeek, a three-part dataset with human-annotated CoT, bias-aware sampling, and fine-grained location labels, and pairs it with a geo-similarity reward and a consistency reward in a two-stage training pipeline (SFT followed by GRPO). The approach yields higher accuracy than existing methods at the city, region, country, and continent scales on IM2GPS3K and GeoSeek-Val, while generating hierarchical reasoning that aligns with how humans reason. The work advances open-world geolocation through interpretable, geography-aware RL objectives and careful data design, and points toward more reliable, explainable geolocation systems that better model human geographic reasoning in open environments.
Abstract
This paper presents GeoAgent, a model that reasons in close alignment with humans and derives fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because they rely on AI-generated chain-of-thought (CoT) data and on training strategies that conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We then thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward, the latter assessed by a consistency agent, to assist training. Together, these rewards encourage the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with that of humans.
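The abstract does not define the geo-similarity reward, so the following is only an illustrative sketch of how a distance-aware reward could feed a GRPO-style update, not the authors' formulation. It assumes predictions and ground truth are (latitude, longitude) pairs, uses great-circle (haversine) distance with an exponential decay, and then applies the group-relative normalization that GRPO uses to turn rewards into advantages. The function names, the decay form, and the `scale_km` value are all hypothetical choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp guards against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geo_similarity_reward(pred, truth, scale_km=750.0):
    """Hypothetical reward in (0, 1]: 1 at zero distance, decaying with distance.

    The exponential form and scale_km are assumptions for illustration only.
    """
    distance = haversine_km(pred[0], pred[1], truth[0], truth[1])
    return math.exp(-distance / scale_km)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    truth = (48.8566, 2.3522)  # Paris
    # Three sampled answers for the same image: Paris, London, New York.
    group = [(48.8566, 2.3522), (51.5074, -0.1278), (40.7128, -74.0060)]
    rewards = [geo_similarity_reward(p, truth) for p in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Unlike a binary correct/incorrect signal, a reward of this shape gives partial credit to a near miss (London for Paris) over a distant one (New York for Paris), which is one way a training objective can reflect geographic proximity.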
