Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning

Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, Jinjun Xiong

TL;DR

Geo-R1 addresses the scarcity of geospatial reasoning supervision by proposing a two-stage post-training framework: a scaffolding stage that teaches a geospatial thinking paradigm via synthetic chain-of-thought data, and an elevating stage that uses GRPO-based RLVR with a cross-view pairing reward to refine reasoning under weak supervision. The approach yields substantial gains on GeoChain and IMAGEO benchmarks, demonstrating strong out-of-distribution generalization while preserving primitive multimodal abilities. By marrying imitation-based thinking with outcome-driven reinforcement learning, Geo-R1 enables open VLMs to perform cross-view geospatial reasoning without dense annotations, with practical implications for disaster response, urban planning, and geospatial analytics.

Abstract

We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised fine-tuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
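
To make the elevating stage more concrete, the following is a minimal sketch of how a verifiable cross-view pairing reward could feed a GRPO-style group-relative advantage. It is an illustration under stated assumptions, not the authors' implementation: the `<answer>` tag format, the multiple-choice framing of the pairing task, and the function names `extract_answer`, `pairing_reward`, and `group_relative_advantages` are all hypothetical. Only the advantage normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import re
import statistics
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> tag, or None.

    The tag format is an assumption made for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def pairing_reward(completion: str, correct_choice: str) -> float:
    """Binary verifiable reward: 1.0 if the chosen candidate is the true pair.

    Assumes the task is posed as picking one labeled candidate (e.g. "A".."E")
    from the other view that matches the query image.
    """
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.upper() == correct_choice.upper() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled completions; the true pair is candidate "B".
completions = [
    "<think>Coastline and pier match.</think><answer>B</answer>",
    "<think>Road grid looks similar.</think><answer>C</answer>",
    "<think>Unsure.</think>",  # no answer tag, so the reward is 0
    "<think>Roof colors agree.</think><answer>b</answer>",
]
rewards = [pairing_reward(c, "B") for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the reward only checks the final choice against a known pairing, it needs no human-written reasoning annotations, which is the sense in which the supervision is weak but verifiable. When every completion in a group receives the same reward, the advantages are all zero and that group contributes no gradient signal.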

Paper Structure

This paper contains 46 sections, 7 equations, 23 figures, 13 tables.

Figures (23)

  • Figure 1: Geo-R1 significantly outperforms the baseline [bai2025qwen2] across 13 verifiable geo-reasoning tasks on the GeoChain benchmark [yerramilli2025geochain] in the zero-shot setting. See Table \ref{tab:geochain question} for a detailed description of these tasks.
  • Figure 2: Geo-R1 overview. Geo-R1 provides a framework for building geospatial reasoning.
  • Figure 3: Geospatial thinking CoT data engine.
  • Figure 4: Cross-view pairing task for reinforcement learning with verifiable rewards.
  • Figure 5: Results on IMAGEO dataset-GSS [li2025pixelsplacessystematicbenchmark].
  • ...and 18 more figures