Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning

Zilun Zhang; Zian Guan; Tiancheng Zhao; Haozhan Shen; Tianyu Li; Yuxiang Cai; Zhonggen Su; Zhaojun Liu; Jianwei Yin; Xiang Li

TL;DR

This work tackles few-shot geospatial referring expression understanding by introducing Geo-R1, a reinforcement fine-tuning framework that enforces explicit reasoning before grounding. It adapts Group Relative Policy Optimization (GRPO) to vision–language grounding, using task-specific rewards (Format and Metrics) to train models to generate interpretable reasoning chains before localizing targets. Geo-R1 is validated on three few-shot remote sensing (RS) benchmarks (VRSBench-FS, EarthReason-FS, NWPU-FS), where it improves strongly over supervised fine-tuning (SFT) baselines, generalizes across datasets, and approaches full-data performance with relatively few examples. Its reasoning traces make predictions interpretable, which is of practical value for RS applications where labeled data is scarce. The work also provides open-source code and reproducible protocols for future research.

Abstract

Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While multimodal large language models trained with supervised fine-tuning (SFT) achieve strong performance given massive labeled datasets, they struggle in data-scarce scenarios and generalize poorly. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then to leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at: https://github.com/Geo-R1/geo-r1.
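
To make the "reason first, then act" reward idea concrete, the following is a minimal sketch of how a format reward, a metric reward, and GRPO-style group-relative advantages could fit together for bounding-box grounding. It is not the authors' implementation. The `<think>`/`<answer>` output template, the use of IoU as the metric, the population standard deviation in the normalization, and every function name below are assumptions made for illustration only.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template (an assumption, not taken from the paper):
#   <think> free-form reasoning </think><answer>[x1, y1, x2, y2]</answer>
PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>\s*\[(.+?)\]\s*</answer>\s*$",
    re.DOTALL,
)


def parse_box(text):
    """Return [x1, y1, x2, y2] if the text follows the template, else None."""
    match = PATTERN.match(text)
    if match is None:
        return None
    try:
        coords = [float(v) for v in match.group(2).split(",")]
    except ValueError:
        return None
    return coords if len(coords) == 4 else None


def format_reward(text):
    """1.0 when reasoning precedes a well-formed box answer, otherwise 0.0."""
    return 1.0 if parse_box(text) is not None else 0.0


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def metric_reward(text, gt_box):
    """IoU of the predicted box against the ground truth (assumed metric)."""
    box = parse_box(text)
    return iou(box, gt_box) if box is not None else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score a group of sampled responses for one (image, expression) pair.
gt = [10.0, 10.0, 50.0, 50.0]
group = [
    "<think>The plane sits beside the hangar.</think><answer>[10, 10, 50, 50]</answer>",
    "<think>Probably the left runway.</think><answer>[30, 10, 70, 50]</answer>",
    "[10, 10, 50, 50]",  # no reasoning block, so both rewards are zero
]
rewards = [format_reward(t) + metric_reward(t, gt) for t in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples in the same group, no separate value model is needed; a response is reinforced only to the extent that it beats its siblings on the combined reward.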
