RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow
Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng
TL;DR
The paper tackles the challenge of interpreting complex user intents and spatial relationships in Earth observation data by proposing RemoteReasoner, a unified, RL-trained geospatial reasoning workflow. It integrates a multi-modal language model with a flexible, task-transformation inference pipeline to produce pixel-, region-, and contour-level outputs from a single forward pass, avoiding task-specific decoders. Training via Group Relative Policy Optimization and a composite reward fosters autonomous reasoning while preserving the MLLM's generalization, enabling robust performance on unseen tasks and out-of-distribution categories. The approach achieves state-of-the-art results across multi-granularity reasoning tasks and demonstrates strong zero-shot and OOD capabilities, offering a flexible foundation for downstream geospatial intelligence with efficient inference through Mask2Contour-based contours.
Abstract
Remote sensing imagery presents vast, inherently unstructured spatial data, necessitating sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths, rather than being confined to predefined ground-truth sequences. Ideally, its architecture should be unified yet general, able to perform diverse reasoning tasks with a single model and without additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, limiting both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task transformation strategies that support multi-granularity tasks at the object, region, and pixel levels. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrate that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM's inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.
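
To make the RL training signal concrete, the following is a minimal Python sketch of the group-relative advantage used by Group Relative Policy Optimization, the method named in the TL;DR. It is an illustration under stated assumptions, not the paper's implementation: the reward component names (`format`, `accuracy`), their weights, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def composite_reward(components, weights):
    """Weighted sum of reward components.

    The component names and weights are hypothetical; the source only
    states that a composite reward is used.
    """
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO scores several responses sampled for the same prompt and uses
    (reward - group mean) / group std as the advantage, so no separate
    value network is needed. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one query, each scored on two components.
weights = {"format": 0.5, "accuracy": 1.0}
group = [
    {"format": 1.0, "accuracy": 0.75},
    {"format": 1.0, "accuracy": 0.25},
    {"format": 0.0, "accuracy": 0.5},
    {"format": 1.0, "accuracy": 0.0},
]
rewards = [composite_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below it are discouraged. If every response in a group earns the same reward, all advantages are zero and that group contributes no gradient.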
