RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng

TL;DR

The paper tackles the challenge of interpreting complex user intents and spatial relationships in Earth observation data by proposing RemoteReasoner, a unified, RL-trained geospatial reasoning workflow. It integrates a multi-modal language model with a flexible, task-transformation inference pipeline to produce pixel-, region-, and contour-level outputs from a single forward pass, avoiding task-specific decoders. Training via Group Relative Policy Optimization (GRPO) with a composite reward fosters autonomous reasoning while preserving the MLLM's generalization, enabling robust performance on unseen tasks and out-of-distribution (OOD) categories. The approach achieves state-of-the-art results across multi-granularity reasoning tasks and demonstrates strong zero-shot and OOD capabilities, offering a flexible foundation for downstream geospatial intelligence, with efficient contour-level inference through Mask2Contour.
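To make the training signal concrete, the sketch below shows the group-relative advantage that gives GRPO its name: several responses are sampled for the same query, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is only an illustration. The function names, the reward weights, and the choice of a format term plus an IoU term in `composite_reward` are assumptions made here, not the reward actually defined in the paper.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, iou, w_format=0.5, w_iou=1.0):
    """Hypothetical composite reward: a format term plus a localization term.

    The weights and the two components are illustrative assumptions.
    """
    return w_format * (1.0 if format_ok else 0.0) + w_iou * iou


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sampling group.

    Population standard deviation is used here; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one query, scored with the hypothetical reward.
rewards = [
    composite_reward(True, 0.8),
    composite_reward(True, 0.2),
    composite_reward(False, 0.6),
    composite_reward(False, 0.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group mean receive a positive advantage and are reinforced, while those below it are discouraged, so no separate value network is needed to provide a baseline.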

Abstract

Remote sensing imagery presents vast, inherently unstructured spatial data, necessitating sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths, rather than being confined to predefined ground-truth sequences. Ideally, its architecture ought to be unified yet generalized, able to perform diverse reasoning tasks with one model and without additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, limiting both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task transformation strategies that enable multi-granularity tasks, including object-, region-, and pixel-level reasoning. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrate that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM's inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.
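The idea of deriving several output granularities from one prediction, without a separate decoder per task, can be illustrated with a small sketch. Assuming a binary mask for the target is already available (how the workflow obtains it is not described in this summary), a region-level bounding box and a set of contour pixels follow by simple post-processing. The functions `mask_to_bbox` and `mask_to_contour` are hypothetical helpers written for this example; the contour step is a plain 4-neighbour boundary test and is not the paper's Mask2Contour procedure.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_to_contour(mask):
    """Return foreground pixels that touch background or the image border.

    Uses a 4-neighbourhood test; pixels are reported as (x, y) in scan order.
    """
    h = len(mask)
    w = len(mask[0]) if h else 0
    contour = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                outside = not (0 <= nx < w and 0 <= ny < h)
                if outside or not mask[ny][nx]:
                    contour.append((x, y))
                    break
    return contour


# A 5x5 toy mask with a 3x3 foreground block in the centre.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
contour = mask_to_contour(toy_mask)
```

For the toy mask, the bounding box covers the 3x3 block and the contour contains every block pixel except the interior centre, which shows how region- and contour-level outputs can share one underlying prediction.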

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison between Existing Frameworks and RemoteReasoner. Existing remote sensing reasoning approaches (e.g., SegEarth-R1) require SFT with annotated reasoning processes and are limited to single-task outputs and a task-specific decoder. In contrast, our framework supports unsupervised reasoning and multi-granularity tasks.
  • Figure 2: Overview of our RemoteReasoner. We utilize GRPO (Shao et al., 2024) to explore the model's self-thinking capability. Then we design an inference workflow to perform multi-granularity reasoning tasks.
  • Figure 3: Geospatial Contour Reasoning Results. We select EPOC (Chen et al., 2024) for comparison.
  • Figure 4: Quantitative results of other tasks (Image Captioning & VQA).
  • Figure 5: Qualitative Results. Given an image and its corresponding implicit query, RemoteReasoner autonomously identifies the user-intended target category through reasoning and accurately executes visual-centric tasks across three granularities.
  • ...and 1 more figures