GRASP: Geospatial pixel Reasoning viA Structured Policy learning

Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li, Jiayang Li, Yue Zhou, Hongjie He, Jonathan Li

TL;DR

Geospatial pixel reasoning aims to generate precise segmentation masks from natural-language queries but is hampered by the high cost of dense mask annotations and poor generalization under supervised fine-tuning. The authors propose GRASP, a cascaded framework that decouples reasoning and segmentation by using a pretrained multimodal LLM to produce structured reasoning and spatial prompts, which feed a frozen SAM2 segmentation model. To enhance generalization and reduce annotation burden, they introduce PRIME, a pure reinforcement learning paradigm, and BoP-Rewards, a cost-aware scheme that substitutes dense masks with a bounding box and two positive points while enforcing a parsable reasoning format and localization accuracy. Evaluations on EarthReason, GeoPixInstruct, and a new GRASP-1k benchmark show state-of-the-art in-domain performance and up to 54% gains in out-of-domain scenarios, supporting the effectiveness and scalability of RL with structured, cost-efficient supervision for geospatial pixel reasoning. The work also describes its data construction pipeline and releases GRASP-1k to support out-of-domain evaluation.

Abstract

Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. To reduce annotation costs, we design BoP-Rewards, which substitutes dense mask labels with a bounding box and positive points. It further verifies outputs through two complementary signals: format, which constrains the reasoning and grounding structure to remain syntactically parsable, and accuracy, which evaluates the quality of predicted boxes and points. For evaluation, we train our method and all baselines on EarthReason and GeoPixInstruct, constructing an in-domain benchmark by merging their test sets. We further release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks. Experimental results demonstrate state-of-the-art (SOTA) in-domain performance and up to 54% improvement in out-of-domain scenarios, confirming that reinforcement learning with cost-aware rewards provides a robust and scalable paradigm for geospatial pixel reasoning. All code and datasets will be released publicly.
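To make the two reward signals concrete, the following is a minimal Python sketch of a format check plus an accuracy score over a box and two points. It is not the authors' implementation: the `<think>`/`<answer>` template, the JSON keys `bbox` and `points`, the pixel-distance threshold `max_dist`, and the unweighted sum of the three terms are all assumptions made for illustration, since the abstract does not specify them.

```python
import json
import math
import re

# Hypothetical output template: the tag names and JSON keys are assumptions.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def parse_output(text):
    """Return (bbox, points) if the output is parsable, otherwise None."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    try:
        answer = json.loads(match.group(2))
        bbox = [float(v) for v in answer["bbox"]]
        points = [[float(v) for v in p] for p in answer["points"]]
    except (ValueError, KeyError, TypeError):
        return None
    if len(bbox) != 4 or len(points) != 2 or any(len(p) != 2 for p in points):
        return None
    return bbox, points


def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bop_reward(text, gt_bbox, gt_points, max_dist=30.0):
    """Format term (0 or 1) + box IoU + fraction of points near a labeled point."""
    parsed = parse_output(text)
    if parsed is None:
        return 0.0  # unparsable output earns no reward at all
    bbox, points = parsed
    format_reward = 1.0
    box_reward = box_iou(bbox, gt_bbox)
    # Assumed point criterion: a predicted point counts if it lies within
    # max_dist pixels of any labeled positive point.
    hits = sum(
        1 for p in points
        if min(math.dist(p, g) for g in gt_points) <= max_dist
    )
    point_reward = hits / len(points)
    return format_reward + box_reward + point_reward
```

Under these assumptions, a well-formed output that matches the labels exactly scores 3.0, and any output that fails to parse scores 0.0, so the format signal acts as a gate on the accuracy signal.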

Paper Structure

This paper contains 35 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of our method’s advantages. (a.) Existing geospatial pixel reasoning paradigm: MLLM is trained with dense mask supervision under SFT, which leads to costly annotation and poor OOD generalization. (b.) Our paradigm: GRASP introduces PRIME, where the MLLM is optimized purely with RL. To replace dense mask labels, we design BoP-Rewards, which guide training through structured format and accuracy signals. (c.) Annotation efficiency: fine-grained mask labeling takes $\sim$30s per sample and, for example, outlining the pier in the figure requires a polygon with 38 vertices. In contrast, our scheme only needs one bounding box and two positive points, requiring $\sim$3s. (d.) Quantitative comparisons: our method achieves SOTA performance with clear gains in both in-domain (+4%) and out-of-domain (+54%) benchmarks.
  • Figure 2: Overview of the model architecture and the training workflow. The framework consists of two main components: an MLLM and a segmentation model. The MLLM comprises a vision encoder and a language model (LM) decoder. The LM decoder output contains both reasoning tokens and spatial grounding predictions in the form of bounding boxes and positive points. The spatial grounding predictions are fed into the SAM2 prompt encoder as prompts, while the original image is simultaneously input to the SAM2 image encoder. Finally, the SAM2 decoder combines both sources of information to produce the segmentation mask.
  • Figure 3: User prompt for GRASP. {Question} is replaced with the geospatial pixel reasoning question $Q$ in both the training and inference stages.
  • Figure 4: Reconstruction pipeline for training data. We transform dense segmentation masks into sparse supervision comprising a bounding box and two positive points.
  • Figure 5: Overview of the GRASP-1k construction pipeline. We first curate seven OOD image pools and filter out low-quality images with BRISQUE scores above 50. For the remaining images, Gemini-2.5-Pro is used to generate reasoning-intensive questions, complete with detailed explanations and spatially grounded answers. Human annotators then click on positive points indicated by the answer and leverage SAM2 for rapid segmentation. In challenging cases where SAM2 fails, manual annotations are performed using LabelMe.
  • ...and 4 more figures
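The mask-to-sparse-label conversion described in the Figure 4 caption can be sketched in Python with NumPy as follows. The caption states only that each dense mask becomes one bounding box and two positive points; how the points are chosen is not given here, so the rule below (the foreground pixel nearest the centroid, then the foreground pixel farthest from it) is an assumption for illustration, and `mask_to_box_and_points` is a hypothetical helper name.

```python
import numpy as np


def mask_to_box_and_points(mask):
    """Convert a binary mask (H x W) into a bounding box and two positive points.

    Returns ((x_min, y_min, x_max, y_max), [(x1, y1), (x2, y2)]) in pixel
    coordinates. The point-selection rule is an assumption, not the paper's.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask has no foreground pixels")

    # Tight axis-aligned box around all foreground pixels.
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    coords = np.stack([xs, ys], axis=1).astype(np.float64)
    centroid = coords.mean(axis=0)

    # First point: foreground pixel closest to the centroid, so the point
    # stays on the object even when the shape is concave.
    i1 = int(np.argmin(((coords - centroid) ** 2).sum(axis=1)))
    # Second point: foreground pixel farthest from the first, to spread
    # the two prompts across the object.
    i2 = int(np.argmax(((coords - coords[i1]) ** 2).sum(axis=1)))

    points = [(int(xs[i1]), int(ys[i1])), (int(xs[i2]), int(ys[i2]))]
    return bbox, points
```

Both points are drawn from the mask's own pixels, so they are positive by construction. A production pipeline might prefer interior points far from the object boundary, which this sketch does not attempt.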