GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li, Jiayang Li, Yue Zhou, Hongjie He, Jonathan Li
TL;DR
Geospatial pixel reasoning aims to generate precise segmentation masks from natural-language queries, but it is hampered by the high cost of dense mask annotations and by the poor generalization of supervised fine-tuning. The authors propose GRASP, a cascaded framework that decouples reasoning from segmentation: a pretrained multimodal LLM produces structured reasoning and spatial prompts, which are fed to a frozen SAM2 segmentation model. To improve generalization and reduce the annotation burden, they introduce PRIME, a pure reinforcement learning paradigm, and BoP-Rewards, a cost-aware reward scheme that replaces dense masks with bounding boxes and two positive points while enforcing a parsable reasoning format and localization accuracy. Evaluations on EarthReason, GeoPixInstruct, and a new GRASP-1k benchmark show state-of-the-art in-domain performance and gains of up to 54% in out-of-domain scenarios, supporting the effectiveness and scalability of RL with structured, cost-efficient supervision for geospatial pixel reasoning. The work also describes its data construction steps in detail and releases GRASP-1k to support robust cross-domain evaluation.
Abstract
Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches fine-tune multimodal large language models under supervision, with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. To reduce annotation costs, we design BoP-Rewards, which substitutes dense mask labels with bounding boxes and positive points. BoP-Rewards verifies outputs through two complementary signals: format, which constrains the reasoning and grounding structure to remain syntactically parsable, and accuracy, which evaluates the quality of predicted boxes and points. For evaluation, we train our method and all baselines on EarthReason and GeoPixInstruct, and construct an in-domain benchmark by merging their test sets. We further release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks. Experimental results demonstrate state-of-the-art (SOTA) in-domain performance and up to 54% improvement in out-of-domain scenarios, confirming that reinforcement learning with cost-aware rewards provides a robust and scalable paradigm for geospatial pixel reasoning. All code and datasets will be released publicly.
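To make the two reward signals concrete, the following is a minimal sketch of how a format check and an accuracy check over boxes and points could be combined into a scalar reward. It is an illustration under stated assumptions, not the reward defined by the authors: the `<think>`/`<answer>` template, the JSON keys `bbox` and `points`, the pixel tolerance `tol`, the unweighted sum, and all function names are hypothetical.

```python
import json
import math
import re

# Hypothetical output template (an assumption, not taken from the paper):
# <think>...</think><answer>{"bbox": [x1, y1, x2, y2], "points": [[x, y], ...]}</answer>
PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def _is_number_list(value, length):
    """True if value is a list of `length` ints/floats."""
    return (
        isinstance(value, list)
        and len(value) == length
        and all(isinstance(v, (int, float)) for v in value)
    )


def parse_response(text):
    """Return (bbox, points) if the response is parsable, else None."""
    match = PATTERN.match(text)
    if match is None:
        return None
    try:
        answer = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if not isinstance(answer, dict):
        return None
    bbox, points = answer.get("bbox"), answer.get("points")
    if not _is_number_list(bbox, 4):
        return None
    if not isinstance(points, list) or not points:
        return None
    if not all(_is_number_list(p, 2) for p in points):
        return None
    return bbox, points


def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_score(pred_points, gt_points, tol):
    """Fraction of predicted points within `tol` pixels of an annotated point."""
    hits = 0
    for px, py in pred_points:
        nearest = min(math.hypot(px - gx, py - gy) for gx, gy in gt_points)
        if nearest <= tol:
            hits += 1
    return hits / len(pred_points)


def bop_style_reward(text, gt_box, gt_points, tol=5.0):
    """Format reward plus accuracy reward; unparsable outputs earn nothing."""
    parsed = parse_response(text)
    if parsed is None:
        return 0.0
    bbox, points = parsed
    format_reward = 1.0
    accuracy_reward = box_iou(bbox, gt_box) + point_score(points, gt_points, tol)
    return format_reward + accuracy_reward
```

In this sketch the supervision for each sample is only a box and a few points, so no dense mask is needed to score a response. Gating the accuracy term on successful parsing is one possible design choice: it keeps the policy from being rewarded for outputs that the downstream segmentation model could not consume as prompts.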
