GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Lifan Jiang, Yuhang Pei, oxi Wu, Yan Zhao, Tianrun Wu, Shulong Yu, Lihui Zhang, Deng Cai

TL;DR

GeoSeg is a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. It couples MLLM reasoning with precise localization via a dual-route prompting mechanism that fuses semantic intent with fine-grained spatial cues.

Abstract

Recent advances in MLLMs are reframing segmentation from fixed-category prediction to instruction-grounded localization. While reasoning-based segmentation has progressed rapidly in natural scenes, remote sensing lacks a generalizable solution due to the prohibitive cost of reasoning-oriented data and domain-specific challenges like overhead viewpoints. We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via: (i) bias-aware coordinate refinement to correct systematic grounding shifts and (ii) a dual-route prompting mechanism to fuse semantic intent with fine-grained spatial cues. We also introduce GeoSeg-Bench, a diagnostic benchmark of 810 image-query pairs with hierarchical difficulty levels. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.

Paper Structure (23 sections, 7 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Performance across reasoning difficulty levels. We visualize our proposed GeoSeg's results on image-query pairs across three difficulty levels. The visualized masks illustrate that GeoSeg can handle varying instruction complexity and remain effective in challenging scenarios. Please refer to the appendix (app:baseline) for more results.
  • Figure 2: Overview of the GeoSeg pipeline. Given a remote sensing image $I$ and a natural language query $q$, the pipeline operates in three stages: (1) Reasoning-Driven Grounding: The MLLM ($\mathcal{L}$) generates a coarse bounding box $b$ and extracts the object prompt $p$. (2) Bias-Aware Coordinate Refinement: To mitigate grounding bias, the box is adjusted via asymmetric expansion ($\alpha, \beta$) to yield a refined RoI $I_{b'}$. (3) Dual-Route Segmentation & Fusion: Within the RoI, we perform parallel segmentation using Route A (Visual Cues via CLIP Surgery) and Route B (Semantic Cues via SAM3 with prompt $p$). The final prediction $\hat{M}$ is obtained by integrating both paths via Intersection-First Fusion.
  • Figure 3: Quantification of domain-specific grounding drift. We analyze coordinate offsets on a held-out calibration set comprising 1,000 images randomly sampled from LoveDA, NWPU-VHR-10, and DIOR datasets. The KDE visualization reveals a systematic bottom-right shift inherent to pre-trained MLLMs under overhead views, necessitating our statistically derived asymmetric expansion ($\alpha=0.2, \beta=0.1$).
  • Figure 4: Overview of GeoSeg-Bench. (a) Representative Domains: We showcase samples from the four domains defined in our scenario taxonomy: Urban, Rural, Traffic, and Nature. (b) Hierarchical Difficulty Design: Using the Traffic domain as a case study, we illustrate the progression across three levels: Basic (Level 1), Description (Level 2), and Reasoning (Level 3), corresponding to increasing reasoning requirements.
  • Figure 5: Distribution statistics of our GeoSeg-Bench. Left: Proportional breakdown of four scenario categories (Urban, Traffic, Rural, Nature). Right: Fixed ratio composition of three hierarchical difficulty levels (Basic, Description, Reasoning).
  • ...and 6 more figures
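
The captions for Figures 2 and 3 mention asymmetric box expansion with $\alpha=0.2, \beta=0.1$ and an Intersection-First Fusion of the two route masks, but do not spell out either rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes $\alpha$ pads the top-left side and $\beta$ the bottom-right side (as fractions of the box size, to compensate for a bottom-right drift), and that fusion keeps the intersection of the two masks when it is non-empty and otherwise falls back to their union. The function names `expand_box` and `intersection_first_fusion` are hypothetical.

```python
import numpy as np


def expand_box(box, img_w, img_h, alpha=0.2, beta=0.1):
    """Asymmetrically expand an (x1, y1, x2, y2) box and clip it to the image.

    Assumption: alpha pads the top-left side and beta the bottom-right side,
    both as fractions of the box width/height. The source only gives the
    values alpha=0.2 and beta=0.1, not how they are applied.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    new_x1 = max(0.0, x1 - alpha * w)
    new_y1 = max(0.0, y1 - alpha * h)
    new_x2 = min(float(img_w), x2 + beta * w)
    new_y2 = min(float(img_h), y2 + beta * h)
    return new_x1, new_y1, new_x2, new_y2


def intersection_first_fusion(mask_a, mask_b):
    """Fuse two binary masks, preferring the region both routes agree on.

    Assumption: when the masks do not overlap at all, fall back to their
    union. The actual fallback rule is not stated in the captions.
    """
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    intersection = mask_a & mask_b
    if intersection.any():
        return intersection
    return mask_a | mask_b


if __name__ == "__main__":
    # A 100x100 box inside a 1000x1000 image grows more toward the top-left.
    print(expand_box((100, 100, 200, 200), img_w=1000, img_h=1000))

    route_a = np.array([[1, 1], [0, 0]])  # e.g. mask from the visual-cue route
    route_b = np.array([[1, 0], [0, 0]])  # e.g. mask from the semantic-cue route
    print(intersection_first_fusion(route_a, route_b).astype(int))
```

Preferring the intersection favors precision when both routes agree, while the assumed union fallback avoids returning an empty mask when they disagree entirely.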