Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

TL;DR

Open-vocabulary remote sensing segmentation often mislabels land-cover classes that share similar spectral features, because it lacks geospatial context. The authors propose GR-CoT, a dual-stream framework. An offline knowledge distillation stream defines Category Interpretation Standards, and an online reasoning stream builds an image-adaptive vocabulary through macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. That vocabulary then guides pixel-level segmentation. Key contributions are the formalization of Category Interpretation Standards, the image-adaptive vocabulary generation mechanism, and strong quantitative results on LoveDA and GID5, together with qualitative evidence of reduced semantic confusion. The work brings geospatial logic into open-vocabulary segmentation, aiming at more accurate, context-aware land-cover mapping in complex geographies.

Abstract

Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
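
To make the data flow of the two streams concrete, the following is a minimal sketch and not the authors' implementation. It assumes the interpretation standards can be reduced to a lookup table and the MLLM reasoning steps to precomputed values. The names `INTERPRETATION_STANDARDS`, `SceneEvidence`, and `synthesize_vocabulary`, as well as the categories and cues, are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SceneEvidence:
    """Stand-in for the intermediate outputs of the online reasoning chain."""
    scenario: str      # result of macro-scenario anchoring, e.g. "urban"
    visual_cues: set   # attributes found by visual feature decoupling


# Stand-in for the offline stream: each category lists the scenarios in which
# it is plausible and the visual cues it requires. These entries are
# hand-written placeholders; the paper distills its standards with an MLLM.
INTERPRETATION_STANDARDS = {
    "farmland": {"scenarios": {"rural"}, "cues": {"regular parcels"}},
    "meadow": {"scenarios": {"rural", "urban"}, "cues": {"smooth green texture"}},
    "forest": {"scenarios": {"rural", "urban"}, "cues": {"rough canopy texture"}},
    "building": {"scenarios": {"rural", "urban"}, "cues": {"rectilinear roofs"}},
}


def synthesize_vocabulary(evidence, standards):
    """Keep categories whose standard agrees with the scenario and the cues.

    This plays the role of knowledge-driven decision synthesis: the output is
    an image-adaptive vocabulary for a downstream segmentation model.
    """
    vocabulary = []
    for name, rule in standards.items():
        scenario_ok = evidence.scenario in rule["scenarios"]
        cues_ok = rule["cues"] <= evidence.visual_cues  # all required cues seen
        if scenario_ok and cues_ok:
            vocabulary.append(name)
    return vocabulary


# Hypothetical image: an urban scene with lawns and roofs.
evidence = SceneEvidence(
    scenario="urban",
    visual_cues={"smooth green texture", "rectilinear roofs"},
)
vocabulary = synthesize_vocabulary(evidence, INTERPRETATION_STANDARDS)
print(vocabulary)  # ['meadow', 'building']
```

In this toy case "farmland" is excluded because the scene was anchored as urban, even though green vegetation is present. This is the kind of context-based disambiguation the abstract describes.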

Paper Structure

This paper contains 10 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The proposed framework of geospatial reasoning chain-of-thought (GR-CoT) for remote sensing semantic segmentation. The architecture consists of two collaborative streams: an offline knowledge distillation stream (top) and an online instance reasoning stream (bottom). The offline stream establishes category interpretation standards via fine-grained discrimination to resolve semantic ambiguity. The online stream sequentially performs macro-scenario anchoring and visual feature decoupling, which are integrated in the knowledge-driven decision synthesis stage to generate an image-adaptive vocabulary. This refined vocabulary is then utilized by an open-vocabulary segmentation model to produce the final semantic mapping results.
  • Figure 2: Visualized results on the LoveDA dataset.
  • Figure 3: Visualized results on the GID5 dataset.
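
The Figure 1 caption states that the refined vocabulary is passed to an open-vocabulary segmentation model. The sketch below illustrates why restricting the candidate vocabulary can change pixel labels. It assumes a generic similarity-based labeling scheme, in which each pixel embedding is matched to the closest text embedding by cosine similarity. This is a common pattern, not necessarily the model used in the paper. The 2-D embeddings and the helper names `cosine` and `label_pixels` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_pixels(pixel_embeddings, text_embeddings, vocabulary):
    """Assign each pixel the vocabulary entry with the highest similarity."""
    return [
        [max(vocabulary, key=lambda c: cosine(px, text_embeddings[c])) for px in row]
        for row in pixel_embeddings
    ]


# Toy text embeddings: "farmland" and "meadow" are deliberately close,
# mimicking classes with similar spectral appearance.
TEXT = {"farmland": [1.0, 0.0], "meadow": [0.9, 0.3], "water": [0.0, 1.0]}

# A 1 x 2 "image" of per-pixel embeddings (hypothetical values).
PIXELS = [[[1.0, 0.2], [0.1, 1.0]]]

full = label_pixels(PIXELS, TEXT, ["farmland", "meadow", "water"])
adaptive = label_pixels(PIXELS, TEXT, ["farmland", "water"])

print(full)      # [['meadow', 'water']]
print(adaptive)  # [['farmland', 'water']]
```

With the full vocabulary the first pixel is matched to "meadow". Once reasoning has ruled that class out for this image, the same pixel falls to "farmland", while the unambiguous water pixel is unaffected.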