Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation
Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang
TL;DR
Open-vocabulary remote sensing segmentation often mislabels classes with similar spectral features because it lacks geospatial context. The authors propose GR-CoT, a dual-stream framework. An offline knowledge distillation stream defines Category Interpretation Standards, and an online reasoning stream generates an image-adaptive vocabulary through macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis; this vocabulary then guides pixel-level segmentation. Key contributions include the formalization of Category Interpretation Standards, the image-adaptive vocabulary generation mechanism, and strong quantitative results on LoveDA and GID5, along with qualitative evidence of reduced semantic confusion. The work introduces geospatial logic into open-vocabulary segmentation, enabling more accurate, context-aware land-cover mapping in complex geographies.
Abstract
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
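
The online reasoning sequence described above can be illustrated with a toy sketch. This is not the authors' implementation: in GR-CoT these steps are carried out by an MLLM, whereas here the scene label and visual cues are supplied by hand. The `STANDARDS` table, the class names, the cue strings, and the function `adaptive_vocabulary` are all hypothetical stand-ins, chosen only to show how scene context and interpretation standards could jointly narrow the vocabulary passed to a segmentation model.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the offline "category interpretation standards":
# each class lists the scenes it is plausible in and the cues that support it.
STANDARDS = {
    "agricultural land": {"scenes": {"rural"},
                          "cues": {"regular plots", "green texture"}},
    "urban green space": {"scenes": {"urban"},
                          "cues": {"green texture", "adjacent buildings"}},
    "water": {"scenes": {"rural", "urban"},
              "cues": {"dark smooth surface"}},
}


@dataclass
class Observation:
    scene: str  # assumed output of macro-scenario anchoring
    cues: set   # assumed output of visual feature decoupling


def adaptive_vocabulary(obs, standards=STANDARDS):
    """Toy decision synthesis: keep every class whose standard is
    consistent with the anchored scene and shares at least one cue."""
    return sorted(
        name
        for name, std in standards.items()
        if obs.scene in std["scenes"] and obs.cues & std["cues"]
    )


# Green texture alone is ambiguous; the rural scene anchor excludes
# "urban green space" from the vocabulary.
obs = Observation(scene="rural", cues={"green texture", "dark smooth surface"})
vocab = adaptive_vocabulary(obs)
```

In this sketch the cue "green texture" matches two classes, and only the scene anchor separates them, which mirrors the abstract's point that appearance alone is insufficient for spectrally similar land-cover types.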
