Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation

Chengyang Ye, Yunzhi Zhuge, Pingping Zhang

TL;DR

This work defines Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS) and introduces LandDiscover50K, a large-scale remote sensing image (RSI) dataset, to enable open-vocabulary segmentation in remote sensing imagery. It then presents GSNet, a framework that fuses RSI domain priors with generalist vision-language models via a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature Fusion (QGFF) module, and a Residual Information Preservation Decoder (RIPD). The approach achieves state-of-the-art performance on four OVRSISS benchmarks, and training on LandDiscover50K improves generalization and open-class segmentation in remote sensing. Overall, the dataset and method offer a practical path toward scalable open-vocabulary RSI semantic segmentation in real-world applications.

Abstract

Recently, deep learning-based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a pre-defined semantic class set and thus need additional image annotation and model training when adapting to new classes. More importantly, they are unable to segment arbitrary semantic classes. In this work, we introduce Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes. In addition, we propose a novel framework named GSNet that integrates the domain priors of specialist remote sensing models with the versatile capabilities of generalist vision-language models. Technically, GSNet consists of a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature Fusion (QGFF) module, and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both specialist and generalist models in dual streams. Then, with the guidance of variable vocabularies, QGFF integrates specialist and generalist features, enabling them to complement each other. Finally, RIPD aggregates multi-source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin, and our proposed LandDiscover50K improves the performance of OVRSISS methods. The proposed dataset and method will be made publicly available at https://github.com/yecy749/GSNet.
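
To make the data flow described in the abstract concrete, the following is a minimal NumPy sketch of open-vocabulary per-pixel classification with two image feature streams. It is not the authors' implementation: the function names (`query_guided_fusion`, `segment`), the stream-weighting rule, and the residual step are hypothetical stand-ins for QGFF and RIPD, and both streams are assumed to be already projected into the same embedding space as the text (class name) embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_guided_fusion(general, special, text):
    """Toy fusion (hypothetical, not the paper's QGFF).

    general, special: (H*W, D) per-pixel features from the two streams.
    text: (C, D) embeddings of the class names in the current vocabulary.
    Each pixel weights the two streams by how strongly each one matches
    its best-scoring class query.
    """
    g, s, t = l2_normalize(general), l2_normalize(special), l2_normalize(text)
    score_g = (g @ t.T).max(axis=1)  # best cosine similarity, generalist stream
    score_s = (s @ t.T).max(axis=1)  # best cosine similarity, specialist stream
    w = softmax(np.stack([score_g, score_s], axis=1), axis=1)  # (H*W, 2)
    return w[:, :1] * g + w[:, 1:] * s

def segment(general, special, text, hw):
    """Return an (H, W) map of class indices into the given vocabulary."""
    fused = query_guided_fusion(general, special, text)
    # Residual connection back to the generalist features: a toy stand-in
    # for the idea of preserving information during decoding.
    fused = l2_normalize(fused + l2_normalize(general))
    logits = fused @ l2_normalize(text).T  # (H*W, C) cosine similarities
    return logits.argmax(axis=1).reshape(hw)

# Usage with random placeholder features; a real system would take them
# from pre-trained image and text encoders.
rng = np.random.default_rng(0)
H, W, D = 4, 5, 16
general = rng.normal(size=(H * W, D))
special = rng.normal(size=(H * W, D))
for num_classes in (3, 7):  # the vocabulary size can change at test time
    text = rng.normal(size=(num_classes, D))
    mask = segment(general, special, text, (H, W))
    print(num_classes, mask.shape)
```

Because the class set enters only through the `text` matrix, the same image features can be scored against a different vocabulary without retraining, which is the property that distinguishes the open-vocabulary setting from a fixed-class segmentation head.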

Paper Structure

This paper contains 47 sections, 8 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Comparison of different learning paradigms for RSISS. (a) Fully supervised methods train and test on the same dataset. (b) Few-shot methods train on large annotated base classes and test on novel classes using a small support set. (c) Semi-supervised methods use large-scale unlabeled data with small-scale labeled base classes for training, then test on base classes. (d) Open-Vocabulary methods train on large-scale labeled data and test on arbitrary semantic classes. (e) Our framework is illustrated in brief.
  • Figure 2: Illustration of the semantic class distribution and visual samples from LandDiscover50K. Sample images are placed next to their corresponding semantic class tags.
  • Figure 3: The overall architecture of GSNet. DSIE consists of a generalist CLIP backbone and a specialist RSI backbone. The specialist RSI backbone is pre-trained on RSI using a self-supervised learning paradigm, while CLIP is pre-trained on image-text datasets using a contrastive learning paradigm. QGFF enables the dual-stream features to complement each other under the guidance of variable vocabularies. RIPD further aggregates the multi-source features for more accurate mask predictions.
  • Figure 4: Qualitative evaluation of GSNet. Our method outperforms CAT-SEG in both semantic understanding and edge prediction.
  • Figure 5: Performance of GSNet trained on different sizes of subsets of LandDiscover50K.
  • ...and 5 more figures