Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model

Huiying Shi, Zhihong Tan, Zhihan Zhang, Hongchen Wei, Yaosi Hu, Yingxue Zhang, Zhenzhong Chen

TL;DR

RS-SQA introduces an unsupervised framework for remote sensing semantic segmentation quality assessment that fuses high-level semantic features from an RS-tailored Vision-Language Model (CLIP-RS) with low-level segmentation features. CLIP-RS is pretrained on a purified 10M RS image-text dataset to robustly capture geo-semantic cues. RS-SQA integrates the semantic and segmentation information with a Simple Cross-Gating Block and passes the result to a two-layer MLP that predicts a quality score between 0 and 1. The RS-SQED dataset provides OA-based segmentation quality labels for eight methods across multiple RS datasets, enabling evaluation and method ranking. In the reported experiments, RS-SQA outperforms classical NR-IQA baselines and many deep-learning NR-IQA models, and shows practical value in recommending the best segmentation method for a given RSI. Ablations highlight the contributions of data purification, model design, and loss terms.
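As a rough illustration of the fusion step summarized above, the following sketch gates each feature branch with a sigmoid computed from the other branch, concatenates the results, and feeds them to a two-layer MLP with a sigmoid output. This is a generic cross-gating pattern, not the paper's Simple Cross-Gating Block, whose exact design is not given in this summary. The feature dimension, the random weights, and the helper names (`linear`, `init_layer`, `cross_gate`, `quality_head`) are all assumptions made for the example.

```python
import math
import random


def linear(x, weight, bias):
    """Dense layer: `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (illustrative, untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def cross_gate(sem, seg, gate_from_sem, gate_from_seg):
    """Scale each branch element-wise by a sigmoid gate computed from the other branch."""
    gate_for_seg = [sigmoid(z) for z in linear(sem, *gate_from_sem)]
    gate_for_sem = [sigmoid(z) for z in linear(seg, *gate_from_seg)]
    gated_sem = [g * v for g, v in zip(gate_for_sem, sem)]
    gated_seg = [g * v for g, v in zip(gate_for_seg, seg)]
    return gated_sem + gated_seg  # concatenate the two gated branches


def quality_head(fused, hidden_layer, output_layer):
    """Two-layer MLP: ReLU hidden layer, then a sigmoid so the score lies in (0, 1)."""
    hidden = [max(0.0, z) for z in linear(fused, *hidden_layer)]
    (logit,) = linear(hidden, *output_layer)
    return sigmoid(logit)


rng = random.Random(0)
dim = 8  # assumed feature size, chosen only to keep the example small
sem_feat = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for a semantic feature vector
seg_feat = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for a pooled segmentation feature

gate_from_sem = init_layer(rng, dim, dim)
gate_from_seg = init_layer(rng, dim, dim)
hidden_layer = init_layer(rng, 2 * dim, 16)
output_layer = init_layer(rng, 16, 1)

fused = cross_gate(sem_feat, seg_feat, gate_from_sem, gate_from_seg)
score = quality_head(fused, hidden_layer, output_layer)
print(f"predicted quality score: {score:.3f}")
```

The final sigmoid is one simple way to keep the prediction in the same range as an accuracy-style label; the weights here are untrained, so the printed value carries no meaning.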

Abstract

The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods for remote sensing imagery (RSI) in supervised real-world scenarios. This makes the evaluation of semantic segmentation quality in such scenarios an issue to be resolved. However, most existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such scenarios. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on a vision language model (VLM). This framework leverages a pre-trained RS VLM for semantic understanding and utilizes intermediate features from segmentation methods to extract implicit information about segmentation quality. Specifically, we introduce CLIP-RS, a large-scale pre-trained VLM trained with purified text to reduce textual noise and capture robust semantic information in the RS domain. Feature visualizations confirm that CLIP-RS can effectively differentiate between various levels of segmentation quality. Semantic features and low-level segmentation features are integrated through a semantic-guided approach to enhance evaluation accuracy. To further support the development of RS semantic segmentation quality assessment, we present RS-SQED, a dedicated dataset sampled from four major RS semantic segmentation datasets and annotated with segmentation accuracy derived from the inference results of eight representative segmentation methods. Experimental results on the established dataset demonstrate that RS-SQA significantly outperforms state-of-the-art quality assessment models. This provides essential support for predicting segmentation accuracy and high-quality semantic segmentation interpretation, offering substantial practical value.
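The accuracy labels mentioned above are based on Overall Accuracy (OA), which in its standard form is the fraction of labeled pixels whose predicted class matches the ground truth. The sketch below computes that standard quantity for a single image. The function name, the optional `ignore_label` parameter, and the toy masks are assumptions for illustration; the exact labeling protocol used for RS-SQED is not described in this summary.

```python
def overall_accuracy(pred_mask, gt_mask, ignore_label=None):
    """Fraction of labeled pixels where the predicted class equals the ground truth.

    Masks are equal-sized 2D lists of integer class ids. Pixels whose ground
    truth equals `ignore_label` are skipped (a common convention, assumed here).
    """
    correct = 0
    total = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for pred, gt in zip(pred_row, gt_row):
            if gt == ignore_label:
                continue
            total += 1
            correct += int(pred == gt)
    if total == 0:
        raise ValueError("no labeled pixels to evaluate")
    return correct / total


# Toy 2x3 masks with three classes (made-up values, not from any dataset).
gt = [[0, 0, 1],
      [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]

oa_all = overall_accuracy(pred, gt)                      # 4 of 6 pixels match
oa_no_class2 = overall_accuracy(pred, gt, ignore_label=2)  # 3 of 4 remaining pixels match
print(f"OA over all pixels: {oa_all:.4f}")
print(f"OA ignoring class 2: {oa_no_class2:.4f}")
```

Because OA already lies between 0 and 1, it can serve directly as a regression target for a quality predictor without rescaling.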

Paper Structure

This paper contains 35 sections, 18 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The workflow for using the RS-SQA model to assist users in achieving optimal semantic segmentation. Stage 1: Evaluate the semantic segmentation quality score for each available method. Stage 2: Rank the methods based on their quality scores and select the top-performing one. Stage 3: Apply the recommended model to segment the input image.
  • Figure 2: Illustration of Our Framework. High-level semantic features are extracted from the CLIP-RS visual encoder, while deep segmentation features are obtained from the RS semantic segmentation model and simplified via average pooling. The features from both branches are fused using a cross-gating block and then passed to a quality prediction head that generates the quality score.
  • Figure 3: Data Purification Process of the CLIP-RS Dataset. (Left) The data purification workflow for CLIP-RS dataset. Stage 1: Train CLIP to obtain $\text{CLIP}_{\text{Sem}}$ with high-quality captions. Stage 2: Use the pre-trained $\text{CLIP}_{\text{Sem}}$ to calculate image-text similarity. Stage 3: Employ a remote sensing multi-modal large language model (MLLM) to regenerate captions for low-quality data. (Right) Examples of captioning results, showing initial low-quality image-text pairs and their corresponding purified captions.
  • Figure 4: Representative visualizations of features on remote sensing semantic segmentation datasets. From left to right are raw images, the features extracted by UNetFormer [wang2022unetformer], MANet [9487010], DC-Swin [wang2022novel], BANet [wang2021transformer], A2FPN [li2022a2], and the ground truth labels, respectively. Samples from the ISPRS Potsdam, ISPRS Vaihingen, LoveDA, UAVid, and FloodNet datasets are shown in (a)-(e), respectively.
  • Figure 5: Scatter plots between the predicted Overall Accuracy (OA) and the ground truth OA. The predicted OA is derived from models trained on ground truth segmented by UNetFormer [wang2022unetformer], MANet [9487010], DC-Swin [wang2022novel], AerialFormer [rs16162930], BANet [wang2021transformer], A2FPN [li2022a2], ABCNet [li2021abcnet], and UperNet (RSP-ViTAEv2-S) [9782149], respectively (corresponding to subplots (a)-(h)). From left to right, the results correspond to RS-SQED, ISPRS Vaihingen and Potsdam, LoveDA, UAVid, and FloodNet datasets.
  • ...and 1 more figures
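
The three-stage workflow in the Figure 1 caption (score each method, rank, apply the top one) can be sketched as below. This is a minimal sketch under stated assumptions: the scorer and segmenter callables are hypothetical stand-ins for a trained quality predictor and real segmentation models, the method names are placeholders, and the returned scores are made-up constants rather than results from the paper.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[object], float]
Segmenter = Callable[[object], object]


def rank_methods(image: object, scorers: Dict[str, Scorer]) -> List[Tuple[str, float]]:
    """Stages 1 and 2: predict a quality score per method, then sort best first."""
    if not scorers:
        raise ValueError("at least one candidate method is required")
    scores = {name: scorer(image) for name, scorer in scorers.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def segment_with_best(image: object,
                      scorers: Dict[str, Scorer],
                      segmenters: Dict[str, Segmenter]) -> Tuple[str, object]:
    """Stage 3: run only the top-ranked method on the input image."""
    best_name, _ = rank_methods(image, scorers)[0]
    return best_name, segmenters[best_name](image)


# Hypothetical stubs: each scorer returns a fixed, made-up quality score.
scorers = {
    "method_a": lambda img: 0.71,
    "method_b": lambda img: 0.88,
    "method_c": lambda img: 0.64,
}
# Hypothetical stubs: each segmenter just tags the image with its method name.
segmenters = {name: (lambda img, n=name: f"mask from {n}") for name in scorers}

ranking = rank_methods("example_image", scorers)
best, mask = segment_with_best("example_image", scorers, segmenters)
print(ranking)
print(best, "->", mask)
```

Only the selected method is run for segmentation, so the cost of choosing is one quality prediction per candidate rather than one full segmentation per candidate. That saving holds only if each quality prediction is cheaper than running the segmenter, which this sketch assumes rather than demonstrates.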