SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, Zhi Wang

TL;DR

SegEarth-OV introduces a training-free open-vocabulary segmentation framework tailored for remote sensing, leveraging SimFeatUp to upsample low-resolution CLIP features and a simple global-bias subtraction to improve dense predictions. By training SimFeatUp on unlabeled RS data and applying a CLS-aware bias correction, the method attains state-of-the-art performance across 17 RS datasets for semantic segmentation, building and road extraction, and flood detection. The work demonstrates the viability of RS-focused OVSS with a lightweight, plug-and-play approach that generalizes across RS modalities and scales, highlighting the potential of open-vocabulary perception in earth observation. Overall, SegEarth-OV delivers significant gains over natural-image–oriented OVSS baselines and provides a practical, training-free path toward scalable RS segmentation.

Abstract

Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing applications; however, a prevalent limitation remains the need for extensive manual annotation. To address this, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation that local patch tokens respond abnormally to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves average improvements of 5.8%, 8.2%, 4.0%, and 15.3% over state-of-the-art methods on the 4 tasks, respectively. All code is released at https://earth-insights.github.io/SegEarth-OV
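
The subtraction step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `segment_with_bias_subtraction`, the scaling weight `lam`, and the toy 4-dimensional features are assumptions made here for illustration, and the feature upsampling performed by SimFeatUp is omitted. The sketch subtracts the [CLS] token from every patch token, then assigns each patch to the class whose text embedding has the highest cosine similarity.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def segment_with_bias_subtraction(patch_feats, cls_token, text_embeds, lam=1.0):
    """Label each patch after removing the global ([CLS]) component.

    patch_feats: (H, W, D) patch tokens from the image encoder
    cls_token:   (D,)      global [CLS] token of the same image
    text_embeds: (C, D)    one text embedding per class name
    lam:         hypothetical weight on the subtracted token (assumption);
                 lam=1.0 is a plain subtraction, lam=0.0 disables it
    """
    # Broadcasting subtracts the same global vector from every patch.
    debiased = patch_feats - lam * cls_token
    # Cosine similarity between every patch and every class: (H, W, C).
    sims = l2_normalize(debiased) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=-1)


# Toy example: two patches, two classes, and a global token that
# leans towards class 1 and is mixed into both patches.
text_embeds = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0]])
cls_token = np.array([0.0, 2.0, 0.0, 0.0])
local = np.array([[[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]]])
patch_feats = local + cls_token

labels_raw = segment_with_bias_subtraction(patch_feats, cls_token, text_embeds, lam=0.0)
labels_debiased = segment_with_bias_subtraction(patch_feats, cls_token, text_embeds, lam=1.0)
print(labels_raw, labels_debiased)
```

In this toy setup the global component pushes both patches towards class 1 when nothing is subtracted, while the subtraction recovers the local evidence for the first patch.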

Paper Structure (19 sections, 10 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Visualization and performance of SegEarth-OV on open-vocabulary semantic segmentation of remote sensing images. We evaluate on 17 remote sensing datasets (including semantic segmentation, building extraction, road extraction, and flood detection tasks), and our SegEarth-OV consistently generates high-quality segmentation masks.
  • Figure 2: Limitations of state-of-the-art OVSS methods on remote sensing images. The two predictions on the left show distorted target shapes and ill-fitting boundaries. (Best viewed digitally with zoom, especially along object edges.)
  • Figure 3: Illustration of the proposed method. (a) Training process of SimFeatUp: CLIP is frozen, and of the trained modules only SimFeatUp is needed at inference. (b) Inference process of SegEarth-OV: the LR feature maps from CLIP are upsampled by SimFeatUp, and then the [CLS] token is subtracted to alleviate global bias. For better presentation, the color renderings follow FeatUp (Fu et al., 2024).
  • Figure 4: Comparison with and without the image reconstruction loss (\ref{eq:img_rec_loss}). The LR prediction is obtained directly from the output of CLIP (without bilinear interpolation). Color: building, tree, cropland, grass.
  • Figure 5: Comparison before and after alleviating the global bias. (a) Similarity map between the patch tokens and the [CLS] token; some “non-building” regions also show a high response. (b) Original RGB image. Note that the right-hand histograms stretch the raw values for better presentation.
  • ...and 5 more figures
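
The kind of patch-to-[CLS] similarity map mentioned in the Figure 5 caption can be computed along the following lines. This is a hedged sketch, not the authors' plotting code: the function name `cls_similarity_map` and the toy 2-dimensional features are assumptions, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def cls_similarity_map(patch_feats, cls_token, eps=1e-8):
    """Cosine similarity between each patch token and the [CLS] token.

    patch_feats: (H, W, D) patch tokens
    cls_token:   (D,)      global [CLS] token
    returns:     (H, W)    similarity map with values in [-1, 1]
    """
    patches = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    cls_unit = cls_token / (np.linalg.norm(cls_token) + eps)
    return patches @ cls_unit


# Toy example: the first patch is aligned with the global token,
# the second is orthogonal to it.
toy_patches = np.array([[[1.0, 0.0],
                         [0.0, 1.0]]])
toy_cls = np.array([1.0, 0.0])
sim_map = cls_similarity_map(toy_patches, toy_cls)
print(sim_map)
```

A map like this makes it easy to spot patches whose response to the global token is high regardless of their local content, which is the behaviour the subtraction is meant to reduce.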