ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation

Shiqi Huang, Shuting He, Bihan Wen

TL;DR

ZoRI addresses zero-shot remote sensing instance segmentation by bridging the domain gap between vision-language models and aerial imagery. It introduces three components: a discrimination-enhanced classifier (DEC) that sharpens semantic distinctions between classes, a knowledge-maintained adaptation (KMA) strategy that tailors CLIP features to the remote sensing domain while preserving cross-modal alignment, and a prior-injected prediction (PIP) that uses a cache bank of aerial prototypes to cover intra-class variability. The approach yields state-of-the-art results on iSAID and NWPU-VHR-10 under both the zero-shot (ZSRI) and generalized zero-shot (GZSRI) settings, and ablations show additive benefits from DEC, KMA, and PIP. This work advances practical zero-shot segmentation in earth observation and offers a pathway for integrating aerial priors with powerful vision-language representations.

Abstract

Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closed-set predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Challenges arise when classifying aerial categories with high inter-class similarity and intra-class variance. In addition, the domain gap between vision-language models' pretraining datasets and remote sensing datasets hinders the zero-shot capabilities of the pretrained model when it is directly applied to remote sensing images. To address these challenges, we propose a Zero-Shot Remote Sensing Instance Segmentation framework, dubbed ZoRI. Our approach features a discrimination-enhanced classifier that uses refined textual embeddings to increase the awareness of class disparities. Instead of direct fine-tuning, we propose a knowledge-maintained adaptation strategy that decouples semantic-related information to preserve the pretrained vision-language alignment while adjusting features to capture remote sensing domain-specific visual cues. Additionally, we introduce a prior-injected prediction with a cache bank of aerial visual prototypes to supplement the semantic richness of text embeddings and seamlessly integrate aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments demonstrate that ZoRI achieves state-of-the-art performance on the zero-shot remote sensing instance segmentation task. Our code is available at https://github.com/HuangShiqi128/ZoRI.

Paper Structure

This paper contains 40 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of zero-shot remote sensing instance segmentation, which transfers the learned semantic knowledge from seen classes, e.g., harbor and ship, to the unseen class, e.g., swimming pool.
  • Figure 2: (a) After refinement, the head of the plane is highlighted and the activation map strictly follows its shape; for tennis court, the activation map is more focused, and the instance in the middle that was missed with the original channels is also emphasized. (b) Classes such as basketball court and tennis court share similar color and shape, whereas instances of ship can have various appearances. (c) Remote sensing images are in bird's-eye view, while natural images are taken from a ground-level perspective.
  • Figure 3: Overview of the ZoRI framework. The CLIP image encoder is partially trained with knowledge-maintained adaptation (KMA) to extract backbone features, which are then fed into a mask generator to produce mask predictions and class embeddings. The discrimination-enhanced classifier (DEC), constructed by refining the original text embeddings, is then used to classify the class embeddings. During inference, a cache bank is incorporated into CLIP zero-shot predictions to derive prior-injected predictions (PIP). The final classification probability is obtained through an ensemble approach.
  • Figure 4: Comparison of GZSRI results: (top row) ground truth, (middle row) FC-CLIP [yu2023convolutions], and (bottom row) our results. ZoRI successfully segments unseen objects that FC-CLIP misses because of the domain gap, e.g., helicopter, swimming pool, and soccer ball field in the first three columns. It also correctly identifies similar categories that FC-CLIP misclassifies because of class ambiguity, e.g., tennis court and harbor in the last two columns. ZoRI achieves these better results by constructing a more discriminative model adapted to the remote sensing domain.
  • Figure 5: (Best viewed in color) t-SNE visualization of text embeddings. Crosses of the same color represent text embeddings produced using different prompt templates, with each class distinguished by a different color. The surrounding circle is added for better visualization, with the centroid representing the mean of text embeddings from the same class, and the radius being the maximum distance between any text embedding and the centroid.
  • ...and 2 more figures
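
The circle construction described in the Figure 5 caption (centroid as the mean of a class's text embeddings, radius as the maximum distance from any of them to that centroid) can be sketched as follows. This is only an illustrative sketch, not the authors' plotting code: the function name `class_circles`, the toy 2-D points, and the use of Euclidean distance are assumptions made for the example.

```python
import math
from collections import defaultdict


def class_circles(embeddings, labels):
    """Return {label: (centroid, radius)} for a set of labelled points.

    Hypothetical helper: the centroid is the per-class mean and the radius is
    the largest Euclidean distance from any point of the class to its centroid.
    """
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)

    circles = {}
    for label, vecs in groups.items():
        dim = len(vecs[0])
        # Mean of all embeddings that belong to this class.
        centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        # Farthest embedding from the centroid defines the circle radius.
        radius = max(math.dist(v, centroid) for v in vecs)
        circles[label] = (centroid, radius)
    return circles


# Toy 2-D points standing in for projected text embeddings,
# two prompt templates per class (values are made up).
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 14.0)]
labels = ["ship", "ship", "harbor", "harbor"]
circles = class_circles(points, labels)
```

Under this reading, a smaller radius means the prompt templates of a class produce more tightly clustered text embeddings, and non-overlapping circles suggest better-separated classes.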