Embedding Generalized Semantic Knowledge into Few-Shot Remote Sensing Segmentation

Yuyu Jia, Wei Huang, Junyu Gao, Qi Wang, Qiang Li

TL;DR

The paper tackles few-shot segmentation (FSS) in remote sensing (RS), where pronounced intra-class variation hurts methods that rely only on sparse visual cues from a few support samples. It introduces Holistic Semantic Embedding (HSE), which injects class description (CD) embeddings from language models into the feature extraction stage via two modules: Spatial Dense Interaction (SDI) and Global Content Modulation (GCM). SDI fuses CD information with spatial support features through self-attention, while GCM modulates support and query features channel-wise to emphasize global category content. Together they produce robust class-specific guidance for segmentation. On the iSAID-$5^{i}$ benchmark, HSE achieves state-of-the-art results in both 1-shot and 5-shot settings with different language models, supporting the benefit of incorporating general semantic knowledge into RS FSS.

Abstract

Few-shot segmentation (FSS) for remote sensing (RS) imagery leverages supporting information from limited annotated samples to achieve query segmentation of novel classes. Previous efforts are dedicated to mining segmentation-guiding visual cues from a constrained set of support samples. However, they still struggle to address the pronounced intra-class differences in RS images, as sparse visual cues make it challenging to establish robust class-specific representations. In this paper, we propose a holistic semantic embedding (HSE) approach that effectively harnesses general semantic knowledge, i.e., class description (CD) embeddings. Instead of the naive combination of CD embeddings and visual features for segmentation decoding, we investigate embedding the general semantic knowledge during the feature extraction stage. Specifically, in HSE, a spatial dense interaction module allows the interaction of visual support features with CD embeddings along the spatial dimension via self-attention. Furthermore, a global content modulation module efficiently augments the global information of the target category in both support and query features, thanks to the transformative fusion of visual features and CD embeddings. These two components holistically synergize general CD embeddings and visual cues, constructing a robust class-specific representation. Through extensive experiments on the standard FSS benchmark, the proposed HSE approach demonstrates superior performance compared to peer work, setting a new state-of-the-art.
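
The spatial dense interaction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the CD embedding is projected to the visual channel width, prepended as one extra token to the flattened support feature map, and mixed with the spatial tokens by a single-head self-attention layer with a residual connection. The function name, the weight matrices (w_cd, w_q, w_k, w_v), and all tensor sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_dense_interaction(support_feat, cd_embedding, rng):
    """Illustrative SDI-style fusion (assumed design, not the paper's code).

    support_feat: (C, H, W) visual support feature map.
    cd_embedding: (D,) class description embedding from a language model.
    Returns the fused (C, H, W) feature map and the attention matrix.
    """
    c, h, w = support_feat.shape
    d = cd_embedding.shape[0]

    # Hypothetical weights; random here, learned in a real model.
    w_cd = rng.standard_normal((d, c)) / np.sqrt(d)
    w_q, w_k, w_v = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))

    # Flatten the spatial grid into H*W tokens and prepend the CD token.
    visual_tokens = support_feat.reshape(c, h * w).T             # (HW, C)
    cd_token = (cd_embedding @ w_cd)[None, :]                    # (1, C)
    tokens = np.concatenate([cd_token, visual_tokens], axis=0)   # (1 + HW, C)

    # Single-head self-attention: every spatial token can attend to the CD token.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)                # (1 + HW, 1 + HW)
    fused = tokens + attn @ v                                    # residual connection

    # Drop the CD token and restore the spatial layout.
    return fused[1:].T.reshape(c, h, w), attn


rng = np.random.default_rng(0)
support = rng.standard_normal((8, 4, 4))   # toy sizes: C=8, H=W=4
cd = rng.standard_normal(16)               # toy CD embedding, D=16
fused_support, attention = spatial_dense_interaction(support, cd, rng)
```

Treating the CD embedding as an additional token is one simple way to let each spatial location exchange information with the class description; the paper may realize the interaction differently.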

Paper Structure (28 sections, 11 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Pronounced intra-class differences in remote sensing images.
  • Figure 2: Comparison between existing FSS methods and the proposed HSE. (a) Many FSS algorithms adhere to a segmentation-guiding paradigm, primarily conducting research from two aspects: class-specific cues extraction (CCE) and segmentation decoder (SD). (b) Recent work, MIANet (Yang et al.), introduces general semantic knowledge and combines it with visual support prototypes for segmentation guidance. (c) We further explore a holistic semantic embedding (HSE) approach that exploits the capabilities of general semantic knowledge through spatial dense interaction and global content modulation.
  • Figure 3: Pipeline of the HSE method. In the initial feature extraction stage, it extracts mid- and high-level support and query features, query prior masks, and CD embeddings. To embed the general semantic knowledge from CD embeddings into visual cues and establish robust class-specific segmentation guidance, we design two sequential, complementary modules. The SDI module facilitates the spatial dense interaction between general semantic knowledge and individual-specific visual features. The GCM module enhances global content relevant to the target category in the support and query features through modulation coefficients. Finally, the constructed robust class-specific representation, together with the modulated query feature and query prior mask, is fed into the decoder as segmentation guidance, yielding the query prediction mask.
  • Figure 4: Structure of the SDI (a) and GCM (b) modules.
  • Figure 5: Ablation studies on the different designs of SDI and GCM modules under the $1$-shot setting.
  • ...and 2 more figures
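
The channel-wise modulation attributed to the GCM module in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' design: it assumes that the modulation coefficients come from a linear layer followed by a sigmoid, applied to the concatenation of the globally average-pooled support feature and the CD embedding, and that the same coefficients rescale both support and query features. The function name, the weight matrix w_fuse, and the tensor sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def global_content_modulation(support_feat, query_feat, cd_embedding, rng):
    """Illustrative GCM-style channel modulation (assumed design).

    support_feat, query_feat: (C, H, W) feature maps.
    cd_embedding: (D,) class description embedding.
    Returns modulated support, modulated query, and the (C,) coefficients.
    """
    c = support_feat.shape[0]
    d = cd_embedding.shape[0]

    # Hypothetical fusion weights; random here, learned in a real model.
    w_fuse = rng.standard_normal((c + d, c)) / np.sqrt(c + d)

    # Global visual descriptor: average over the spatial dimensions.
    global_visual = support_feat.mean(axis=(1, 2))                # (C,)

    # Fuse visual and semantic descriptors into one coefficient per channel.
    fused = np.concatenate([global_visual, cd_embedding])         # (C + D,)
    coeff = 1.0 / (1.0 + np.exp(-(fused @ w_fuse)))               # sigmoid, in (0, 1)

    # Broadcast the per-channel coefficients over H and W.
    scale = coeff[:, None, None]
    return support_feat * scale, query_feat * scale, coeff


rng = np.random.default_rng(0)
support = rng.standard_normal((8, 4, 4))   # toy sizes: C=8, H=W=4
query = rng.standard_normal((8, 4, 4))
cd = rng.standard_normal(16)               # toy CD embedding, D=16
mod_support, mod_query, coeff = global_content_modulation(support, query, cd, rng)
```

Sharing one coefficient vector across support and query keeps the two feature maps in a comparable channel space, which is one plausible reading of "both support and query features" in the abstract.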