Multi-Text Guided Few-Shot Semantic Segmentation

Qiang Jiao, Bin Yan, Yi Yang, Mengrui Shi, Qiang Zhang

TL;DR

MTGNet tackles incomplete foreground activation and large intra-class variation in few-shot semantic segmentation by leveraging multiple class-specific textual descriptions and cross-modal refinement. The model introduces three key components: Multi-Textual Prior Refinement (MTPR) to diversify and stabilize textual priors, Text Anchor Feature Fusion (TAFF) to align support and query features via semantic anchors, and Foreground Confidence-Weighted Attention (FCWA) to suppress noisy support regions. This dual-branch framework produces robust textual and visual priors, which are fused through an HDMNet-based decoder, yielding state-of-the-art performance on PASCAL-5^i and COCO-20^i in both 1- and 5-shot settings, with pronounced gains on folds with high intra-class variation. The approach demonstrates that enriching textual semantics and enforcing cross-modal consistency significantly enhances few-shot segmentation in challenging scenes.

Abstract

Recent CLIP-based few-shot semantic segmentation (FSS) methods introduce class-level textual priors to assist segmentation, typically using a single prompt (e.g., "a photo of [class]"). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5^i and 57.4% on COCO-20^i, with notable improvements in folds exhibiting high intra-class variations.
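
To make the single-prompt limitation concrete, the toy sketch below compares a textual prior built from one text embedding with one aggregated over several. It is only an illustration under stated assumptions: the function names (`cosine`, `multi_text_prior`), the 2-D toy vectors, and the mean/max aggregation are hypothetical choices, not the MTPR module itself, whose details are not given in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def multi_text_prior(pixel_feats, text_embeds, reduce="mean"):
    """Return one activation score per pixel, aggregated over all text embeddings.

    ASSUMPTION: mean or max over per-text cosine similarities; the paper's
    actual aggregation may differ.
    """
    prior = []
    for feat in pixel_feats:
        sims = [cosine(feat, t) for t in text_embeds]
        prior.append(max(sims) if reduce == "max" else sum(sims) / len(sims))
    return prior

# Toy data (hypothetical): three "pixels" and two text embeddings that describe
# different parts of the same class.
pixel_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
single_text = [[1.0, 0.0]]
multi_text = [[1.0, 0.0], [0.0, 1.0]]

single_prior = multi_text_prior(pixel_feats, single_text)
multi_prior = multi_text_prior(pixel_feats, multi_text, reduce="max")
print(single_prior)
print(multi_prior)
```

In this toy setup the second pixel is ignored by the single prompt but activated once a second, complementary description is added, which mirrors the "incomplete activation" argument above.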

Paper Structure

This paper contains 28 sections, 18 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Comparisons of our MTGNet with previous single-text-based FSS methods. (a) Activation maps. Compared to single-text methods (middle row), which highlight only the most distinctive parts of the target object, our multi-text strategy (bottom row) activates a broader and more complete semantic region. (b) Previous single-text FSS methods. (c) Our proposed multi-text FSS method (MTGNet).
  • Figure 2: Qualitative comparison of visual priors. Each row (from top to bottom) represents: support images with ground-truth (GT) masks overlaid in blue, query images with GT masks in purple, visual priors $P_{qs}$ generated by previous methods, and the refined visual priors $P_{qs}^{ref}$ produced by our proposed MTGNet, which demonstrate more complete and accurate foreground activation.
  • Figure 3: Overview of the proposed MTGNet pipeline. MTGNet adopts a dual-branch prior refinement strategy. In the textual branch, the Multi-Textual Prior Refinement (MTPR) module refines $P_{qt}^{init}$ into $P_{qt}^{glb}$ and $P_{qt}^{acc}$ via threshold-based propagation and multi-text aggregation. In the visual branch, the Text Anchor Feature Fusion (TAFF) module extracts regional prototypes from $F_s$ guided by $P_{st}^{init}$ and aligns them with $F_q$ via $P_{qt}^{init}$. The Foreground Confidence-Weighted Attention (FCWA) module then generates a robust visual prior $P_{qs}^{ref}$. Finally, the refined priors $P_{qt}^{glb}$, $P_{qt}^{acc}$, and $P_{qs}^{ref}$ are fused and decoded by an HDMNet-based decoder [peng2023hierarchical] to produce the final segmentation map $P_{final}$.
  • Figure 4: t-SNE [van2008visualizing] visualization of the embeddings of the constructed textual descriptions.
  • Figure 5: The structure of our proposed Text Anchor Feature Fusion (TAFF) module.
  • ...and 5 more figures
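
The FCWA step named in the Figure 3 caption is described in the abstract as using self-similarity within support foreground features to down-weight inconsistent regions. A minimal sketch of that idea follows. It rests on assumptions: the names `foreground_confidence` and `weighted_visual_prior` are hypothetical, and a plain confidence-weighted average of cosine similarities stands in for the paper's attention mechanism, which this summary does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def foreground_confidence(fg_feats):
    """Confidence of each support foreground feature.

    ASSUMPTION: confidence = mean cosine similarity to the other foreground
    features, clipped at zero, so features that disagree with the rest score low.
    """
    conf = []
    for i, feat in enumerate(fg_feats):
        others = [cosine(feat, g) for j, g in enumerate(fg_feats) if j != i]
        mean_sim = sum(others) / len(others) if others else 1.0
        conf.append(max(mean_sim, 0.0))
    return conf

def weighted_visual_prior(query_feats, fg_feats):
    """Confidence-weighted average similarity of each query pixel to the support foreground."""
    conf = foreground_confidence(fg_feats)
    total = sum(conf)
    if total == 0:
        return [0.0] * len(query_feats)
    weights = [c / total for c in conf]
    return [
        sum(w * cosine(q, f) for w, f in zip(weights, fg_feats))
        for q in query_feats
    ]

# Toy data (hypothetical): two consistent foreground features and one noisy outlier.
support_fg = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [[1.0, 0.0], [0.0, 1.0]]

conf = foreground_confidence(support_fg)
prior = weighted_visual_prior(query, support_fg)
print(conf)
print(prior)
```

Here the outlier receives the smallest confidence, so the query pixel that only resembles the outlier gets a much weaker prior than the pixel that matches the consistent foreground features.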