Multi-Text Guided Few-Shot Semantic Segmentation
Qiang Jiao, Bin Yan, Yi Yang, Mengrui Shi, Qiang Zhang
TL;DR
MTGNet tackles incomplete foreground activation and large intra-class variation in few-shot semantic segmentation by leveraging multiple class-specific textual descriptions and cross-modal refinement. The model introduces three key components: Multi-Textual Prior Refinement (MTPR) to diversify and stabilize textual priors, Text Anchor Feature Fusion (TAFF) to align support and query features via semantic anchors, and Foreground Confidence-Weighted Attention (FCWA) to suppress noisy support regions. The dual-branch framework produces robust textual and visual priors, which are fused through an HDMNet-based decoder, yielding state-of-the-art performance on PASCAL-5^i and COCO-20^i in both the 1-shot and 5-shot settings, with pronounced gains on folds with high intra-class variation. The results indicate that enriching textual semantics and enforcing cross-modal consistency significantly improve few-shot segmentation in challenging scenes.
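To make the idea of a multi-text prior concrete, the following is a minimal sketch and not the MTPR module itself. It assumes per-pixel query features and text embeddings live in a shared embedding space (as with CLIP), and it aggregates the per-description similarities with a simple maximum followed by min-max normalization; both choices, and the names `cosine` and `multi_text_prior`, are illustrative assumptions rather than details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def multi_text_prior(pixel_feats, text_embeds):
    """Hypothetical prior map: for each pixel feature, take the maximum cosine
    similarity over all text embeddings, then min-max normalize to [0, 1].
    Assumes both lists are non-empty."""
    raw = [max(cosine(p, t) for t in text_embeds) for p in pixel_feats]
    lo, hi = min(raw), max(raw)
    if hi - lo < 1e-12:
        # All pixels score the same, so the prior carries no information.
        return [0.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]

# Toy example: three "pixels" in a 2-D embedding space.
pixels = [[1, 0], [0, 1], [-1, 0]]
single_prompt = multi_text_prior(pixels, [[1, 0]])          # [1.0, 0.5, 0.0]
two_prompts = multi_text_prior(pixels, [[1, 0], [0, 1]])    # [1.0, 1.0, 0.0]
```

In this toy case the second pixel is only weakly activated by a single description but fully activated once a second, complementary description is added, which mirrors the incomplete-activation problem described above.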
Abstract
Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation, typically using a single prompt (e.g., "a photo of {class}"). However, these approaches often yield incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, which further degrades visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and mitigating intra-class variation. Furthermore, we present a Foreground Confidence-Weighted Attention (FCWA) module that enhances visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and suppresses their interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5^i and 57.4% on COCO-20^i, with notable improvements on folds exhibiting high intra-class variation.
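The self-similarity weighting described for FCWA can be illustrated with a minimal sketch; it is not the module itself. It assumes each support foreground feature is scored by its mean cosine similarity to the other foreground features, that negative scores are clamped to zero, and that the scores are normalized into weights for a pooled prototype. The function names and the specific scoring rule are hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def foreground_confidence(fg_feats):
    """Hypothetical confidence weights: mean cosine similarity of each
    foreground feature to all the others, clamped at 0 and normalized to
    sum to 1. Assumes fg_feats is non-empty."""
    n = len(fg_feats)
    if n == 1:
        return [1.0]
    scores = []
    for i, f in enumerate(fg_feats):
        sims = [cosine(f, g) for j, g in enumerate(fg_feats) if j != i]
        scores.append(max(0.0, sum(sims) / len(sims)))
    total = sum(scores)
    if total == 0:
        # No feature agrees with any other, so fall back to uniform weights.
        return [1.0 / n] * n
    return [s / total for s in scores]

def weighted_prototype(fg_feats, weights):
    """Confidence-weighted average of the foreground features."""
    dim = len(fg_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, fg_feats)) for d in range(dim)]

# Toy example: two consistent foreground features and one outlier.
support_fg = [[1, 0], [1, 0], [0, 1]]
weights = foreground_confidence(support_fg)          # [0.5, 0.5, 0.0]
prototype = weighted_prototype(support_fg, weights)  # [1.0, 0.0]
```

Here the outlier feature receives zero weight, so the pooled prototype is determined entirely by the two mutually consistent features, which is the intuition behind down-weighting inconsistent support regions.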
