Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation

Shijie Chang, Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu

TL;DR

This work reframes few-shot segmentation by systematically analyzing seven guidance patterns and introducing UniFSS, a universal vision-language framework that fuses visual and textual cues via CLIP embeddings. Its four core components are Visual-Textual Correlation (VTC), a High-level Spatial Correction Unit (HSCU), Multi-Scale Correlation Aggregation (MSCA), and a cross-modal decoder with an Embedding Interactive Unit (EIU); together they enable cross-modal matching across image, mask, box, and text guidance. Experiments on PASCAL-$5^i$, COCO-$20^i$, FSS-1000, and iSAID-$5^i$ show state-of-the-art performance across the seven task patterns, with weakly annotated box guidance sometimes surpassing finely annotated masks. The approach highlights the practical value of flexible, multi-granularity prompts and lays groundwork for unified, prompt-driven segmentation models.

Abstract

Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the image-mask pair in the support set has become the default setting. However, guidance types such as image, text, box, and mask can all provide valuable information about the objects' context, class, localization, and shape appearance. Existing work focuses on specific combinations of guidance, leading FSS into different research branches. Rethinking guidance types in FSS is expected to explore the efficient joint representation of the coupling between the support set and query set, giving rise to research trends in weakly or strongly annotated guidance to meet the customized requirements of practical users. In this work, we provide the generalized FSS with seven guidance paradigms and develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image. Leveraging the advantages of large-scale pre-trained vision-language models in textual and visual embeddings, UniFSS introduces high-level spatial correction and embedding interactive units to overcome the semantic ambiguity typically encountered by pure visual matching methods when facing intra-class appearance diversity. Extensive experiments show that UniFSS significantly outperforms state-of-the-art methods. Notably, the weakly annotated class-aware box paradigm even surpasses the finely annotated mask paradigm.
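The abstract does not spell out how guidance is matched against the query, so the following is only a minimal sketch of one common way to do it, not the UniFSS implementation. It assumes that a box is rasterised into a coarse binary mask, that a support prototype is the masked average of support features, and that correlation is plain cosine similarity against either that prototype or a text embedding. All function names, the box convention `(top, left, bottom, right)`, and the toy values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def box_to_mask(height, width, box):
    """Rasterise a box (top, left, bottom, right), end-exclusive, into a binary mask.
    Assumption: box guidance is treated as a coarse mask."""
    top, left, bottom, right = box
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)] for r in range(height)]

def masked_average(features, mask):
    """Average the feature vectors at positions where mask == 1 (a support prototype)."""
    dim = len(features[0][0])
    total, count = [0.0] * dim, 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total = [t + x for t, x in zip(total, vec)]
                count += 1
    return [t / count for t in total] if count else total

def correlation_map(query_features, embedding):
    """Cosine similarity of every query location with one guidance embedding."""
    return [[cosine(vec, embedding) for vec in row] for row in query_features]

# Toy 2x2 feature maps with 2-D features (hypothetical values).
support = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 0.0], [0.0, 1.0]]]
query = [[[2.0, 0.0], [0.0, 3.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
text_embedding = [1.0, 0.0]  # stand-in for a text-encoder output

mask = box_to_mask(2, 2, (0, 0, 2, 1))            # box covering the left column
prototype = masked_average(support, mask)         # visual guidance
visual_corr = correlation_map(query, prototype)   # query vs. support prototype
textual_corr = correlation_map(query, text_embedding)  # query vs. class text
```

In this toy setup the visual and textual maps coincide because the prototype and the stand-in text embedding point in the same direction. With real encoders the two maps would generally differ, which is where combining both cues could help when objects of the same class look very different.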

Paper Structure (21 sections, 5 equations, 7 figures, 9 tables)

This paper contains 21 sections, 5 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: All task patterns of few-shot segmentation corresponding to different combinations between image $I$, mask $M$, box $B$, and text $T$, and their mIoU performance comparison on PASCAL-$5^i$ [shaban2017one]. Results of previous works are from: [min2021hypercorrelation] for ①, [moon2023msi] for ②, [shi2022dense] for ③, [shuai2023pgmanet] for {④, ⑤, ⑥}, and [liu2023delving] for ⑦. The performance results are rounded for easier representation.
  • Figure 2: Overview of the proposed UniFSS framework. The proposed framework consists of four components, i.e., 1) visual and textual correlation (Sec. \ref{sec:corr_computation}), 2) high-level spatial correction unit (Sec. \ref{sec:hscu}), 3) multi-scale correlation aggregation unit (Sec. \ref{sec:ms_corragg}), and 4) cross-modal decoder (Sec. \ref{sec:decoder}).
  • Figure 3: Details of our cross-modal decoder with the proposed embedding interactive unit (EIU). "Up": bilinear interpolation. "$\otimes$": Hadamard product.
  • Figure 4: 7 task patterns with different guidance types. For simplicity, we omit the text information in the guidance types of class-aware FSS.
  • Figure 5: Qualitative comparison with the baseline on some samples from PASCAL-$5^i$ (Left) and COCO-$20^i$ (Right). UniFSS can better overcome the intra-class diversities.
  • ...and 2 more figures