Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Jin Wang, Bingfeng Zhang, Jian Pang, Honglong Chen, Weifeng Liu

TL;DR

This work addresses a limitation of few-shot segmentation: priors computed from ImageNet-pretrained features bias localization toward base classes. It introduces PI-CLIP, a training-free framework that uses CLIP's visual-text and visual-visual alignment to generate more accurate and more general priors for segmentation. The approach comprises three components: Visual-Text Prior (VTP) for precise, text-guided localization; Visual-Visual Prior (VVP) for broader, cross-image matching; and Prior Information Refinement (PIR), which uses a high-order attention matrix derived from CLIP to refine VTP and preserve global structure. Results on PASCAL-5$^{i}$ and COCO-20$^{i}$ show state-of-the-art performance in both the 1-shot and 5-shot settings, and the priors generalize across baseline few-shot models.

Abstract

Few-shot segmentation remains challenging because labeling information for unseen classes is limited. Most previous approaches extract high-level feature maps from a frozen visual encoder and compute pixel-wise similarity as key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps have an obvious category bias. In this work, we propose to replace the visual prior representation with one based on visual-text alignment, to capture more reliable guidance and enhance model generalization. Specifically, we design two training-free prior information generation strategies that use the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. In addition, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and use it to refine the initial prior information. Experiments on both the PASCAL-5$^{i}$ and COCO-20$^{i}$ datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.

Paper Structure (15 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison of prior information. (a) Support images with ground-truth masks; (b) Query images with ground-truth masks; (c) Prior information from previous approaches, generated from frozen ImageNet [deng2009imagenet] weights, which is biased towards some classes, such as the 'Person' class; (d) Our prior information, generated using the text and visual alignment ability of the frozen CLIP model. Our prior information is finer-grained and mitigates the class bias.
  • Figure 2: Overview of our proposed PI-CLIP for few-shot segmentation. We design a group of text prompts for a given class to draw more attention to target regions. The VTP module generates the visual-text prior information by aligning the visual and text information with the help of softmax-GradCAM. The VVP module generates the visual-visual prior information through a pixel-level similarity calculation. The PIR module refines the coarse initial prior information. Finally, the original prior information in the existing few-shot model is directly replaced by VVP and the refined VTP, and the decoder then generates the final prediction.
  • Figure 3: Qualitative results of the proposed PI-CLIP and the baseline (HDMNet [peng2023hierarchical]) under the 1-shot setting. From top to bottom, the rows show the support images with ground-truth (GT) masks (green), query images with GT masks (blue), baseline results (red), and our results (yellow).
  • Figure 4: Visualization of the different prior information generated by our proposed method. The examples on the left are sampled from PASCAL-5$^{i}$ [shaban2017one] and those on the right from COCO-20$^{i}$ [nguyen2019feature]. From top to bottom, the rows show the query image, initial visual-visual prior information, refined visual-visual prior information, initial visual-text prior information, and refined visual-text prior information. $P_{vv}$ covers more general localization regions, whereas $P_{vt}$ focuses on more local target regions. Refinement with the designed high-order matrix yields more accurate prior information.
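
The pixel-level similarity calculation that the Figure 2 caption attributes to the VVP module can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation: the function name `visual_visual_prior`, the use of cosine similarity, the max over support foreground pixels, and the min-max normalization are all assumptions, chosen because they are a common way to build a similarity prior from frozen encoder features.

```python
import numpy as np


def visual_visual_prior(query_feat, support_feat, support_mask, eps=1e-7):
    """Hypothetical pixel-level similarity prior (assumed formulation).

    query_feat:   (C, Hq, Wq) features of the query image.
    support_feat: (C, Hs, Ws) features of the support image.
    support_mask: (Hs, Ws) binary foreground mask at feature resolution.
    Returns an (Hq, Wq) map scaled to [0, 1].
    """
    c, hq, wq = query_feat.shape
    q = query_feat.reshape(c, -1)                      # (C, Nq)
    s = support_feat.reshape(c, -1)                    # (C, Ns)
    fg = support_mask.reshape(-1) > 0                  # foreground selector
    if not fg.any():
        # No foreground in the support mask: return an uninformative prior.
        return np.zeros((hq, wq), dtype=query_feat.dtype)
    s = s[:, fg]                                       # (C, Nfg)

    # L2-normalize each pixel's feature vector so dot products are cosines.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)

    sim = q.T @ s                                      # (Nq, Nfg) cosine similarities
    prior = sim.max(axis=1)                            # best match per query pixel

    # Min-max normalize so the prior can be used as a soft localization map.
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(hq, wq)
```

In this sketch, a query pixel receives a high prior value when at least one support foreground pixel has a similar feature vector, which is consistent with the caption's description of $P_{vv}$ as covering more general regions than the text-guided $P_{vt}$.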