Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation
Jin Wang, Bingfeng Zhang, Jian Pang, Honglong Chen, Weifeng Liu
TL;DR
This work addresses a weakness of few-shot segmentation methods that rely on ImageNet-pretrained priors, which bias localization toward base classes. It introduces PI-CLIP, a training-free framework that uses CLIP's visual-text and visual-visual alignment to generate more accurate and more general priors for segmentation. The approach has three components: a Visual-Text Prior (VTP) for precise, text-guided localization; a Visual-Visual Prior (VVP) for broader, cross-image matching; and Prior Information Refinement (PIR), which uses a high-order attention matrix derived from CLIP to refine the VTP and preserve global structure. On PASCAL-5i and COCO-20i, the method reports state-of-the-art results in both 1-shot and 5-shot settings and generalizes well across baselines, indicating that CLIP-derived priors can substantially improve few-shot segmentation without additional training.
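To make the idea of a text-guided prior concrete, the sketch below scores each image patch against a target-class text embedding and a background text embedding, then keeps the target probability as a coarse localization map. It is an illustration only, not the VTP procedure from the paper: the function name `text_guided_prior`, the two-row text embedding layout, the temperature value, and the assumption that patch embeddings already live in the same space as the text embeddings are all hypothetical.

```python
import numpy as np

def text_guided_prior(patch_feat, text_emb, temperature=0.01):
    """Coarse text-guided prior (illustrative sketch, not the paper's VTP).

    patch_feat: (N, D) patch embeddings, assumed to share the text embedding space.
    text_emb:   (2, D) text embeddings; row 0 = target class, row 1 = background.
    Returns an (N,) array of target probabilities in [0, 1].
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = patch_feat / np.linalg.norm(patch_feat, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Cosine similarity of every patch to each text prompt, sharpened by temperature.
    logits = (p @ t.T) / temperature

    # Numerically stable softmax over the two prompts.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)

    # Probability assigned to the target class is the prior value per patch.
    return probs[:, 0]

# Toy example: patch 0 aligns with the target prompt, patch 1 with the background.
patches = np.array([[1.0, 0.0], [0.0, 1.0]])
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
prior = text_guided_prior(patches, prompts)
```

Reshaping the returned vector to the patch grid would give a 2-D prior map; how the embeddings are obtained from CLIP is deliberately left outside this sketch.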
Abstract
Few-shot segmentation remains challenging because labeled information for unseen classes is limited. Most previous approaches extract high-level feature maps from a frozen visual encoder and compute pixel-wise similarity as key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps carry obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance model generalization. Specifically, we design two kinds of training-free prior information generation strategies that use the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. In addition, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and use it to refine the initial prior information. Experiments on both the PASCAL-5i and COCO-20i datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
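For readers unfamiliar with the conventional prior that the abstract argues against, the following is a minimal sketch of a pixel-wise similarity prior computed from frozen-encoder features. It assumes features have already been extracted and flattened; the function name `visual_prior`, the argument shapes, and the max-then-min-max-normalize recipe are illustrative assumptions about how such priors are commonly built, not a reproduction of any specific baseline.

```python
import numpy as np

def visual_prior(query_feat, support_feat, support_mask, eps=1e-7):
    """Pixel-wise similarity prior from frozen-encoder features (illustrative).

    query_feat:   (Nq, D) high-level features of the query image.
    support_feat: (Ns, D) high-level features of the support image.
    support_mask: (Ns,)   binary foreground mask of the support image.
    Returns an (Nq,) prior in [0, 1].
    """
    # Zero out background support pixels so only the target region contributes.
    fg = support_feat * support_mask[:, None]

    # Cosine similarity between every query pixel and every support pixel.
    q = query_feat / (np.linalg.norm(query_feat, axis=1, keepdims=True) + eps)
    s = fg / (np.linalg.norm(fg, axis=1, keepdims=True) + eps)
    sim = q @ s.T  # shape (Nq, Ns)

    # Each query pixel keeps its best match among the support pixels.
    prior = sim.max(axis=1)

    # Min-max normalize to [0, 1] so the decoder receives a bounded map.
    return (prior - prior.min()) / (prior.max() - prior.min() + eps)

# Toy example: only the first query pixel resembles the support foreground.
query = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
support = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.array([1.0, 0.0])
prior_map = visual_prior(query, support, mask)
```

Because the similarity is computed entirely in the encoder's feature space, any category bias in those features carries over directly into the prior, which is the limitation the abstract describes.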
