Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation

Wenhao Xu, Rongtao Xu, Changwei Wang, Shibiao Xu, Li Guo, Man Zhang, Xiaopeng Zhang

TL;DR

SPT-SEG is a one-stage CLIP-based framework for zero-shot semantic segmentation that explicitly leverages spectral information. It introduces Spectral Prompt Tuning in the shallow visual layers to inject structure-aware cues and a Spectral Guided Decoder that combines high- and low-frequency features to sharpen pixel-level predictions for unseen classes. On VOC 2012 and COCO-Stuff 164K, SPT-SEG consistently outperforms prior methods, with marked gains in unseen-class IoU, and ablations show the importance of early-layer prompts and a multi-layer frequency-guided decoder. The approach offers a practical, efficient solution for dense zero-shot segmentation, reducing reliance on proposal generation while enhancing cross-modal alignment between text and pixels.

Abstract

Recently, CLIP has found practical utility in pixel-level zero-shot segmentation tasks. Existing two-stage methods suffer from intricate pipelines and elevated computational costs. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Tuning (VPT) to preserve CLIP's generalization capacity, they still fall short of fully harnessing CLIP's potential for delineating unseen classes at the pixel level and producing precise pixel predictions. To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel. Specifically, we first introduce Spectral Prompt Tuning (SPT), which incorporates spectral prompts into the shallow layers of the CLIP visual encoder to capture the structural intricacies of images, thereby enhancing comprehension of unseen classes. We then introduce the Spectral Guided Decoder (SGD), which uses both high- and low-frequency information to steer the network's spatial focus towards more prominent classification features, enabling precise pixel-level predictions. Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches: it performs well across all classes and particularly excels on unseen classes. Code is available at: https://github.com/clearxu/SPT.
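To make "high- and low-frequency information" concrete, the following is a minimal sketch of one common way to split a feature map into a low-frequency and a high-frequency part: mask its 2D Fourier spectrum and invert each half. This is only an illustration under stated assumptions. The abstract does not specify which filter SPT-SEG uses, and the names `dft2`, `split_frequencies`, and the `radius` cutoff are hypothetical. A naive pure-Python DFT is used so the example has no dependencies; a real implementation would use an FFT library.

```python
import cmath


def dft2(grid, inverse=False):
    """Naive 2D discrete Fourier transform (or its inverse) of a list-of-lists grid."""
    h, w = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = 2 * cmath.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(sign * 1j * angle)
            # The inverse transform carries the 1/(h*w) normalisation.
            out[u][v] = total / (h * w) if inverse else total
    return out


def split_frequencies(img, radius):
    """Split img into (low, high) parts using a circular mask in the frequency domain.

    `radius` is a hypothetical cutoff: bins whose wrapped frequency index lies
    within `radius` of the DC bin go to the low band, all others to the high band.
    """
    h, w = len(img), len(img[0])
    spec = dft2(img)
    low_spec = [[0j] * w for _ in range(h)]
    high_spec = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            # Wrapped distance from the zero-frequency (DC) bin.
            fu, fv = min(u, h - u), min(v, w - v)
            target = low_spec if fu * fu + fv * fv <= radius * radius else high_spec
            target[u][v] = spec[u][v]
    # The mask is symmetric, so both inverse transforms are real up to rounding.
    low = [[c.real for c in row] for row in dft2(low_spec, inverse=True)]
    high = [[c.real for c in row] for row in dft2(high_spec, inverse=True)]
    return low, high


# Usage: a constant image is purely low-frequency; a checkerboard is purely high-frequency.
flat = [[3.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (x + y) for x in range(4)] for y in range(4)]
flat_low, flat_high = split_frequencies(flat, radius=1)
checker_low, checker_high = split_frequencies(checker, radius=1)
```

Because the two masks partition the spectrum, the low and high parts sum back to the input. Smooth regions end up in the low band, while edges and fine texture end up in the high band.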

Paper Structure (20 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: (a) Our SPT-SEG method demonstrates outstanding performance across all classes. (b) While yielding favorable results within the seen classes, it exhibits relatively poorer performance in the unseen classes. (c) Its performance is unsatisfactory across all classes.
  • Figure 2: Overview of our proposed SPT-SEG. The main contribution of our work lies in two simple but effective designs (marked a and b in red in the figure): (a) Spectral prompt tuning, which adds learnable spectral prompts to the first two layers of CLIP's visual encoder; (b) Spectral guided decoder, which uses high- and low-frequency feature information to guide the matching of text to pixels and decodes the predicted results.
  • Figure 3: Overview of our proposed Spectral Prompt Tuning. During training on downstream tasks, only the parameters of the prompts and the linear head are updated, while the whole Transformer encoder is frozen.
  • Figure 4: Qualitative results on COCO-Stuff 164K. (a) Original test images; (b) ground truth for each image; (c) results of ZegCLIP; (d) results of our proposed SPT-SEG. Prominent regions are highlighted with yellow arrows, and other significant areas are marked with yellow stars.
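
The captions for Figures 2 and 3 describe prompts that are added only to the first two encoder layers while the encoder itself stays frozen. The sketch below illustrates just that token-level mechanism, under several assumptions: `frozen_layer` is a toy stand-in for a Transformer block, the prompts are plain vectors (how SPT-SEG derives their spectral content is not described in this excerpt), and prompt outputs are dropped after each layer in the style of deep visual prompt tuning. All function and parameter names are hypothetical.

```python
def frozen_layer(tokens):
    """Toy stand-in for a frozen Transformer block: mix each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def encode_with_prompts(patch_tokens, cls_token, prompts_per_layer, num_layers):
    """Run `num_layers` frozen blocks, inserting prompts only where configured.

    `prompts_per_layer` maps a layer index to a list of prompt vectors. In a
    real setup these would be the only trainable encoder-side parameters.
    """
    tokens = [cls_token] + patch_tokens
    for layer_idx in range(num_layers):
        prompts = prompts_per_layer.get(layer_idx, [])
        if prompts:
            # Sequence layout for this layer: [CLS] + prompts + patch tokens.
            tokens = [tokens[0]] + prompts + tokens[1:]
        tokens = frozen_layer(tokens)
        if prompts:
            # Discard prompt outputs so the next layer sees only its own prompts.
            tokens = [tokens[0]] + tokens[1 + len(prompts):]
    return tokens


# Usage: prompts at layers 0 and 1 only (the "first two layers"), three layers total.
patches = [[1.0, 0.0], [0.0, 1.0]]
cls = [0.0, 0.0]
shallow_prompts = {0: [[2.0, 2.0]], 1: [[-1.0, 3.0]]}
prompted = encode_with_prompts(patches, cls, shallow_prompts, num_layers=3)
baseline = encode_with_prompts(patches, cls, {}, num_layers=3)
```

The output keeps the original sequence length (one class token plus the patch tokens), yet its values differ from the prompt-free baseline. The prompts change the features the later frozen layers see without changing any encoder weights.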