Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation

Balamurali Murugesan, Rukhshanda Hussain, Rajarshi Bhattacharya, Ismail Ben Ayed, Jose Dolz

TL;DR

This work investigates prompting as a learning signal for weakly supervised semantic segmentation by analyzing how text prompts in vision–language models influence class activation maps. The authors reveal that substituting the input class token, rather than optimizing the prompt context, yields larger CAM improvements, and that the ground-truth class is not always the most correlated prompt for a given image. They introduce POLE, a simple yet effective method that selects the most correlated class name per image and augments it with lightweight adaptors to refine multimodal embeddings, achieving state-of-the-art results on Pascal VOC 2012. The approach leverages synonym-based prompts from diverse sources, showing that carefully designed, semantically related prompts can significantly boost pseudo-label quality and segmentation accuracy without heavy annotation costs. Overall, POLE demonstrates the strong potential of prompt learning for WSSS and highlights the importance of prompt design in vision-language fine-tuning.

Abstract

Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot learning tasks, fueled by the power of contrastive language-vision pre-training. In particular, prompt tuning has emerged as an effective strategy to adapt the pre-trained language-vision models to downstream tasks by employing task-related textual tokens. Motivated by this progress, in this work we question whether other fundamental problems, such as weakly supervised semantic segmentation (WSSS), can benefit from prompt tuning. Our findings reveal two interesting observations that shed light on the impact of prompt tuning on WSSS. First, modifying only the class token of the text prompt results in a greater impact on the Class Activation Map (CAM), compared to arguably more complex strategies that optimize the context. And second, the class token associated with the image ground truth does not necessarily correspond to the category that yields the best CAM. Motivated by these observations, we introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy. Through extensive experiments we demonstrate that our simple, yet efficient approach achieves SOTA performance in a well-known WSSS benchmark. These results highlight not only the benefits of language-vision models in WSSS but also the potential of prompt learning for this problem. The code is available at https://github.com/rB080/WSS_POLE.
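The class-selection idea described above (choose the class name most correlated with the image, which need not be the ground-truth label) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `cosine` and `select_class_prompt` are hypothetical, cosine similarity is assumed as the correlation measure, and the embeddings are hand-written toy vectors standing in for CLIP image and text features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_class_prompt(image_emb, candidate_embs):
    """Pick the candidate class name whose text embedding best matches the image.

    image_emb: image embedding (assumed to come from a visual encoder).
    candidate_embs: dict mapping each candidate name (the ground-truth
        class plus its synonyms) to its text embedding.
    Returns the best name and the full similarity table.
    """
    sims = {name: cosine(image_emb, emb) for name, emb in candidate_embs.items()}
    best = max(sims, key=sims.get)
    return best, sims


# Toy vectors (assumption): "boat" is the ground-truth label, the others are synonyms.
image_emb = [1.0, 0.2, 0.0]
candidate_embs = {
    "boat": [0.5, 1.0, 0.0],
    "ship": [2.0, 0.5, 0.1],
    "vessel": [0.0, 0.0, 1.0],
}
best_name, sims = select_class_prompt(image_emb, candidate_embs)
print(best_name)
```

In this toy case the synonym "ship" scores higher than the ground-truth name "boat", mirroring the second observation in the abstract.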

Paper Structure (14 sections, 7 equations, 8 figures, 5 tables)

This paper contains 14 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Impact of the input text prompt on the generation of class activation maps (CAMs). Employing the ground truth categorical label as the [CLS] token (second column) does not necessarily result in the best initial CAMs. Furthermore, even though complex techniques that optimize the [CTX] tokens, such as CoOp zhou2022conditional (third column), may improve the CAMs, we have observed that simply replacing the ground truth class in the [CLS] token with a more highly correlated synonym leads to improvements in the identified class-related regions (fourth column).
  • Figure 2: Proposed Weakly Supervised Segmentation approach. 1) Class activation maps are generated for an input image $\mathbf{X}$. 2) CLIP pre-trained visual and text encoders ($f_{\theta}$ and $f_{\theta}$) are leveraged to find the category name [CLS] presenting the highest correlation with the image $\mathbf{M}_k$, which is the result of multiplying the input image $\mathbf{X}$ by its corresponding CAM $\mathbf{P}_k$. 3) With the [CLS] token selected, we generate the input text prompt $\textbf{t}^o_{kb}$ to the Cross-Language Image Matching (CLIMS) learning framework.
  • Figure 3: Impact of the corpus choice and the number of synonyms selected. ChatGPT offers the richest variety of synonyms, yielding the best results among the corpora compared. Furthermore, increasing the number of synonyms (from $2$ up to $4$) further improves the results. Note that the number of synonyms includes the categorical name from the ground truth and the requested close synonyms.
  • Figure 4: What does CLIP think about the best [CLS]? Is the ground truth category chosen every time? How likely is it that CLIP will select something different? The plot summarises, for each class, the percentage of instances for which the ground truth category was chosen. Thus, an inward point on the radial plot indicates that the ground truth category was chosen as the best [CLS] token for only a small share of that class's instances.
  • Figure 5: Qualitative results of the initial class activation maps. Green dotted ellipses indicate regions missed by previous approaches (original CAMs and CLIMS xie2022clims) compared to the proposed method. No refinement (e.g., RW) is applied to the obtained CAMs, to better illustrate the impact of our approach.
  • ...and 3 more figures
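
Step 2 of the Figure 2 caption builds the masked image $\mathbf{M}_k$ by multiplying the input image $\mathbf{X}$ by its CAM $\mathbf{P}_k$. A minimal sketch of that element-wise product is given below, assuming the CAM has the same spatial size as the image, holds values in $[0, 1]$, and is broadcast over the colour channels. The function name `mask_image_with_cam` and the toy arrays are hypothetical and use plain nested lists instead of tensors.

```python
def mask_image_with_cam(image, cam):
    """Element-wise product of an image with a class activation map.

    image: H x W x C nested lists of pixel values.
    cam:   H x W nested lists of activations, assumed to lie in [0, 1].
    Every channel of a pixel is scaled by the CAM value at that location.
    """
    same_height = len(image) == len(cam)
    same_width = all(len(r) == len(c) for r, c in zip(image, cam))
    if not (same_height and same_width):
        raise ValueError("image and CAM must share the same height and width")
    return [
        [[value * weight for value in pixel] for pixel, weight in zip(img_row, cam_row)]
        for img_row, cam_row in zip(image, cam)
    ]


# Toy 2 x 2 RGB image of ones and a toy CAM (both assumptions for illustration).
image = [[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(2)]
cam = [[0.0, 0.5], [1.0, 0.25]]
masked = mask_image_with_cam(image, cam)
print(masked[0][1])
```

Pixels where the CAM is close to zero are suppressed, so the masked image passed to the visual encoder mostly retains the regions the CAM attributes to class $k$.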