Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation
Balamurali Murugesan, Rukhshanda Hussain, Rajarshi Bhattacharya, Ismail Ben Ayed, Jose Dolz
TL;DR
This work investigates prompting as a learning signal for weakly supervised semantic segmentation (WSSS) by analyzing how text prompts in vision-language models influence class activation maps (CAMs). The authors show that substituting the input class token, rather than optimizing the prompt context, yields larger CAM improvements, and that the ground-truth class name is not always the prompt most correlated with a given image. They introduce POLE, a simple method that selects the most correlated class name per image and adds lightweight adaptors to refine the multimodal embeddings, achieving state-of-the-art results on Pascal VOC 2012. The approach draws candidate prompts from synonyms of the class name collected from diverse sources, showing that semantically related prompts can improve pseudo-label quality and segmentation accuracy without heavy annotation costs. Overall, POLE demonstrates the potential of prompt learning for WSSS and highlights the importance of prompt design in vision-language fine-tuning.
Abstract
Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot learning tasks, fueled by the power of contrastive language-vision pre-training. In particular, prompt tuning has emerged as an effective strategy to adapt pre-trained language-vision models to downstream tasks by employing task-related textual tokens. Motivated by this progress, in this work we ask whether other fundamental problems, such as weakly supervised semantic segmentation (WSSS), can benefit from prompt tuning. Our findings reveal two observations that shed light on the impact of prompt tuning on WSSS. First, modifying only the class token of the text prompt has a greater impact on the Class Activation Map (CAM) than arguably more complex strategies that optimize the context. Second, the class token associated with the image ground truth does not necessarily correspond to the category that yields the best CAM. Motivated by these observations, we introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy. Through extensive experiments, we demonstrate that our simple yet efficient approach achieves state-of-the-art (SOTA) performance on a well-known WSSS benchmark. These results highlight not only the benefits of language-vision models in WSSS but also the potential of prompt learning for this problem. The code is available at https://github.com/rB080/WSS_POLE.
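To make the second observation concrete, the following is a minimal sketch of per-image class-name selection: among several candidate names for a class, pick the one whose text embedding is most similar to the image embedding. It is an illustration under stated assumptions, not the authors' implementation. The function names, the candidate names, and the three-dimensional toy vectors are hypothetical stand-ins for embeddings that a vision-language model such as CLIP would produce, and cosine similarity is assumed as the correlation measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_class_prompt(image_emb, candidates):
    """Return the candidate name most similar to the image, plus all scores.

    `candidates` maps each candidate class name (hypothetical synonyms)
    to its text embedding. Real embeddings would come from a
    vision-language model; here they are toy vectors.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins: the ground-truth name is "person", but a synonym
# happens to align better with this particular image embedding.
image_emb = [0.2, 0.9, 0.1]
candidates = {
    "person": [1.0, 0.0, 0.0],
    "human": [0.0, 1.0, 0.0],
    "rider": [0.0, 0.0, 1.0],
}

best_name, scores = select_class_prompt(image_emb, candidates)
print(best_name)
```

In this toy setup the selected name differs from the ground-truth label, which mirrors the observation that the ground-truth class token is not always the one best matched to an image. The selected name would then replace the class token in the text prompt used to compute the CAM.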
