Table of Contents
Fetching ...

Prompt Categories Cluster for Weakly Supervised Semantic Segmentation

Wangyu Wu, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, Jimin Xiao

TL;DR

This work tackles weakly supervised semantic segmentation (WSSS) with image-level labels by addressing semantic ambiguity among similar classes and ViT-induced over-smoothing. It introduces Prompt Categories Clustering (PCC), a ViT-based framework that uses GPT-generated category clusters as learnable tokens, concatenated with patch tokens and refined by HV-BiLSTM; training relies on Top-$k$ pooling and multi-label cross-entropy, with CRF and DeepLabv2 used for final refinement. A key novelty is the self-refining prompt mechanism that derives coherent category clusters and a cluster-token integration that enables sharing information across related categories (e.g., animals). The approach yields state-of-the-art results on PASCAL VOC 2012, improving pseudo-label quality and segmentation accuracy, and demonstrates the practical benefit of language-informed clustering in visual scene understanding.

Abstract

Weakly Supervised Semantic Segmentation (WSSS), which leverages image-level labels, has garnered significant attention due to its cost-effectiveness. The previous methods mainly strengthen the inter-class differences to avoid class semantic ambiguity which may lead to erroneous activation. However, they overlook the positive function of some shared information between similar classes. Categories within the same cluster share some similar features. Allowing the model to recognize these features can further relieve the semantic ambiguity between these classes. To effectively identify and utilize this shared information, in this paper, we introduce a novel WSSS framework called Prompt Categories Clustering (PCC). Specifically, we explore the ability of Large Language Models (LLMs) to derive category clusters through prompts. These clusters effectively represent the intrinsic relationships between categories. By integrating this relational information into the training network, our model is able to better learn the hidden connections between categories. Experimental results demonstrate the effectiveness of our approach, showing its ability to enhance performance on the PASCAL VOC 2012 dataset and surpass existing state-of-the-art methods in WSSS.

Prompt Categories Cluster for Weakly Supervised Semantic Segmentation

TL;DR

This work tackles weakly supervised semantic segmentation (WSSS) with image-level labels by addressing semantic ambiguity among similar classes and ViT-induced over-smoothing. It introduces Prompt Categories Clustering (PCC), a ViT-based framework that uses GPT-generated category clusters as learnable tokens, concatenated with patch tokens and refined by HV-BiLSTM; training relies on Top- pooling and multi-label cross-entropy, with CRF and DeepLabv2 used for final refinement. A key novelty is the self-refining prompt mechanism that derives coherent category clusters and a cluster-token integration that enables sharing information across related categories (e.g., animals). The approach yields state-of-the-art results on PASCAL VOC 2012, improving pseudo-label quality and segmentation accuracy, and demonstrates the practical benefit of language-informed clustering in visual scene understanding.

Abstract

Weakly Supervised Semantic Segmentation (WSSS), which leverages image-level labels, has garnered significant attention due to its cost-effectiveness. The previous methods mainly strengthen the inter-class differences to avoid class semantic ambiguity which may lead to erroneous activation. However, they overlook the positive function of some shared information between similar classes. Categories within the same cluster share some similar features. Allowing the model to recognize these features can further relieve the semantic ambiguity between these classes. To effectively identify and utilize this shared information, in this paper, we introduce a novel WSSS framework called Prompt Categories Clustering (PCC). Specifically, we explore the ability of Large Language Models (LLMs) to derive category clusters through prompts. These clusters effectively represent the intrinsic relationships between categories. By integrating this relational information into the training network, our model is able to better learn the hidden connections between categories. Experimental results demonstrate the effectiveness of our approach, showing its ability to enhance performance on the PASCAL VOC 2012 dataset and surpass existing state-of-the-art methods in WSSS.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) In tradtional ViT-based WSSS method, only the ViT tokens represent image patchs are used for classification. (b) In our PCC, we use GPT to judge the cluster category each image belongs to and generate cluster token that contains shared class information to enhance ViT patch tokens.
  • Figure 2: The overall framework of PCC is as follows: First, ViT is used to generate image patch tokens. A category list, covering all categories in the dataset, is processed by GPT to cluster these categories and determine which clusters the input image belongs to. In cluster vector, 1 indicates belonging to the cluster, 0 indicates not belonging, and then the cluster vector is multiplied by a learnable matrix to obtain a cluster token that contains cluster information. The cluster token is concatenated with the patch tokens and input into a HV-BiLSTM to generate refined tokens. Finally, refined tokens are passed through patch classifier to obtain patch predictions, and we perform top-k pooling on the predictions to obtain image predictions and conduct MCE loss with image GT label.
  • Figure 3: The proposed Prompt strategy.(a) is the refined prompt template; (b) is the initial prompt template; (c) is the samples of category clusters.
  • Figure 4: Visualization of segmentation results on the val set of PASCAL VOC.
  • Figure 5: The sample illustrates probabilities highlighted by pixel probabilities.