Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Jimin Xiao, Fei Ma, Renrong Ouyang

TL;DR

Adaptive Patch Contrast (APC) is a ViT-based WSSS method that improves patch embedding learning for better segmentation. It also converts the existing CAM-free multi-stage training framework into an end-to-end single-stage approach, which improves training efficiency.

Abstract

Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework uses image-level labels as training data to generate pixel-level pseudo-labels, which are then refined. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels compared to CNN methods, particularly in recognizing complete object regions. However, current ViT-based approaches have two limitations: their use of patch embeddings is prone to being dominated by certain abnormal patches, and many multi-stage methods are time-consuming to train and therefore inefficient. In this paper, we introduce a novel ViT-based WSSS method named Adaptive Patch Contrast (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC utilizes an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, thereby further improving the final results. Furthermore, we improve upon the existing multi-stage training framework without CAM by transforming it into an end-to-end single-stage training approach, thereby enhancing training efficiency. The experimental results show that our approach is effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets within a shorter training duration.

Paper Structure (17 sections, 7 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: In the case of predicting the specific category 'horse': (a) The previous ViT-based method (Dosovitskiy et al., 2020) uses the highest-scoring patch. (b) Our APC uses adaptive $K$ patches and patch contrastive learning to enhance patch embeddings for image-level classification.
  • Figure 2: Comparison of the basic structure of a multi-stage WSSS method (light blue) with our APC method (light gray). APC, based on the ViT framework without CAM, allows for a single-stage approach in which final labels for verification are obtained directly, without an additional semantic segmentation model (e.g., DeepLab (Chen et al., 2017)).
  • Figure 3: APC architecture: a ViT encodes patch embeddings, which are refined by a BiLSTM. An MLP and softmax produce patch-to-classifier predictions. The AKP module maps patch classifiers to an image classifier using image-level ground truth for supervision. After the Refined Encoder, the PCL module enhances patch similarity.
  • Figure 4: The performance comparison of selecting different values of $\theta$ and $\epsilon$.
  • Figure 5: The performance comparison of selecting different values of $\lambda_1$ and $\lambda_2$.
  • ...and 2 more figures
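
The contrast drawn in Figure 1, between pooling from the single highest-scoring patch and pooling from an adaptive number $K$ of patches, can be illustrated with a small sketch. The rule used here to pick $K$ (count the patches whose class score reaches a fixed threshold, with a floor of one) is an assumption made for illustration only; this excerpt does not specify how AKP chooses $K$, and the function names, the `threshold` and `min_k` parameters, and the toy scores are all hypothetical.

```python
from typing import List


def max_pool_scores(patch_scores: List[List[float]]) -> List[float]:
    """Image-level score per class = the single highest patch score."""
    num_classes = len(patch_scores[0])
    return [max(p[c] for p in patch_scores) for c in range(num_classes)]


def adaptive_k_pool_scores(
    patch_scores: List[List[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
    min_k: int = 1,          # hypothetical floor so at least one patch is used
) -> List[float]:
    """Illustrative adaptive-K pooling (assumed rule, not the paper's).

    For each class, K is the number of patches scoring at least `threshold`
    (never fewer than `min_k`); the image-level score is the mean of the
    top-K patch scores for that class.
    """
    num_classes = len(patch_scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((p[c] for p in patch_scores), reverse=True)
        k = max(min_k, sum(1 for s in column if s >= threshold))
        pooled.append(sum(column[:k]) / k)
    return pooled


# Toy input: 4 patches x 2 classes of per-patch class probabilities.
patch_scores = [
    [0.95, 0.10],
    [0.60, 0.20],
    [0.70, 0.05],
    [0.10, 0.30],
]

print("max pooling:       ", max_pool_scores(patch_scores))
print("adaptive-K pooling:", adaptive_k_pool_scores(patch_scores))
```

With max pooling, one unusually confident patch fully determines the image-level score. Averaging over several high-scoring patches dilutes the influence of a single outlier, which is the kind of abnormal-patch dominance the abstract says AKP is meant to address.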