Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Rozhan Ahmadi; Shohreh Kasaei

TL;DR

This work tackles weakly supervised semantic segmentation with image-level labels by addressing CAM limitations that arise from relying on local discriminative cues. It introduces SWTformer, a Swin Transformer-based framework that fuses local and global context to produce improved seed CAMs, with two variants: V1 uses patch-token CAMs, and V2 adds hierarchical feature fusion and a background-aware refinement inspired by SIPE. The method uses a multi-task loss consisting of CLS, GSC, and CCL terms to jointly optimize classification, CAM consistency, and class-wise contrast, yielding marked gains in object localization (mAP) and seed-map quality (mIoU) on PASCAL VOC 2012. The contributions include (i) the first hierarchical transformer-based CAM generator for WSSS, (ii) a patch-token CAM approach without class tokens, and (iii) a multi-scale, background-aware refinement that improves cross-object discrimination. The results demonstrate that SWTformer outperforms state-of-the-art transformers on localization and provides competitive seed CAMs, with code available for reproduction.

Abstract

In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Most existing methods address the lack of spatial information in these labels by generating pseudo-labels from class activation maps (CAMs), which then enable supervised learning. Because CNNs detect localized patterns, CAMs often emphasize only the most discriminative parts of an object, making it difficult to distinguish foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, capture the scene layout more effectively than CNN features. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer", which improves the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and uses a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PASCAL VOC 2012 dataset, SWTformer-V1 achieves 0.98% higher mAP in localization accuracy, outperforming state-of-the-art models. It also yields comparable performance in generating initial localization maps, averaging 0.82% mIoU higher than other methods while depending only on the classification network. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further supporting the effectiveness of the local-to-global view provided by the Swin Transformer. Code available at: https://github.com/RozhanAhmadi/SWTformer
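To make the patch-token idea concrete, here is a minimal sketch in plain Python of how class scores and CAMs could be derived from patch tokens alone, with no class token. The abstract does not specify the classifier head, so the per-class linear weights, the global average pooling used for the image-level score, and the ReLU plus max normalization are assumptions made for illustration; `patch_token_cam` and its parameters are hypothetical names, not the authors' API.

```python
def patch_token_cam(tokens, weights, grid_h, grid_w):
    """Sketch: class scores and CAMs from patch tokens only.

    tokens  : list of grid_h * grid_w feature vectors (each of length D)
    weights : list of C per-class weight vectors (each of length D);
              an assumed linear head, not taken from the paper
    Returns (scores, cams); cams[c] is a grid_h x grid_w nested list in [0, 1].
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("expected one token per grid cell")
    scores, cams = [], []
    for w in weights:
        # Per-patch class response: dot product of token and class weights.
        raw = [sum(t_i * w_i for t_i, w_i in zip(tok, w)) for tok in tokens]
        # Image-level class score: global average pooling over all patches.
        scores.append(sum(raw) / len(raw))
        # CAM: keep positive evidence only, then scale by the peak value.
        pos = [max(r, 0.0) for r in raw]
        peak = max(pos)
        norm = [p / peak if peak > 0 else 0.0 for p in pos]
        # Reshape the flat token sequence back into the spatial grid.
        cams.append([norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)])
    return scores, cams


# Toy example: a 2x2 grid of 2-dimensional tokens and two classes.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 2.0]]
weights = [[1.0, -1.0], [0.0, 1.0]]
scores, cams = patch_token_cam(tokens, weights, 2, 2)
```

In this toy input the second class responds most strongly at the bottom-right patch, so its map peaks there, while its image-level score is just the mean of the same per-patch responses. A real model would take the tokens from the final Swin stage and train the head with a multi-label classification loss.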

Paper Structure (21 sections, 5 equations, 4 figures, 3 tables)


Introduction
Related Work
Vision Transformers
Weakly Supervised Semantic Segmentation with CNNs
Weakly Supervised Semantic Segmentation with ViTs
Proposed Method
Overview
Generating Class Activation Maps from Patch Tokens
Multi-label Classification Training
Hierarchical Feature Fusion
Background-aware Prototype Exploration
Experiments
Dataset
Evaluation Metrics
Implementation Details
...and 6 more sections

Figures (4)

Figure 1: Class activation maps generated by (a) a CNN (ResNet-50), (b) a ViT (DeiT-S), and (c) an HVT (Swin-T). Red and yellow boxes indicate large- and small-scale objects, respectively, relative to the image size.
Figure 2: An overview of the proposed SWTformer (V2). The backbone is the Swin-T version of the Swin Transformer, and training is optimized with the CLS, GSC, and CCL loss functions. The “Structure-aware seed locating” and “Background-aware prototype modeling” modules are adopted from SIPE with modifications.
Figure 3: Illustration of the proposed hierarchical feature fusion (HFF) module in SWTformer.
Figure 4: Qualitative results of the class activation maps generated by SWTformer on PASCAL VOC 2012 train set. Images contain singular or multiple class labels.
