Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
Rozhan Ahmadi, Shohreh Kasaei
TL;DR
This work tackles weakly supervised semantic segmentation (WSSS) with image-level labels by addressing a known limitation of CAMs: they rely on local discriminative cues. It introduces SWTformer, a Swin Transformer-based framework that fuses local and global context to produce improved seed CAMs. It comes in two variants: V1 generates CAMs from patch tokens, and V2 adds hierarchical feature fusion and a background-aware refinement inspired by SIPE. The method uses a multi-task loss with CLS, GSC, and CCL terms to jointly optimize classification, CAM consistency, and class-wise contrast, improving object localization (mAP) and seed-map quality (mIoU) on PASCAL VOC 2012. The contributions are (i) the first hierarchical transformer-based CAM generator for WSSS, (ii) a patch-token CAM approach that needs no class tokens, and (iii) a multi-scale, background-aware refinement that improves cross-object discrimination. The results show that SWTformer outperforms state-of-the-art transformer models on localization and provides competitive seed CAMs; code is available for reproduction.
Abstract
In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Because these labels lack spatial information, most existing methods generate pseudo-labels from class activation maps (CAMs) and use them to enable supervised learning. Due to the localized pattern detection of CNNs, CAMs often emphasize only the most discriminative parts of an object, making it challenging to accurately distinguish foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, are more effective than CNN features in capturing the scene layout. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer", which enhances the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and utilizes a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PASCAL VOC 2012 dataset, SWTformer-V1 achieves 0.98% mAP higher localization accuracy than state-of-the-art models. Relying only on the classification network, it also generates initial localization maps that are comparable to those of other methods, scoring on average 0.82% mIoU higher. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, demonstrating the effectiveness of the local-to-global view provided by the Swin Transformer. Code available at: https://github.com/RozhanAhmadi/SWTformer
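To illustrate how class scores and CAMs can be derived from patch tokens alone, without a class token, the following is a minimal pure-Python sketch. It is not the authors' implementation. The function name `patch_token_cam`, the per-class linear weights, the average pooling used for the logits, and the ReLU plus max normalization are all assumptions chosen for illustration. In the actual model the tokens would come from the Swin Transformer backbone and the weights would be learned.

```python
def patch_token_cam(tokens, class_weights, grid_h, grid_w):
    """Hypothetical sketch: class logits and CAMs from patch tokens only.

    tokens:        list of grid_h * grid_w patch tokens, each a list of D floats
    class_weights: list of C weight vectors, each a list of D floats
    Returns (logits, cams): logits has length C, cams is C x grid_h x grid_w.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("expected grid_h * grid_w patch tokens")

    logits, cams = [], []
    for w in class_weights:
        # Per-patch score for this class: dot product of token and class weight.
        raw = [sum(t * wd for t, wd in zip(tok, w)) for tok in tokens]
        # Assumption: the image-level logit is the average of the patch scores.
        logits.append(sum(raw) / len(raw))
        # Assumption: the CAM keeps positive evidence only, scaled to [0, 1].
        pos = [max(0.0, s) for s in raw]
        peak = max(pos)
        norm = [s / peak if peak > 0 else 0.0 for s in pos]
        # Reshape the flat token sequence back into the spatial grid.
        cams.append([norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)])
    return logits, cams


# Toy input: a 2x2 grid of 2-dimensional tokens and 2 classes.
tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
logits, cams = patch_token_cam(tokens, class_weights, grid_h=2, grid_w=2)
```

In this toy input the first class activates only on the top row of the grid and the second class only on the bottom row, so the two CAMs highlight disjoint regions.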
