MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation

Zhiwei Yang, Yucong Meng, Kexue Fu, Shuo Wang, Zhijian Song

TL;DR

MoRe addresses artifact activations in Localization Attention Maps (LAM) produced by ViT-based weakly supervised semantic segmentation. It introduces Graph Category Representation (GCR) to model class-patch interactions as a dynamic directed graph and Localization-informed Regularization (LIR) to explicitly align class and patch tokens using CAM cues, enabling end-to-end training. The method substantially improves pseudo-label quality and semantic segmentation on PASCAL VOC and MS COCO, outperforming both single-stage and many multi-stage approaches. This work advances WSSS by combining graph-based and CAM-guided regularization to produce high-fidelity localizations with better object delineation and fewer false activations.

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments are conducted on PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods. Code is available at https://github.com/zwyang6/MoRe.
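To make the notion of a Localization Attention Map concrete, the following is a minimal sketch of scoring each patch token against one class token and reshaping the scores onto the patch grid. It is an illustration under stated assumptions, not the authors' implementation: the cosine scoring, the clipping of negative scores, the max-normalization, and the names `cosine` and `localization_attention_map` are all hypothetical choices made here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def localization_attention_map(class_token, patch_tokens, grid_h, grid_w):
    """Score every patch against one class token and reshape to the patch grid.

    Assumption: similarity is cosine, negative scores are clipped to zero,
    and the map is divided by its maximum. The paper may score differently.
    """
    assert len(patch_tokens) == grid_h * grid_w
    scores = [max(cosine(class_token, p), 0.0) for p in patch_tokens]
    peak = max(scores)
    if peak > 0:
        scores = [s / peak for s in scores]
    # Row-major reshape from a flat patch sequence to a grid_h x grid_w map.
    return [scores[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: a 2x2 patch grid with 3-dimensional tokens.
class_token = [1.0, 0.0, 0.0]
patch_tokens = [
    [2.0, 0.0, 0.0],   # aligned with the class token
    [0.0, 1.0, 0.0],   # orthogonal: no activation
    [1.0, 1.0, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction: clipped to zero
]
lam = localization_attention_map(class_token, patch_tokens, 2, 2)
```

In this toy setting an artifact would be a patch such as the second one receiving a high score despite carrying no class-relevant content, which is the failure mode the regularization targets.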

Paper Structure

This paper contains 29 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our motivation. Localization Attention Maps (LAM) from ViT provide an alternative to CAM. (a) Since no regularization is imposed on class-patch attention, (b) LAM particularly suffers from the artifact issue. (c) We propose MoRe to tackle it and generate better LAM by regularizing attention among class-patch tokens.
  • Figure 2: Overview of MoRe. The input image is sent to the ViT encoder, which generates multi-class tokens and patch tokens. (a) The tokens are first sent into the Graph Category Representation (GCR) module, which treats the class-patch attention as a directed graph with projected head entities $h_i$, tail entities $t_j$, and learnable edges $e_{ij}$. (b) A Graph Aggregation mechanism then condenses the related tail semantics into the class tokens. (c) CAM is also generated from the patches. It provides the confident and uncertain relation supervision for the proposed Localization-informed Regularization (LIR) module, which has two objectives, $\mathcal{L}_{cre}$ and $\mathcal{L}_{ure}$. Finally, LAM is generated from the similarity score map between class and patch tokens and is used to train a segmentation decoder.
  • Figure 3: Visualization of LAM. (a) Image. (b) LAM on ViT pretrained on ImageNet. (c) LAM with PTC loss [19] for tackling over-smoothness of ViT. (d) LAM on DeiT. (e) Patch CAM from MCTformer+ [22]. (f) Refined LAM with a fusion strategy that multiplies LAM and CAM. (g) LAM with our designed LIR module. (h) LAM with both our designed LIR and GCR modules. (i) Ground truth. More visualized results are showcased in the Appendix.
  • Figure 4: Segmentation visualization compared with SOTA single-stage methods (i.e., ToCo and DuPL) on VOC and COCO.
  • Figure 5: Comparison of multi-class token representations from MCTformer+ and MoRe on the VOC train set, visualized with t-SNE.
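The Figure 2 caption states that a CAM generated from patches supplies confident and uncertain relation supervision. One way to picture that step is the sketch below, which computes a patch-level CAM from classification weights and then partitions patches by score. It is a rough illustration only: the ReLU-and-normalize CAM, the thresholds `low=0.3` and `high=0.7`, and the function names `patch_cam` and `split_relations` are assumptions introduced here, and the objectives $\mathcal{L}_{cre}$ and $\mathcal{L}_{ure}$ themselves are not reproduced.

```python
def patch_cam(patch_tokens, class_weights):
    """Per-class CAM over patches: ReLU(w_c . x_p), divided by the class maximum.

    Assumption: CAM is a linear projection of patch tokens by the
    classifier weights; the paper's exact formulation may differ.
    """
    cams = []
    for w in class_weights:
        row = [max(sum(a * b for a, b in zip(w, p)), 0.0) for p in patch_tokens]
        peak = max(row)
        cams.append([s / peak if peak > 0 else 0.0 for s in row])
    return cams


def split_relations(cam_row, low=0.3, high=0.7):
    """Partition patch indices by CAM score for a single class.

    The thresholds are hypothetical values chosen for illustration.
    """
    confident_fg = [i for i, s in enumerate(cam_row) if s >= high]
    confident_bg = [i for i, s in enumerate(cam_row) if s <= low]
    uncertain = [i for i, s in enumerate(cam_row) if low < s < high]
    return confident_fg, confident_bg, uncertain


# Toy example: one class, four patches with 2-dimensional tokens.
class_weights = [[1.0, 0.0]]
patch_tokens = [[4.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 0.0]]

cam = patch_cam(patch_tokens, class_weights)
fg, bg, unc = split_relations(cam[0])
```

Under these assumptions, confidently foreground and background patches would give firm targets for class-patch consistency, while the uncertain set would need softer treatment.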