Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation

Lechao Cheng, Zerun Liu, Jingxuan He, Chaowei Fang, Dingwen Zhang, Meng Wang

TL;DR

This work tackles weakly supervised semantic segmentation (WSSS) with vision transformers, identifying undisciplined over-smoothing in deep self-attention as a source of background noise in Class Activation Map (CAM) refinement. It introduces the Adaptive Re-Activation Mechanism (AReAM), which uses entropy-based re-activation weights $w^{(l)}$ and counterpart weights $\hat{w}^{(l)}$ to couple shallow- and deep-layer affinities, guiding deep-layer attention to converge on semantic regions. The method includes per-layer affinity regularization and a random-walk-style refinement, and it calibrates pseudo-labels with Pixel-Adaptive Refinement for training. Experiments on PASCAL VOC 2012 and MS COCO 2014 show significant improvements over prior WSSS approaches, supporting the claim that disciplined affinity fusion sharpens CAMs and reduces background noise.

Abstract

Weakly supervised semantic segmentation (WSSS) has recently attracted considerable attention because it requires fewer annotations than fully supervised approaches, making it especially promising for large-scale image segmentation tasks. Although many vision transformer-based methods leverage self-attention affinity matrices to refine Class Activation Maps (CAMs), they often treat each layer's affinity equally and thus introduce considerable background noise at deeper layers, where attention tends to converge excessively on certain tokens (i.e., over-smoothing). We observe that this deep-level attention naturally converges on a subset of tokens, yet unregulated query-key affinity can generate unpredictable activation patterns (undisciplined over-smoothing), adversely affecting CAM accuracy. To address these limitations, we propose an Adaptive Re-Activation Mechanism (AReAM), which exploits shallow-level affinity to guide deeper-layer convergence in an entropy-aware manner, thereby suppressing background noise and re-activating crucial semantic regions in the CAMs. Experiments on two commonly used datasets demonstrate that AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions. Overall, this work underscores the importance of controlling deep-level attention to mitigate undisciplined over-smoothing, introduces an entropy-aware mechanism that harmonizes shallow and deep-level affinities, and provides a refined approach to enhance transformer-based WSSS accuracy by re-activating CAMs.
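
The entropy-aware fusion of per-layer affinities can be illustrated with a small NumPy sketch. This summary does not give the exact definition of $w^{(l)}$, so the sketch rests on an assumed form: each layer's row-normalized affinity matrix is scored by its mean row entropy, and the weights are a softmax over the negated scores. The direction of that weighting, the `temperature` parameter, and the function names (`row_normalize`, `mean_row_entropy`, `entropy_weights`, `fuse_affinities`) are all hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def row_normalize(a, eps=1e-12):
    """Scale every row of a non-negative matrix so that it sums to (almost) 1."""
    a = np.asarray(a, dtype=np.float64)
    return a / (a.sum(axis=-1, keepdims=True) + eps)


def mean_row_entropy(a, eps=1e-12):
    """Mean Shannon entropy (in nats) over the rows of a row-stochastic matrix."""
    return float(-(a * np.log(a + eps)).sum(axis=-1).mean())


def entropy_weights(affinities, temperature=1.0):
    """Per-layer weights from affinity entropy.

    ASSUMPTION: softmax over negative mean row entropy. The paper's actual
    formula for w^(l) may differ in both form and direction.
    """
    h = np.array([mean_row_entropy(a) for a in affinities])
    logits = -h / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def fuse_affinities(affinities, weights):
    """Weighted sum of per-layer (N, N) affinity matrices."""
    stacked = np.stack(affinities)  # shape (L, N, N)
    return np.tensordot(weights, stacked, axes=1)  # shape (N, N)


# Toy data: four random "layers" over six tokens (not real attention maps).
rng = np.random.default_rng(0)
num_layers, num_tokens = 4, 6
layers = [
    row_normalize(rng.random((num_tokens, num_tokens)) ** (layer + 1))
    for layer in range(num_layers)
]
w = entropy_weights(layers)
fused = fuse_affinities(layers, w)
```

Because the weights sum to one and every layer is row-stochastic, the fused matrix stays row-stochastic, so it can still be read as an attention-like affinity over tokens.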

Paper Structure (15 sections, 15 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: We illustrate how affinity matrices at different transformer layers influence CAM refinement by reporting the mIoU (%) results for a vision transformer with 6, 8, 10, or 12 total layers in each sub-figure. In each scenario, the dotted horizontal line (non-refined) denotes the original CAM, while the solid horizontal line (refined (avg)) represents the refined CAM $\bar{\mathbf{M}}$, obtained by aggregating the average affinity matrix $\bar{\mathbf{A}}$ from all layers. The blue shaded area highlights the performance gain over the non-refined baseline, and the dashed lines (refined) provide layer-wise insights, showing that shallow-layer affinity notably improves CAM quality whereas deeper-layer affinity can diminish it. Overall, these results underscore the importance of query-key affinities in enhancing CAMs and reveal that effectively leveraging global relationships across layers remains an open challenge.
  • Figure 2: Affinity-based CAM refinement.
  • Figure 3: Analysis of the affinity matrix. Each row represents a query token while each column a key token. (left) Visualization of the normalized affinity matrix in each layer. (right) Hellinger distance within the normalized affinity matrix in each layer.
  • Figure 4: Examples of refined CAMs utilizing the affinity matrix at shallower layers (2, 4) and at deeper layers (10, 12).
  • Figure 5: Overview of the proposed AReAM. During training, AReAM applies the entropy-aware weight $w^{(l)}$ to each layer's affinity matrix to adjust its contribution. To address undisciplined over-smoothing, each layer's affinity matrix is supervised with $\hat{w}^{(l)}$ so that attended tokens converge on semantic objects. At inference, the disciplined affinity matrices of all layers are averaged to refine CAMs.
  • ...and 4 more figures
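
The captions for Figures 1 to 3 mention refining a CAM with the layer-averaged affinity matrix $\bar{\mathbf{A}}$ and measuring Hellinger distance within a normalized affinity matrix. The following is a minimal sketch under stated assumptions: the CAM is flattened to shape (tokens, classes), refinement is taken to be the matrix product $\bar{\mathbf{M}} = \bar{\mathbf{A}}\mathbf{M}$, and Hellinger distance is computed between every pair of rows (query tokens). The function names are hypothetical. The Hellinger formula $H(p, q) = \tfrac{1}{\sqrt{2}}\lVert\sqrt{p} - \sqrt{q}\rVert_2$ is the standard definition; which rows the paper actually compares is not stated in this summary.

```python
import numpy as np


def average_affinity(affinities):
    """Element-wise mean of per-layer (N, N) affinity matrices."""
    return np.mean(np.stack(affinities), axis=0)


def refine_cam(cam, affinity):
    """Propagate CAM scores along token affinities.

    cam: (N, C) scores per token and class; affinity: (N, N) row-stochastic.
    ASSUMPTION: refinement is a single matrix product.
    """
    return affinity @ cam


def pairwise_hellinger(p):
    """Hellinger distance between every pair of rows of a row-stochastic matrix."""
    s = np.sqrt(p)
    diff = s[:, None, :] - s[None, :, :]  # shape (N, N, N) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1)) / np.sqrt(2.0)


# Two extreme toy affinities over four tokens.
num_tokens = 4
identity = np.eye(num_tokens)  # each token attends only to itself
uniform = np.full((num_tokens, num_tokens), 1.0 / num_tokens)  # all rows identical

cam = np.array([[0.9, 0.0],
                [0.8, 0.1],
                [0.1, 0.7],
                [0.0, 0.9]])

avg = average_affinity([identity, uniform])
refined = refine_cam(cam, avg)

dist_identity = pairwise_hellinger(identity)
dist_uniform = pairwise_hellinger(uniform)
```

In this toy setup the uniform matrix stands in for a fully over-smoothed layer: all of its rows coincide, so every pairwise distance is zero, and refining with it alone would replace each token's score with the global mean. The identity matrix is the opposite extreme, with maximally distinct rows and no propagation.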