Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation

Ali Torabi, Sanjog Gaihre, Yaqoob Majeed

TL;DR

CrispFormer tackles weakly supervised semantic segmentation by embedding three decoder-centered mechanisms (boundary-aware supervision, uncertainty-guided refinement, and dynamic multi-scale fusion) into the SegFormer decoder without altering the backbone or relying on heavy post-processing. The approach uses an EMA teacher-guided seed/relabeling loop and a lightweight per-pixel uncertainty model to soften supervision and gate residual corrections, yielding crisper boundaries and higher mIoU with minimal overhead. Empirical results on VOC2012 and COCO2014 show consistent improvements in boundary fidelity, small-object recall, and overall segmentation quality, approaching the performance of some fully supervised baselines while maintaining single-pass inference. The method is well calibrated and robust across datasets, and it offers a reproducible path to higher-fidelity masks from image-level supervision.
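The EMA teacher and relabeling loop mentioned above can be illustrated with a minimal NumPy sketch. The summary does not give the update rule or the relabeling criterion, so the function names `ema_update` and `relabel`, the `decay` value, and the confidence `threshold` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of weights: t <- decay * t + (1 - decay) * s.

    `teacher` and `student` are dicts mapping parameter names to arrays
    (a hypothetical stand-in for real model state).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

def relabel(teacher_probs, seed_labels, threshold=0.9):
    """Assumed relabeling rule: trust the teacher only where it is confident.

    teacher_probs: (K, H, W) class probabilities from the teacher.
    seed_labels:   (H, W) integer pseudo-labels from the seed stage.
    """
    confidence = teacher_probs.max(axis=0)     # (H, W) top-class probability
    prediction = teacher_probs.argmax(axis=0)  # (H, W) top-class index
    # Keep the original seed label wherever the teacher is below threshold.
    return np.where(confidence >= threshold, prediction, seed_labels)
```

In this sketch the teacher changes slowly because `decay` is close to 1, which is the usual reason an EMA teacher is preferred over the raw student as a source of pseudo-labels.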

Abstract

Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
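To make item (3) concrete, here is a small NumPy sketch of one plausible reading of "spatial softmax gating": at every pixel, a softmax over the scale axis produces weights that blend the per-scale features. The function name `spatial_softmax_fusion` is hypothetical, and the gate logits are passed in directly, whereas in a real decoder they would presumably come from a small learned layer; the abstract does not specify either detail.

```python
import numpy as np

def spatial_softmax_fusion(features, gate_logits):
    """Blend multi-scale features with per-pixel softmax weights.

    features:    (S, C, H, W) features from S scales, already resized
                 to a common resolution.
    gate_logits: (S, H, W) one unnormalized score per scale and pixel
                 (assumed input; learned in a real model).
    Returns the fused (C, H, W) map and the (S, H, W) weights.
    """
    # Numerically stable softmax across the scale axis.
    shifted = gate_logits - gate_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Broadcast weights over channels and sum out the scale axis.
    fused = (weights[:, None, :, :] * features).sum(axis=0)
    return fused, weights

# Usage: four scales, eight channels, a 16x16 grid.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 16, 16))
logits = rng.normal(size=(4, 16, 16))
fused, weights = spatial_softmax_fusion(feats, logits)
```

Unlike static concatenation, the output keeps `C` channels regardless of the number of scales, and each location can favor a different scale.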

Paper Structure

This paper contains 22 sections, 13 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Qualitative comparison of our weakly supervised semantic segmentation results on PASCAL VOC 2012. From left to right: input image, ground truth, fully supervised SegFormer-B5 baseline, and our WSSS method trained with only image-level labels.
  • Figure 2: Decoder overview. MiT encoder features $\{C_i\}$ are projected to $1/4$ resolution tokens $\{E_i\}$, fused by DMF, and decoded by three lightweight heads: segmentation logits $Z$, variance-driven refinement, and boundary supervision. All components stay inside the SegFormer decoder.
  • Figure 3: Qualitative comparisons with confidence cues. From left to right: input image, CLIMS, CLIP-ES, ground truth (GT), our prediction, and our confidence map (higher values = more uncertain). Our decoder yields complete, compact objects with crisp but non-jagged boundaries across multi-monitor, railcar, occlusion, and rider scenes. High values in the map concentrate on truly ambiguous regions (contours, thin parts, occlusions), aligning with our uncertainty-guided refinement and boundary-aware training. No test-time CRF or multi-pass inference is used.
  • Figure 4: Supplementary VOC12 smoothness study. Each panel shows the input, CLIMS, CLIP-ES, ground truth, our prediction, and the TV/Compactness scores, highlighting sharper boundaries without test-time post-processing.
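The variance-driven refinement head named in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the gate is taken to be a sigmoid of a predicted log-variance, and the loss uses a common heteroscedastic-style attenuation (`exp(-log_var) * CE + log_var`). Neither form is confirmed by the text here, and the names `gated_refine`, `per_pixel_ce`, and `uncertainty_weighted_loss` are hypothetical.

```python
import numpy as np

def gated_refine(logits, residual, log_var):
    """Add a residual correction, scaled up where uncertainty is high.

    logits, residual: (K, H, W); log_var: (H, W) predicted log-variance.
    The sigmoid gate is an assumed choice, not taken from the paper.
    """
    gate = 1.0 / (1.0 + np.exp(-log_var))  # in (0, 1), larger when uncertain
    return logits + gate[None, :, :] * residual

def per_pixel_ce(logits, labels):
    """Cross-entropy per pixel for (K, H, W) logits and (H, W) int labels."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(labels.shape)
    return -log_probs[labels, rows, cols]

def uncertainty_weighted_loss(logits, labels, log_var):
    """Assumed attenuation: uncertain pixels get a smaller loss weight,
    and the additive log_var term discourages predicting high variance
    everywhere."""
    ce = per_pixel_ce(logits, labels)
    return float(np.mean(np.exp(-log_var) * ce + log_var))
```

With this form, a pixel carrying a noisy pseudo-label can lower its contribution to the loss by predicting higher variance, which is one way to read the "softened supervision" described in the TL;DR.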