Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation
Ali Torabi, Sanjog Gaihre, Yaqoob Majeed
TL;DR
CrispFormer tackles weakly supervised semantic segmentation by embedding three decoder-centered mechanisms—boundary-aware supervision, uncertainty-guided refinement, and dynamic multi-scale fusion—into the SegFormer decoder without altering the backbone or relying on heavy post-processing. The approach uses an EMA teacher-guided seed/relabeling loop and a lightweight, per-pixel uncertainty model to soften supervision and gate residual corrections, yielding crisper boundaries and higher mIoU with minimal overhead. Empirical results on VOC2012 and COCO2014 show consistent improvements in boundary fidelity, small-object recall, and overall segmentation quality, approaching the performance of some fully supervised baselines while maintaining single-pass inference. The method demonstrates practical impact by delivering high-fidelity masks efficiently, with strong calibration and robust performance across datasets, and offers a reproducible path to higher-fidelity masks from image-level supervision.
Abstract
Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective-without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
