Masked Diffusion as Self-supervised Representation Learner
Zixuan Pan, Jianxu Chen, Yiyu Shi
TL;DR
This work introduces the Masked Diffusion Model (MDM), a self-supervised pre-training paradigm for semantic segmentation that replaces additive Gaussian noise with a masking corruption strategy and trains with a structural similarity (SSIM) loss to better align pre-training with downstream tasks. The pre-trained MDM is frozen and used as a representation generator, and only a lightweight segmentation head is trained on top of it; this setup achieves state-of-the-art results on medical and natural image segmentation benchmarks, particularly in few-shot settings. The key findings are that diffusion denoising is not strictly necessary for high-quality semantic representations, that masking-based pre-training can outperform MAE and DDPM, and that the SSIM loss is crucial for bridging reconstruction and segmentation. The approach has practical implications for label-efficient dense prediction, with potential extensions to other architectures and data domains.
Abstract
Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners. This paper disentangles the relationship between the generative capability and the representation learning ability of diffusion models. We present the masked diffusion model (MDM), a scalable self-supervised representation learner for semantic segmentation that replaces the additive Gaussian noise of conventional diffusion with a masking mechanism. The proposed approach surpasses prior results on both medical and natural image semantic segmentation tasks, particularly in few-shot scenarios.
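To make the two ingredients named above concrete, the following is a minimal NumPy sketch of a masking corruption and an SSIM-based reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mask_patches`, `global_ssim`, `ssim_loss`), the patch size, the mask ratio, the zero fill value, and the use of a single global window for SSIM (rather than the local sliding windows of standard SSIM) are all choices made here for brevity.

```python
import numpy as np


def mask_patches(image, mask_ratio=0.5, patch_size=4, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper: the patch size, the mask ratio, and the zero
    fill value are illustrative assumptions, not values from the paper.
    Expects a 2-D (grayscale) array whose sides are multiples of patch_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide, then expand the patch grid to pixels.
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    mask = flat.reshape(gh, gw)
    mask = mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)

    corrupted = np.where(mask, 0.0, image)
    return corrupted, mask


def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (one global window).

    Simplification: standard SSIM averages this quantity over local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(target, reconstruction, data_range=1.0):
    """Loss that falls to zero as the reconstruction matches the target."""
    return 1.0 - global_ssim(target, reconstruction, data_range)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((16, 16))
    corrupted, mask = mask_patches(clean, mask_ratio=0.5, patch_size=4, rng=rng)
    # A model would map `corrupted` back to `clean`; here the corrupted
    # input stands in for an untrained reconstruction.
    print("masked fraction:", mask.mean())
    print("SSIM loss of the unrestored input:", ssim_loss(clean, corrupted))
```

In a pre-training loop, a network would take `corrupted` as input and be optimized to minimize `ssim_loss` against `clean`; the frozen network's intermediate features would then feed the segmentation head described in the TL;DR.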
