Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen
TL;DR
Masked diffusion models enable any-order generation in discrete spaces, but they train over an exponentially large set of masking patterns, and the random masks seen in training differ from the structured masks that arise during inference-time unmasking. Progressive UnMAsking (PUMA) addresses this train-test mismatch by using a teacher-forced chain to align training masks with inference-time unmasking, while preserving the unmasking posterior and the training objective. The approach is supported by marginal-agreement and minimizer-preservation guarantees and speeds up pretraining by roughly 2.5x at the 125M scale, while remaining compatible with autoregressive initialization and block-size curricula. The work presents this as a practical, scalable design axis for accelerating discrete diffusion model training and releases open-source code for replication.
Abstract
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train-test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
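
To make the contrast between random training masks and structured inference-time masks concrete, the toy sketch below may help. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the procedure, so `random_mask`, `teacher_forced_chain`, and the stand-in scoring function `left_to_right_confidence` are all hypothetical names, and the real method presumably uses model-derived scores rather than a fixed ordering. The first function shows the standard independent masking that yields up to $2^L$ patterns for a length-$L$ sequence; the second shows one way a chain of progressively unmasked states could be built while filling in ground-truth tokens.

```python
import random

MASK = "[MASK]"

def random_mask(tokens, mask_rate, rng):
    """Independent per-position masking, as in standard MDM training."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

def teacher_forced_chain(tokens, confidence_fn, tokens_per_step=1):
    """Hypothetical sketch of a progressive unmasking chain.

    Starts fully masked and repeatedly reveals the highest-scoring masked
    positions, filling in the ground-truth token (teacher forcing) rather
    than a model sample. Returns every intermediate state.
    """
    state = [MASK] * len(tokens)
    chain = [list(state)]
    while MASK in state:
        masked = [i for i, tok in enumerate(state) if tok == MASK]
        scores = confidence_fn(state)
        chosen = sorted(masked, key=lambda i: scores[i], reverse=True)
        for i in chosen[:tokens_per_step]:
            state[i] = tokens[i]  # reveal the true token at this position
        chain.append(list(state))
    return chain

def left_to_right_confidence(state):
    """Toy stand-in for model confidence: earlier positions score higher."""
    n = len(state)
    return [n - i for i in range(n)]

tokens = ["the", "cat", "sat", "down"]

# Independent masking can produce any of 2^L mask patterns.
num_patterns = 2 ** len(tokens)

# A chain visits only L + 1 states when one token is revealed per step.
chain = teacher_forced_chain(tokens, left_to_right_confidence)

if __name__ == "__main__":
    rng = random.Random(0)
    print("random mask:", random_mask(tokens, 0.5, rng))
    for step, state in enumerate(chain):
        print(step, state)
```

In this toy setting the chain touches only a handful of the possible mask patterns, which is the intuition behind focusing optimization on inference-aligned masks. How PUMA actually selects positions and preserves the training objective is specified in the paper and the released codebase, not here.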
