Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training

Jaeyeon Kim; Jonathan Geuter; David Alvarez-Melis; Sham Kakade; Sitan Chen

TL;DR

Masked diffusion models enable any-order generation in discrete spaces but require training over an exponential set of masking patterns, creating a train-test mismatch. Progressive UnMAsking (PUMA) addresses this by using a teacher-forced chain to align training masking with inference-time unmasking, while preserving the unmasking posterior and the training objective. The approach is supported by marginal-agreement and minimizer-preservation guarantees and yields up to 2.5x faster pretraining at 125M scale, while remaining compatible with autoregressive initialization and block-size curricula. This work introduces a practical, scalable design axis for accelerating discrete diffusion model training and includes open-source code for replication.

Abstract

Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train-test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
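To make the contrast between random masking and inference-aligned masking concrete, here is a minimal Python sketch. It is an illustration built only from the description in this summary, not the authors' implementation: the names `MASK`, `random_mask`, `teacher_forced_chain`, and `left_to_right_policy` are hypothetical, the toy policy stands in for the model's learned unmasking policy, and revealing `k` positions per step is an assumption.

```python
import random

MASK = -1  # hypothetical mask token id (assumption for this sketch)


def random_mask(x0, rng):
    """Standard-MDM-style masking: draw a rate t, then mask each position independently."""
    t = rng.random()
    return [MASK if rng.random() < t else tok for tok in x0]


def teacher_forced_chain(x0, policy, k=1):
    """Build partially masked states along an unmasking trajectory.

    `policy(state)` returns one score per position. At each step the k
    highest-scoring masked positions are revealed using the clean tokens
    of x0 (teacher forcing), not model samples.
    """
    state = [MASK] * len(x0)
    chain = [list(state)]
    while MASK in state:
        scores = policy(state)
        masked = [i for i, tok in enumerate(state) if tok == MASK]
        chosen = sorted(masked, key=lambda i: scores[i], reverse=True)[:k]
        for i in chosen:
            state[i] = x0[i]  # fill with the ground-truth token
        chain.append(list(state))
    return chain


def left_to_right_policy(state):
    """Toy stand-in for a learned policy: prefer earlier positions."""
    n = len(state)
    return [n - i for i in range(n)]


if __name__ == "__main__":
    x0 = [5, 6, 7]
    print(random_mask(x0, random.Random(0)))
    for s in teacher_forced_chain(x0, left_to_right_policy):
        print(s)
```

In this sketch, the states of the chain that still contain at least one masked position would serve as training inputs, so the masks seen in training follow the same structured order that the policy would produce at inference time.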

Paper Structure

This paper contains 28 sections, 8 theorems, 37 equations, 7 figures, and 1 table.

Introduction
Preliminaries
Prior work on accelerating MDM training
PUMA: Progressive UnMAsking
Theoretical foundation of PUMA
PUMA's marginal agreement property
PUMA's minimizer guarantee
Empirical instantiation of PUMA
Practical interventions
Sample complexity of PUMA
Experiments
Sudoku puzzle as a testbed
PUMA accelerates pretraining
Ablation study
Combining PUMA with other remedies
...and 13 more sections

Key Result

Proposition 1

Fix a policy $g_\phi$ and consider an idealized MDM inference procedure that, at step (b), samples clean tokens from the ground-truth unmasking posterior $p({\mathbf{x}}_0^i=\cdot\,|\,{\mathbf{z}})$. Let $q_{t}$ denote the distribution of ${\mathbf{x}}_{t}$ under this idealized inference. Then, for

Figures (7)

Figure 1: PUMA (blue) accelerates Masked Diffusion Model training (red) by changing the forward process. The experiment is at the 125M scale, with models trained from scratch on TinyGSM (Liu et al., 2023). PUMA is also compatible with autoregressive initialization (purple curve versus green curve).
Figure 2: Illustration of PUMA. Under PUMA, training examples are generated via a teacher-forced chain for a given ${\mathbf{x}}_0$ (purple), using the current model's policy $g_\phi$ and ${\mathbf{x}}_0$'s clean tokens. In contrast, standard MDM training draws training samples independently by random masking.
Figure 3: PUMA finds unmasking trajectories close to the final model's trajectories (right) early on in training. We show PUMA training unmasking orders at different training steps for a given Sudoku puzzle.
Figure 4: Left: PUMA’s efficiency is largely insensitive to the confidence threshold, except when the threshold is very small. Middle, Right: A single PUMA-trained model (trained under one policy) remains robust to inference-time policy choices, consistently outperforming the baseline across different unmasking policies (Top-K margin, Entropy).
Figure 5: PUMA speeds up training on Sudoku by $1.4\times$.
...and 2 more figures

Theorems & Definitions (15)

Proposition 1: Informal
Proposition 2: Informal
Proposition 3: Informal
Proposition 1: Formal
proof
Lemma 4: Weighted cross-entropy is minimized by the true conditional
proof
Proposition 2: Formal
proof
Lemma 5
...and 5 more
