Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

Ivan Sedykh, Nikita Sorokin, Valentin Malykh

Abstract

Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. On OpenWebText, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality as measured by generative perplexity.
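
Concretely, model scheduling amounts to a single per-step decision during sampling. The following is a minimal sketch, not the authors' implementation: the denoiser interface `model(z, t)` and the uniform unmasking rule are assumptions standing in for the actual MDLM sampler; only the per-step choice between the 12-block heavy model and the 4-block light model reflects the scheduling idea studied here.

```python
import torch

def sample_with_schedule(heavy_model, light_model, schedule, seq_len, mask_id,
                         num_steps=1000, device="cpu"):
    """Sketch of model-scheduled MDLM sampling.

    schedule[i] selects the denoiser for step i: "H" (heavy, 12 blocks) or
    "L" (light, 4 blocks). The unmasking rule below is a simplified stand-in
    for the actual MDLM reverse process; it reveals masked positions
    uniformly so the sequence is fully unmasked after num_steps steps.
    """
    z = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for i in range(num_steps):
        t = 1.0 - i / num_steps                  # noise level: t ~ 1 is fully masked
        model = heavy_model if schedule[i] == "H" else light_model
        logits = model(z, t)                     # assumed signature: (1, seq_len, vocab)
        proposal = torch.distributions.Categorical(logits=logits).sample()
        masked = (z == mask_id).nonzero(as_tuple=False)   # positions still masked
        n_reveal = max(1, masked.size(0) // (num_steps - i))
        pick = masked[torch.randperm(masked.size(0), device=device)[:n_reveal]]
        z[pick[:, 0], pick[:, 1]] = proposal[pick[:, 0], pick[:, 1]]
        if (z != mask_id).all():                 # everything decoded; stop early
            break
    return z

# Example: the "sandwich" schedule (L125, H750, L125) from Figure 1.
sandwich = ["L"] * 125 + ["H"] * 750 + ["L"] * 125
```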

Paper Structure

This paper contains 28 sections, 8 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Generative perplexity (GPT-2) for hand-crafted model schedules using a heavy 12-block model and a light 4-block model with exactly 250/1000 light steps (16.7% saved FLOPs; the arithmetic is sketched after this list). Each bar label encodes a schedule as contiguous segments, e.g., $(\mathrm{L}125,\mathrm{H}750,\mathrm{L}125)$ denotes the sandwich schedule (125 light steps, 750 heavy steps, 125 light steps). Placing all light steps in the 2nd or 3rd quarter yields the worst perplexity. Error bars correspond to 95% confidence intervals.
  • Figure 2: Comparison of the 5 best (left) and 5 worst (right) model scheduling configurations among the 210 coarse schedules. Each row shows one configuration. Red bars indicate light (4-block) model placement. Segments 0--9 correspond to steps 0--100, 100--200, …, 900--1000, where segment 0 is closest to the fully masked state ($t\approx 1$). Best configurations concentrate light segments near both ends, while worst configurations place light segments in the middle.
  • Figure 3: Segment frequency in the top 20 best-performing configurations (lowest perplexity). Bars show how often each segment is assigned to the light (4-block) model across the top-20 schedules. Higher frequency suggests that replacing this segment is relatively safe.
  • Figure 4: Segment frequency in the bottom 20 worst-performing configurations (highest perplexity). Bars show how often each segment is assigned to the light (4-block) model across the bottom-20 schedules. Higher frequency suggests that replacing this segment is harmful.
  • Figure 5: Mean absolute difference in masked-token cross-entropy between each light model and the heavy 12-block baseline across timesteps (see the sketch after this list). Each curve compares one light model to the baseline, evaluated on the same corrupted inputs $z_t$. Lower values indicate higher similarity.
  • ...and 6 more figures
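
The 16.7% figure in the Figure 1 caption follows directly from the step counts if one assumes that per-step cost scales with the number of Transformer blocks (ignoring embeddings and the output head); a quick check under that assumption:

```python
def saved_flops_fraction(num_light_steps, total_steps=1000,
                         light_blocks=4, heavy_blocks=12):
    """Fraction of FLOPs saved by running `num_light_steps` denoising passes
    with the light model, assuming per-step cost ~ number of blocks."""
    baseline = total_steps * heavy_blocks
    scheduled = (total_steps - num_light_steps) * heavy_blocks \
                + num_light_steps * light_blocks
    return 1.0 - scheduled / baseline

print(saved_flops_fraction(250))  # -> 0.1666..., i.e. the ~16.7% quoted above
```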
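
Similarly, the per-timestep similarity measure from Figure 5 can be written compactly. This is a sketch under an assumed denoiser interface (`model(z_t, t)` returning vocabulary logits), shown for a single timestep and batch; the figure reports this quantity across timesteps.

```python
import torch
import torch.nn.functional as F

def masked_ce(model, z_t, x0, t, mask_id):
    """Cross-entropy at the masked positions of z_t against the clean tokens x0
    (assumed interface: model(z_t, t) -> logits of shape (B, L, V))."""
    logits = model(z_t, t)
    masked = z_t == mask_id
    return F.cross_entropy(logits[masked], x0[masked])

def ce_gap(light_model, heavy_model, z_t, x0, t, mask_id):
    """Absolute difference in masked-token cross-entropy between a light model
    and the heavy baseline on the same corrupted input z_t (as in Figure 5)."""
    with torch.no_grad():
        gap = masked_ce(light_model, z_t, x0, t, mask_id) \
            - masked_ce(heavy_model, z_t, x0, t, mask_id)
    return gap.abs().item()
```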