Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
Omer Luxembourg, Haim Permuter, Eliya Nachmani
TL;DR
The paper addresses slow inference in masked diffusion language models (MDLMs) by introducing the Dilated Unmasking Scheduler (DUS), an inference-only planner that unmasks tokens in logarithmically many rounds per block, reducing denoiser calls per block of size $B$ from $O(B)$ to $O(\log B)$. It formalizes the MDLM framework, proves a joint-entropy bound under fast-mixing Markov assumptions, and relies on spacing between unmasked positions, contextual conditioning, and a skip mechanism to maintain quality. Empirically, DUS outperforms traditional self-confidence planners across math, coding, and general-knowledge tasks, delivering speedups of up to 10x while preserving or improving accuracy. This makes it a practical, training-free path to faster diffusion-based LLMs. The results highlight a new speed-quality frontier for MDLMs and motivate further exploration of inference-time planning strategies that exploit diffusion’s inherent parallelism.
Abstract
Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
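To make the idea of non-adjacent dilated groups concrete, the following is a minimal sketch of one possible partition of a block into logarithmically many rounds. It is an illustration under stated assumptions, not the paper's exact schedule: the function name `dilated_groups`, the power-of-two strides, and the choice to start from position 0 are all hypothetical. The skip mechanism and the denoiser itself are not modeled here.

```python
from typing import List


def dilated_groups(block_size: int) -> List[List[int]]:
    """Partition positions 0..block_size-1 into rounds of non-adjacent indices.

    Illustrative assumption: strides are powers of two that halve each round,
    so later rounds fill in the gaps left by earlier, coarser rounds.
    """
    if block_size < 1:
        raise ValueError("block_size must be positive")

    # Smallest power of two that is >= block_size.
    stride = 1
    while stride < block_size:
        stride *= 2

    # First round: the only multiple of `stride` inside the block is 0.
    groups = [[0]]

    # Each later round takes the odd multiples of the current stride,
    # which are spaced 2 * stride apart and therefore never adjacent.
    stride //= 2
    while stride >= 1:
        groups.append(list(range(stride, block_size, 2 * stride)))
        stride //= 2
    return groups


if __name__ == "__main__":
    for round_idx, group in enumerate(dilated_groups(8)):
        print(f"round {round_idx}: unmask positions {group}")
```

For a block of 8 positions this sketch yields 4 rounds instead of 8 sequential unmasking steps, and for a block of 64 it yields 7. In an actual sampler, each round would correspond to one denoiser call that fills every position in the group in parallel, conditioned on all tokens unmasked in earlier rounds.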
