Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Omer Luxembourg, Haim Permuter, Eliya Nachmani

TL;DR

The paper tackles slow inference in non-autoregressive, diffusion-based language models by introducing the Dilated Unmasking Scheduler (DUS), an inference-only planner that unmasks tokens in logarithmically many rounds per block, reducing denoiser calls per block of size $B$ from $O(B)$ to $O(\log B)$. It formalizes the MDLM framework, proves a joint-entropy bound under fast-mixing Markov assumptions, and relies on spacing between unmasked positions, contextual conditioning, and a skip mechanism to maintain quality. Empirically, DUS outperforms traditional self-confidence planners across math, coding, and general-knowledge tasks while delivering speedups of up to 10x with improved or preserved accuracy, demonstrating a practical, training-free path to faster diffusion-based LLMs. The results highlight a new speed-quality frontier for MDLMs and motivate further exploration of inference-time planning strategies that exploit diffusion’s inherent parallelism.

Abstract

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
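
To make the idea of non-adjacent dilated groups concrete, here is a minimal sketch of one way to build such a schedule. It is an illustration under stated assumptions, not necessarily the partition used in the paper: the function name `dilated_groups`, the `base` parameter, and the coarse-to-fine stride rule are all hypothetical choices, and the sketch ignores the denoiser, the skip mechanism, and the entropy bound entirely.

```python
def dilated_groups(block_size: int, base: int = 2) -> list[list[int]]:
    """Split positions 0..block_size-1 into coarse-to-fine dilated groups.

    Hypothetical illustration: round 1 takes widely spaced positions, and
    each later round fills in positions at a stride `base` times smaller.
    """
    if block_size < 1 or base < 2:
        raise ValueError("block_size must be >= 1 and base must be >= 2")

    # Largest power of `base` that is strictly below block_size (or 1).
    stride = 1
    while stride * base < block_size:
        stride *= base

    groups: list[list[int]] = []
    seen: set[int] = set()
    while stride >= 1:
        # Positions on the current stride grid not unmasked in earlier rounds.
        group = [p for p in range(0, block_size, stride) if p not in seen]
        if group:
            groups.append(group)
            seen.update(group)
        stride //= base
    return groups


if __name__ == "__main__":
    for round_idx, group in enumerate(dilated_groups(8), start=1):
        print(f"round {round_idx}: unmask positions {group}")
```

In this sketch each group would correspond to one denoiser call, so a block of 8 positions is covered in 3 rounds and a block of 32 in 5, rather than one call per position. Positions within a group are never neighbours once the block has at least 4 positions, which is the spacing property the abstract refers to.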

Paper Structure

This paper contains 20 sections, 2 theorems, 15 equations, 1 figure, 5 tables.

Key Result

Lemma 1

Under a fast-mixing first-order Markov chain, let $i_1<\dots<i_k$ be the indices selected by DUS. Then

Figures (1)

  • Figure 1: Experiments on GSM8K, HumanEval, and MBPP for various speedup factors, which are determined by the semi-AR inference block size. Higher score (Accuracy / Pass@1) is better. Each color represents a different model, while the marker indicates which of the two planners was tested: self-confidence $\blacksquare$ or DUS $\blacktriangle$. Across all datasets and speedups, DUS achieves higher scores than the traditional planner.

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2