
Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models

Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Enze Ma, Leyi Pan, Chunyu Miao, Wei-Chieh Huang, Xue Liu, Philip S. Yu

Abstract

Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@$1$) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@$k$), creating a fundamental quality--exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis--Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields better exploration-quality tradeoff than both random and low-confidence remasking.
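The Independent Metropolis-Hastings step described above can be sketched in miniature. Everything below (the function names, the two-outcome "decoder" `coin`, the tempering exponent `alpha`) is a hypothetical stand-in for illustration, not the paper's implementation: when the target is a power $q^\alpha$ of the proposal $q$ itself, the MH acceptance ratio simplifies to $(q(x')/q(x))^{\alpha-1}$.

```python
import math
import random

def imh_sample(propose, alpha=2.0, n_steps=50, seed=0):
    """Toy Independent Metropolis-Hastings chain targeting pi(x) ∝ q(x)^alpha,
    where q is also the proposal (a stand-in for the decoder's distribution).
    `propose(rng)` must return a pair (x, log q(x))."""
    rng = random.Random(seed)
    x, logq_x = propose(rng)
    for _ in range(n_steps):
        x_new, logq_new = propose(rng)
        # For target q^alpha with independent proposal q, the MH log-ratio
        # collapses to (alpha - 1) * (log q(x') - log q(x)).
        log_a = (alpha - 1.0) * (logq_new - logq_x)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x, logq_x = x_new, logq_new
    return x

def coin(rng):
    """Hypothetical two-outcome 'decoder': q(0) = 0.7, q(1) = 0.3."""
    return (0, math.log(0.7)) if rng.random() < 0.7 else (1, math.log(0.3))
```

With `alpha=2.0` the stationary distribution is proportional to (0.49, 0.09), i.e. roughly (0.845, 0.155): sharper than `q`, mirroring how tempering trades entropy for per-sample quality.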

Paper Structure

This paper contains 33 sections, 8 theorems, 58 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Assume the decoder is $(1-\delta)$-gated, the committed token at each step is sampled from the decoder's commit distribution, and the scoring model is the decoder itself, i.e., $p_{\mathrm{ref}} = p_\theta$. Then for any decoding trajectory $\sigma$, the per-token log generative perplexity is bounded above by $h_V(\delta)$, where $h_V(\delta) = h_b(\delta) + \delta \log(|V|-1)$ and $h_b(\delta) = -\delta \log \delta - (1-\delta)\log(1-\delta)$.
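The cap $h_V(\delta)$ is the maximum entropy of any distribution whose top entry carries probability at least $1-\delta$. A quick numerical check (the vocabulary size and probability vectors below are illustrative, not from the paper):

```python
import math

def h_b(delta):
    """Binary entropy (in nats)."""
    if delta in (0.0, 1.0):
        return 0.0
    return -delta * math.log(delta) - (1 - delta) * math.log(1 - delta)

def h_V(delta, vocab_size):
    """Entropy cap h_V(delta) = h_b(delta) + delta * log(|V| - 1)."""
    return h_b(delta) + delta * math.log(vocab_size - 1)

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)
```

The cap is attained exactly when the residual mass $\delta$ is spread uniformly over the other $|V|-1$ tokens; any other gated distribution falls strictly below it.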

Figures (4)

  • Figure 1: The quality--exploration dilemma in dLLM decoding. Confidence remasking achieves high sample quality (Pass@$1$) but plateaus Pass@$k$ quickly due to limited exploration. Conversely, random remasking promotes exploration but degrades individual sample quality. Our global tempering reconciles this trade-off, establishing a new Pareto frontier with superior Pass@$1$ and Pass@$16$ performance.
  • Figure 2: Pass@$\bm{k}$ scaling curves. (Left) Performance on MATH500, HumanEval, and MBPP for WeDLM-8B (top) and LLaDA-8B (bottom). (Right) WeDLM-8B on AIME'24/25.
  • Figure 3: Left: Quality-diversity trade-off on MATH500 (LLaDA) by sweeping temperature parameters. IMH strictly dominates local baselines. Middle: Pass@$32$ on AIME 2024 stratified by difficulty [sun2025climbing] (WeDLM-8B). IMH yields the largest gains on Hard problems. Right: Trajectory similarity matrix on AIME 2024. Local baselines collapse into similar paths, whereas IMH explores distinctly different reasoning strategies.
  • Figure 4: Batched IMH for one-token corrected sampling

Theorems & Definitions (13)

  • Definition 1: Confidence gating
  • Proposition 1: Generative perplexity upper bound under confidence gating
  • Proposition 2: Entropy cap under confidence gating
  • Proposition 3: Optimality of the power distribution
  • Proposition 4: Corrected conditional for global tempering
  • Proposition 1 (restated): Generative perplexity upper bound under confidence gating
  • Proof
  • Proposition 2 (restated): Entropy cap under confidence gating
  • Proof
  • Proposition 3 (restated): Optimality of the power distribution
  • ...and 3 more
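For reference, the standard variational argument behind an "optimality of the power distribution" statement can be sketched as follows; the notation ($\tau$ as a temperature, $p_\theta$ as the model distribution) is assumed, not taken from the paper's statement:

```latex
% Maximize expected log-quality plus tau-weighted entropy over
% distributions pi, with Z = \sum_x p_\theta(x)^{1/\tau}:
\begin{aligned}
J(\pi) &= \mathbb{E}_{x \sim \pi}\!\left[\log p_\theta(x)\right] + \tau H(\pi) \\
       &= -\tau \sum_x \pi(x)\left[\log \pi(x) - \tfrac{1}{\tau}\log p_\theta(x)\right] \\
       &= -\tau\, \mathrm{KL}\!\left(\pi \,\middle\|\, \tfrac{p_\theta^{1/\tau}}{Z}\right) + \tau \log Z .
\end{aligned}
% Since KL >= 0 with equality iff the arguments coincide, J is maximized
% exactly at pi(x) ∝ p_theta(x)^{1/tau}: a power (tempered) distribution,
% with tau trading off quality against exploration.
```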