Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models

Sanghyun Lee, Seungryong Kim, Jongho Park, Dongmin Park

TL;DR

This paper addresses the problem that decoding in Diffusion Language Models (DLMs) is highly sensitive to the unmasking order, so greedy, locally focused strategies can produce irrecoverable errors. It introduces LookUM, an unsupervised, path-based framework that couples a path generator with an uncertainty-based verifier to perform lookahead unmasking, guided by sequence-level certainty rather than local confidences. Across six benchmarks for mathematics, coding, and planning, LookUM yields consistent gains with only 2–4 additional inference paths, and it provides complementary benefits to RL-tuned training without requiring external reward models. The approach is robust across base and RL-tuned LLaDA models, scales with available compute, and establishes uncertainty-driven path selection as a practical, general mechanism for improving diffusion language models. This work thus broadens inference-time optimization for discrete diffusion and suggests promising future directions for leveraging more intrinsic signals for verification.

Abstract

Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference-time order of unmasking. Prevailing heuristics, such as confidence-based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence-level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself, showing that uncertainty-based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.
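
To make the generator/verifier split in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that the text above does not state: uncertainty is taken to be the mean Shannon entropy of the model's predicted token distributions, the importance weights are taken to be proportional to exp(-uncertainty / temperature), and the pool of unmasking sets is built from the top-confidence masked positions. The function names (`entropy`, `sequence_uncertainty`, `propose_unmasking_set`, `resample_paths`) and the `pool_size`, `set_size`, and `temperature` parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a single token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(dists):
    """Assumed verifier score: mean entropy over per-position distributions."""
    if not dists:
        return 0.0
    return sum(entropy(d) for d in dists) / len(dists)


def propose_unmasking_set(confidences, masked, pool_size, set_size, rng):
    """Assumed path generator step.

    Build a pool from the `pool_size` most confident masked positions,
    then sample `set_size` of them (without replacement) to unmask next.
    """
    pool = sorted(masked, key=lambda i: confidences[i], reverse=True)[:pool_size]
    return sorted(rng.sample(pool, min(set_size, len(pool))))


def resample_paths(candidates, uncertainties, k, temperature=1.0, rng=None):
    """Assumed verifier step: importance-resample k candidate paths.

    Weights are proportional to exp(-uncertainty / temperature), so paths
    with lower sequence-level uncertainty are more likely to survive.
    """
    rng = rng or random.Random(0)
    lowest = min(uncertainties)
    # Subtract the minimum before exponentiating for numerical stability.
    weights = [math.exp(-(u - lowest) / temperature) for u in uncertainties]
    return rng.choices(candidates, weights=weights, k=k)
```

In a full decoder, each surviving path would be extended by another call to the generator and rescored at the next denoising step. The sketch leaves out the model forward pass, which would supply `confidences` and the per-position distributions.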

Paper Structure

This paper contains 44 sections, 9 equations, 3 figures, 4 tables, 3 algorithms.

Figures (3)

  • Figure 1: Standard unmasking vs. LookUM in discrete diffusion models. During the denoising process of unmasking from timestep $T$ to $0$, greedy approaches often select the position with the highest token-level certainty, which can lead to an incorrect unmasking order and result in local errors (red). In contrast, LookUM generates candidate unmasking paths and leverages a verifier to select those that avoid local errors and recover the correct sequence (blue).
  • Figure 2: Local error comparison and example. (a) Sentence-level accuracy on GSM8K and MATH500, showing that our method achieves approximately 10$\%$ lower error rates than baselines. (b) Example of greedy unmasking producing a computational error (180 × 0.7 = 12.6), while our method generates the correct result (126).
  • Figure 3: Scaling results. Performance scaling with lookahead paths. Accuracy versus number of particle paths on three benchmarks (GSM8K, MATH500, Countdown). Sharp improvements occur up to 4 particles, after which performance saturates, demonstrating efficient scaling with limited computational budget.