Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang

TL;DR

This paper proposes Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. The results suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models.

Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.
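The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fill_up`, `refine`, `make_toy_predictor`), the equal-share unmasking schedule, the uniformly random choice of tokens to remask, and the toy predictor that stands in for a dLLM are all assumptions made for illustration.

```python
import math
import random

MASK = None  # hypothetical placeholder for a masked position


def fill_up(predict, length, steps):
    """Stage 1 (sketch): unmask progressively until every position holds a token."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        preds = predict(seq, masked)  # {position: (token, confidence)}
        # Assumed schedule: accept an equal share per step, most confident first.
        quota = math.ceil(len(masked) / (steps - step))
        for i in sorted(masked, key=lambda p: preds[p][1], reverse=True)[:quota]:
            seq[i] = preds[i][0]
    return seq


def refine(predict, seq, steps, remask_fraction, rng):
    """Stage 2 (sketch): remask a subset and re-predict it from the remaining tokens."""
    seq = list(seq)
    k = min(len(seq), max(1, round(remask_fraction * len(seq))))
    for _ in range(steps):
        # Assumption: the subset is drawn uniformly at random.
        targets = rng.sample(range(len(seq)), k)
        for i in targets:
            seq[i] = MASK
        preds = predict(seq, targets)  # unmasked tokens act as context
        for i in targets:
            seq[i] = preds[i][0]
    return seq


def make_toy_predictor(target):
    """Toy stand-in for a dLLM: correct only when the left neighbor is correct."""
    def predict(seq, positions):
        out = {}
        for i in positions:
            left_ok = i == 0 or seq[i - 1] == target[i - 1]
            out[i] = (target[i], 0.9) if left_ok else ("?", 0.5)
        return out
    return predict


target = list("finish first")
predict = make_toy_predictor(target)
draft = fill_up(predict, len(target), steps=3)
final = refine(predict, draft, steps=8, remask_fraction=0.25, rng=random.Random(0))
```

In this toy setup, the highly parallel fill-up stage commits tokens before their context is reliable, and the refinement stage gives each previously accepted token a chance to be re-predicted once its neighbors are in place. Because each token alternates between being a prediction target and being context, the loop mirrors the cross-validation idea in spirit only; the actual selection rule and schedules are defined in the paper.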

Paper Structure

This paper contains 39 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of Tolerator. Compared to the vanilla decoding strategy, we first fill the masked tokens with high parallelism and then iteratively refine the draft through token-level cross-validation. Here, cross-validation means tokens alternately act as the target and the context of prediction. This process allows previously accepted tokens to be revisited and corrected when necessary.
  • Figure 2: Performance-Efficiency Trade-Off for Different Decoding Methods. This figure illustrates the performance of different methods under varying parallel sizes. Gray bars represent generation throughput (tokens per second, TPS). Colored lines show average performance across five benchmarks as the number of forward steps $T$ varies.
  • Figure 3: Performance across different benchmarks for different decoding methods. This figure presents the performance of various methods on each benchmark. Colored bars represent average performance across different numbers of forward steps ($T$).
  • Figure 4: Ablation Studies of EoT Penalty. We fix the fill-up and refinement configurations while varying $\lambda_{\text{eot}}$ from 1.0 to 1.3, with results shown for $T = 32$ and $T = 128$ forward steps. Across most tasks, introducing an appropriate EoT penalty substantially improves generation quality. The precise numerical values are reported in the appendix on experimental details.
  • Figure 5: Output of Fill-Up Stage. We use colors fading from blue to red to show the order of decoding. Fill-up and refinement steps are both set to 16; special tokens such as [EoT] are not shown.