Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao

TL;DR

The paper tackles the quality-speed trade-off in Diffusion Large Language Models (DLLMs) by addressing the irreversibility of standard decoding. It introduces WINO, a training-free revokable decoding algorithm that drafts many tokens in parallel and then verifies them against the surrounding context, using a shadow block to re-mask and refine suspicious early tokens. On the open-source DLLMs LLaDA and MMaDA, WINO achieves speedups of 6x on GSM8K and 10x on Flickr30K captioning while also improving accuracy, across both language and multimodal tasks. The work suggests integrating revokable sampling into training as a future direction for further gains in performance and efficiency.

Abstract

Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs: once a token is unmasked it cannot be revised, so early errors accumulate in the context and easily steer decoding in the wrong direction. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Evaluated on open-source DLLMs such as LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority of WINO and provide an in-depth understanding of it.

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Demonstration of the speedup and performance improvement of WINO over standard decoding and naive parallel sampling, evaluated on GSM8K with LLaDA and Flickr30K with MMaDA. Standard decoding unmasks one token per decoding step, while naive parallel sampling unmasks $M$ ($M>1$) tokens per decoding step. We set $M=4$ for GSM8K and $M=8$ for Flickr30K.
  • Figure 2: (a) An overview of WINO. (b) Illustration of our designed attention mask. The green squares denote 1, the grey squares denote 0, and "Pos ID" is short for position ID. Verified tokens refer to tokens in the prompt $X$ or previously decoded blocks. Draft tokens denote tokens in the current block that are unmasked up to the current decoding step. [MASK] (shadow draft) refers to tokens in the shadow block whose position IDs correspond to the draft tokens, while [MASK] (shadow mask) refers to the remaining tokens in the shadow block.
  • Figure 3: Decoding steps of WINO on subsets of the MATH benchmark with varied difficulty levels.
  • Figure 4: Ablation study on the drafting threshold $\tau_1$ and the verification threshold $\tau_2$.
  • Figure 5: Case Study: GSM8K Example. We compare standard decoding with LLaDA against the intermediate and final results produced by WINO. More detailed case studies are provided in the appendix.
  • ...and 1 more figure