Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo

TL;DR

This paper tackles the long decoding-window bottleneck in diffusion language models (MDLM) for open-ended text generation. It introduces Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, and Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that reduces the model's preference for repetitive or high-prior tokens at distant positions. Together, Conv and R2FT yield state-of-the-art results among diffusion LMs on AlpacaEval with substantially fewer decoding steps, and they demonstrate notable speedups via EOS-fill and caching. The work preserves bidirectionality and shows robust gains across both small and large models, highlighting practical improvements for fast, fluent diffusion-based generation. One limitation is the limited evaluation on purely bidirectional downstream tasks, which motivates future exploration of bidirectional goal-oriented generation with diffusion LMs.

Abstract

Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions such as semi-autoregressive (semi-AR) decoding address this issue by splitting the window into blocks, which sacrifices bidirectionality, but we find that this also leads to a time-interval expansion problem, which sacrifices speed. Semi-AR therefore eliminates the main advantages of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from the context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with a significantly lower step size than previous works, demonstrating both speed and quality improvements.
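
To make the idea of narrowing the decoding window "without hard segmentation" more concrete, the following is a minimal sketch and not the paper's actual formula, which this summary does not state. It assumes a hypothetical normalizer `s_i` computed by convolving a box kernel over the mask of already-decoded positions, then uses it to rescale per-position confidence before choosing which masked positions to decode next. The function names, the box kernel, `kernel_size`, and the top-`k` selection rule are all illustrative assumptions.

```python
import numpy as np

def conv_normalizer(decoded_mask, kernel_size=5):
    """Hypothetical s_i: fraction of already-decoded positions in a window centred on i.

    Assumption: a uniform (box) kernel. The kernel used in the paper may differ.
    """
    kernel = np.ones(kernel_size) / kernel_size
    # mode="same" keeps the output aligned with the input positions
    # (zero padding at the sequence edges).
    return np.convolve(decoded_mask.astype(float), kernel, mode="same")

def pick_positions(confidence, decoded_mask, k=2, kernel_size=5):
    """Choose k masked positions to decode, favouring those near decoded context."""
    s = conv_normalizer(decoded_mask, kernel_size)
    score = confidence * s            # soft down-weighting, no hard block boundary
    score[decoded_mask] = -np.inf     # never re-select positions that are already decoded
    return np.argsort(-score)[:k]

# Toy example: 4 prompt tokens are decoded, 8 positions are still masked.
decoded = np.array([True] * 4 + [False] * 8)
conf = np.full(12, 0.1)
conf[4], conf[5], conf[10] = 0.6, 0.5, 0.9   # position 10 is confident but far from context

print(pick_positions(conf, decoded))          # positions adjacent to the context win
```

In this toy setting, picking by raw confidence alone would select the distant position 10, whereas the rescaled score prefers positions next to the decoded context. Because the weighting is continuous, the window moves with the decoded region rather than being fixed to block boundaries as in semi-AR decoding.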

Paper Structure

This paper contains 83 sections, 13 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: G-eval score and sampling speed on AlpacaEval. Ours achieves SOTA.
  • Figure 2: Candidate zone of the first inference step given the previous context “Q: Who is Abraham Lincoln? A:”. The left box shows positions close to the context (0 to 4), and the right box shows positions relatively far from the context (25 to 29). Darker red indicates higher model confidence. Informative tokens for the given question (e.g., "president", "United States") are outlined.
  • Figure 3: The x-axis is the distance from the instruction prompt; the y-axis is the summed probability of high-prior and repetitive tokens.
  • Figure 4: Perplexity (y-axis) of the text samples from pretrained MDLM, applying semi-AR decoding with different block sizes (x-axis) on fixed $L = 1024$. Each line corresponds to a fixed $S$.
  • Figure 5: Convolution normalizer ($s_i$) for each position across the decoding window, given bidirectional context.
  • ...and 10 more figures