Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo
TL;DR
This paper tackles the long decoding-window bottleneck that masked diffusion language models (MDLMs) face in open-ended text generation. It introduces Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, and Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that reduces the model's preference for repetitive or high-prior tokens at positions far from the context. Together, Conv and R2FT yield state-of-the-art results among diffusion LMs on AlpacaEval with substantially fewer decoding steps, and EOS-fill and caching provide further speedups. The approach preserves bidirectionality and shows gains on both small and large models. One limitation is that purely bidirectional downstream tasks are only lightly evaluated, which motivates future work on bidirectional, goal-oriented generation with diffusion LMs.
Abstract
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions such as semi-autoregressive (semi-AR) decoding address this issue by splitting the window into blocks, which sacrifices bidirectionality. We find that this also leads to a time-interval expansion problem, which sacrifices speed. Semi-AR therefore gives up the main advantages of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from the context. Our methods achieve state-of-the-art results among diffusion LM baselines on open-ended generation benchmarks (e.g., AlpacaEval), with a significantly lower step size than previous works, demonstrating improvements in both speed and quality.
