Sequential Diffusion Language Models
Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
TL;DR
Diffusion language models are constrained by fixed decoding lengths and poor compatibility with KV caching, which limits their practicality. The authors propose Next Sequence Prediction (NSP) to unify next-token and next-block generation, and develop Sequential Diffusion Language Models (SDLM) that retrofit pretrained autoregressive language models (ALMs) via parallel block training and dynamic, confidence-guided decoding. SDLM uses a longest-prefix decoding strategy and self-speculative verification to adapt output length while preserving KV-cache compatibility, achieving competitive performance with far fewer training samples and notable throughput gains. The approach scales to larger models (SDLM-32B) and shows a strong speed–quality trade-off across diverse benchmarks, underscoring the potential of NSP-based diffusion methods for efficient, scalable language modeling.
Abstract
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1× higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and code: https://github.com/OpenGVLab/SDLM
