Sequential Diffusion Language Models

Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

TL;DR

Diffusion language models are constrained by fixed decoding lengths and poor KV-cache compatibility, which limits their practicality. The authors propose Next Sequence Prediction (NSP) to unify next-token and next-block generation, and develop Sequential Diffusion Language Models (SDLM) that retrofit pretrained ALMs via parallel block training and dynamic, confidence-guided decoding. SDLM employs a longest-prefix decoding strategy and self-speculative verification to adapt output length while preserving KV-cache compatibility, achieving competitive performance with far fewer training samples and notable throughput gains. The approach scales effectively to larger models (SDLM-32B) and demonstrates a strong speed–quality trade-off across diverse benchmarks, underscoring the potential of NSP-based diffusion methods for efficient, scalable language modeling.

Abstract

Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose the Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1× higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and code: https://github.com/OpenGVLab/SDLM
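The confidence-guided decoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `longest_confident_prefix` and `mean_tokens_per_step`, the threshold name `tau`, the toy confidence values, and the rule that the first token of a block is always accepted are all assumptions made for illustration. Given per-position confidences for one block, the sketch keeps the longest contiguous prefix whose confidences stay at or above `tau`.

```python
from typing import Sequence


def longest_confident_prefix(confidences: Sequence[float], tau: float) -> int:
    """Return how many leading tokens of a block to accept in one step.

    Assumption: the first position is always accepted (it plays the role of
    the ordinary next-token prediction), so the result is at least 1.
    Later positions are accepted only while their confidence is >= tau.
    """
    if len(confidences) == 0:
        raise ValueError("a block must contain at least one position")
    accepted = 1
    for conf in confidences[1:]:
        if conf < tau:
            break  # stop at the first low-confidence position
        accepted += 1
    return accepted


def mean_tokens_per_step(step_confidences: Sequence[Sequence[float]], tau: float) -> float:
    """Average number of tokens accepted per forward pass over several steps."""
    lengths = [longest_confident_prefix(c, tau) for c in step_confidences]
    return sum(lengths) / len(lengths)


# Toy confidences for three decoding steps with block size 4 (made-up numbers).
steps = [
    [0.99, 0.95, 0.40, 0.97],
    [0.90, 0.30, 0.99, 0.99],
    [0.98, 0.97, 0.96, 0.92],
]
accepted = [longest_confident_prefix(c, tau=0.9) for c in steps]
print(accepted)  # [2, 1, 4]
```

In this sketch, a threshold above every possible confidence forces each step to accept exactly one token, which mirrors the statement that NSP reduces to next-token prediction when the length is fixed to 1. Lowering `tau` accepts longer prefixes per step, at the risk of accepting less reliable tokens.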

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of decoding paradigms. (a) ALMs: decode one token at a time. (b) DLMs (e.g. Block Diffusion): decode all tokens in a fixed block before moving to the next. (c) SDLM (Ours): dynamically predicts a contiguous subsequence within a fixed block. (d) Performance vs. Speed: MATH-500 results showing trade-off between speed (TPS) and accuracy.
  • Figure 2: Structured attention mask for parallel block training and sampling. (a) Reordered input yields a mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right). (b) Confidence-based next sequence prediction with KV reuse. A block of $D$ tokens is predicted with $D{-}1$ masks. The longest high-confidence subsequence is selected as dynamic output. Cached KV states enable efficient decoding.
  • Figure 3: Trade-off between performance and speed under different inference settings for SDLM-3B $(D=4)$ and SDLM-3B $(D=8)$. Adjusting $\tau$ allows a controllable trade-off between speed and performance. SpeedUp denotes the average number of tokens output per forward pass.
  • Figure 4: Ablation on attention mask type and prediction shift strategy. We conduct the following ablation experiments: (1) No shift: predicting $x_t$ instead of $x_{t+1}$; (2) Causal attention: using a causal mask instead. The left panel shows model performance, while the right panel shows the acceleration ratio.
  • Figure 5: Visualization of the sampling process. Each blue block indicates a subsequence generated in a single decoding step.
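The three mask regions named in the Figure 2(a) caption can be sketched for the simplest case of one prefix followed by one block. This is a hedged illustration only: the function name `block_attention_mask`, the single-block layout, and the convention that `True` means "query may attend to key" are assumptions, and the actual training mask over many reordered blocks is more involved.

```python
from typing import List


def block_attention_mask(prefix_len: int, block_size: int) -> List[List[bool]]:
    """Build a boolean mask of shape (n, n), with n = prefix_len + block_size.

    mask[q][k] is True when query position q may attend to key position k.
    - Prefix rows: causal (each prefix token sees itself and earlier tokens).
    - Block rows: see the whole prefix (bottom-left) and every position
      inside the block (bottom-right, bidirectional).
    """
    n = prefix_len + block_size
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < prefix_len:
                mask[q][k] = k <= q  # causal prefix, block is hidden
            else:
                mask[q][k] = True  # full prefix + bidirectional block
    return mask


# Two prefix tokens followed by a block of two positions.
mask = block_attention_mask(prefix_len=2, block_size=2)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

Because prefix rows never attend to block positions, the keys and values computed for the prefix do not depend on the block contents, which is what makes reuse of cached KV states possible in this layout.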