Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin

TL;DR

DiDi-Instruct targets the speed bottleneck in language generation by distilling a pre-trained discrete diffusion LLM (dLLM) into a few-step student. Training minimizes an integral KL divergence (IKL) that matches the student's marginal distributions to the teacher's across the diffusion timeline. The resulting gradient has a policy-gradient-like (score-function) form whose reward is supplied by a discriminative density-ratio estimator. Grouped reward normalization and intermediate-state matching stabilize training, and reward-guided ancestral sampling (RGAS) improves inference. On OpenWebText, the method reports state-of-the-art perplexities across 8–128 NFEs, with up to 64$\times$ acceleration and competitive zero-shot generalization. It scales to larger models (up to 424M parameters) while preserving entropy and achieving substantial quality gains. The framework also applies to protein sequence generation, suggesting broader utility for rapid, high-quality discrete sequence generation.

Abstract

Fast and high-quality language generation is the holy grail that people pursue in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained (masked) discrete diffusion language model (dLLM) and distills a few-step student for fast generation. The resulting DiDi-Instruct model achieves comparable or superior performance to its dLLM teacher and the GPT-2 baseline while enabling up to 64$\times$ acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler, which significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains come with a negligible entropy loss (around $1\%$) and reduce additional training wall-clock time by more than $20\times$ compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release both code and models at github.com/haoyangzheng-ai/didi-instruct.
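
The abstract names grouped reward normalization as a stabilizer but does not spell out its form here. One plausible reading, offered only as an assumption, is to standardize rewards within each group of samples to zero mean and unit variance before they weight the gradient. The function name `normalize_rewards`, the `eps` constant, and the choice of one group per row are illustrative and not taken from the paper.

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Standardize rewards within each group (one group per row).

    Assumption: "grouped reward normalization" is read here as per-group
    zero-mean, unit-variance scaling; the paper's exact definition may differ.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean(axis=-1, keepdims=True)
    return centered / (rewards.std(axis=-1, keepdims=True) + eps)

# Two groups of three rewards each; the second group is constant.
example = normalize_rewards([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
print(example)
```

Under this reading, a group whose rewards are all equal contributes no gradient signal, since every centered reward is zero.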

Paper Structure

This paper contains 37 sections, 3 theorems, 28 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Theorem 4.1

Let the objective $\mathcal{L}(\nu)$ be the weighted integral of KL divergences between student and teacher marginals, $\mathbf{q}_\nu$ and $\mathbf{q}_\theta$. The gradient of the objective admits: where the expectation over time $t$ is sampled from distribution $\pi(t)$, and $R(\mathbf{z}_t,t):=\log \mathbf{q}_\nu(\mathbf{z}_t,t) - \log \mathbf{q}_\theta(\mathbf{z}_t,t)$ denotes the reward (log
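
The identity behind this kind of result can be checked numerically in a toy setting. The sketch below is not the paper's theorem: it drops the time integral and the weighting by $\pi(t)$, and it assumes the student is a single softmax-parameterized categorical distribution. It verifies that the gradient of $\mathrm{KL}(\mathbf{q}_\nu \,\|\, \mathbf{q}_\theta)$ equals $\mathbb{E}_{\mathbf{z}\sim\mathbf{q}_\nu}[R(\mathbf{z})\,\nabla_\nu \log \mathbf{q}_\nu(\mathbf{z})]$ with $R = \log \mathbf{q}_\nu - \log \mathbf{q}_\theta$. All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def kl(logits, teacher):
    """KL(q_nu || q_theta) for a softmax-parameterized student."""
    q = softmax(logits)
    return float(np.sum(q * (np.log(q) - np.log(teacher))))

def score_function_grad(logits, teacher):
    """E_{z~q_nu}[R(z) * grad log q_nu(z)], summed exactly over all states z."""
    q = softmax(logits)
    reward = np.log(q) - np.log(teacher)   # R(z) = log q_nu(z) - log q_theta(z)
    score = np.eye(len(q)) - q             # row z: gradient of log q_nu(z) w.r.t. logits
    return (q * reward) @ score

def finite_diff_grad(logits, teacher, h=1e-6):
    """Central finite-difference gradient of the KL, used as a reference."""
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        step = np.zeros_like(logits)
        step[i] = h
        grad[i] = (kl(logits + step, teacher) - kl(logits - step, teacher)) / (2 * h)
    return grad

student_logits = np.array([0.2, -1.0, 0.5, 1.3])
teacher_probs = np.array([0.1, 0.2, 0.3, 0.4])
print(score_function_grad(student_logits, teacher_probs))
print(finite_diff_grad(student_logits, teacher_probs))
```

The two gradients agree because the extra term $\sum_{\mathbf{z}} \nabla_\nu \mathbf{q}_\nu(\mathbf{z})$ vanishes: the probabilities always sum to one.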

Figures (7)

  • Figure 1: Perplexity vs. NFEs.
  • Figure 2: The pipeline of DiDi-Instruct (Algorithm \ref{alg:dddi}). Given a fully masked input $\mathbf{z}_t$ ($t=1$), both the student $\mathbf{p}_\nu$ and the teacher $\mathbf{p}_\theta$ produce clean samples $\mathbf{x}$ and $\mathbf{x}'$, which are corrupted at $t_i\sim\pi(t)$ to form $\mathbf{z}_i$ and $\mathbf{z}_i'$. The discriminator $D_\lambda$ is trained to classify these outputs, while its reward signal (\ref{eq:reward:from:discriminator}) enables the gradient update (\ref{eq:ikl:discrete:objective}) for the student. The red line denotes the gradient flow for the student's update step, and the blue line represents the one for the auxiliary model's update step.
  • Figure 3: Perplexity versus latency trade-off, comparing RGAS against AS, the teacher model, and a GPT-2 baseline. The x-axis represents wall-clock latency in seconds per sequence (log scale), and the y-axis represents perplexity. Our method consistently achieves a superior efficiency frontier, reaching lower perplexity than AS at all latency points while approaching the quality of the teacher model with significantly less computational cost.
  • Figure 4: Scaling results for the 424M models. DiDi-Instruct significantly lowers PPL compared to the MDLM baseline across all NFEs.
  • Figure 5: pLDDT comparison between DiDi-Instruct and its teacher, DPLM-150M, across different sequence lengths ($L=100,200,300,400,500$) and NFEs. Our method consistently outperforms the teacher, achieving up to +10 pLDDT gains at shorter sequence lengths (e.g., $L=100$) and maintaining superior structural confidence across all lengths, even with substantially fewer sampling steps. Moreover, the distilled student exhibits more stable performance across NFEs, whereas the teacher shows larger variability as the number of steps increases.
  • ...and 2 more figures
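
The corruption step in the Figure 2 caption, where clean samples are corrupted at a sampled time $t_i$, can be sketched for a masked diffusion model as follows. This is a minimal illustration under stated assumptions: an absorbing mask state, a linear schedule in which each token is masked independently with probability $t$, and a uniform draw standing in for $\pi(t)$. The names `MASK_ID` and `corrupt` are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the absorbing [MASK] token

def corrupt(tokens, t, rng):
    """Mask each token independently with probability t.

    Assumption: linear schedule, so a token survives with probability 1 - t.
    """
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < t
    return np.where(masked, MASK_ID, tokens)

rng = np.random.default_rng(0)
clean = np.array([5, 7, 9, 11, 13, 17])
t_i = rng.uniform(0.0, 1.0)   # stand-in for sampling t_i from pi(t)
print(t_i, corrupt(clean, t_i, rng))
```

At $t=1$ every token is masked, which matches the fully masked input the caption describes; at $t=0$ the sequence is returned unchanged.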

Theorems & Definitions (7)

  • Theorem 4.1: Score-Function Identity
  • Lemma 4.2: Density Ratio Representation
  • proof: Proof of Theorem 4.1
  • Remark C.1
  • Lemma D.1: Density Ratio Representation, Restatement of Lemma 4.2
  • proof
  • Remark D.2
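
The density-ratio idea named in Lemma 4.2 follows a standard classifier-based argument, sketched below on a toy categorical example. It assumes the convention that the discriminator outputs the probability that a sample came from the student; under that convention the Bayes-optimal discriminator is $D^*(\mathbf{z}) = \mathbf{q}_\nu(\mathbf{z}) / (\mathbf{q}_\nu(\mathbf{z}) + \mathbf{q}_\theta(\mathbf{z}))$ and its logit recovers $\log \mathbf{q}_\nu - \log \mathbf{q}_\theta$. The paper's exact statement and sign convention may differ, and the function names are hypothetical.

```python
import numpy as np

def optimal_discriminator(q_student, q_teacher):
    """Bayes-optimal probability that a sample came from the student."""
    return q_student / (q_student + q_teacher)

def reward_from_discriminator(d, eps=1e-12):
    """Logit of the discriminator output, log(d / (1 - d))."""
    d = np.clip(d, eps, 1.0 - eps)   # guard against log(0)
    return np.log(d) - np.log1p(-d)

q_student = np.array([0.5, 0.3, 0.2])
q_teacher = np.array([0.2, 0.3, 0.5])
d_star = optimal_discriminator(q_student, q_teacher)
print(reward_from_discriminator(d_star))
print(np.log(q_student) - np.log(q_teacher))
```

In practice the discriminator is learned, so its logit only approximates the log density ratio; the equality above holds exactly only at the optimum.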