IDLM: Inverse-distilled Diffusion Language Models

David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin

TL;DR

This work extends Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting, and introduces gradient-stable relaxations to support effective training.

Abstract

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's entropy and generative perplexity.

Paper Structure (26 sections, 4 theorems, 66 equations, 5 figures, 1 table, 1 algorithm)

Key Result

Theorem 3.1

For the SEDD (\ref{eq:sedd loss}), MDLM (\ref{eq:mdlm loss}), and Duo (\ref{eq:duo loss}, in the limit as $\tau \to 0^{+}$) objectives, the IDLM loss $\mathcal{L}_{\text{IDLM}}(\theta)$ defined in equation \ref{eq:inverse discrete diffusion distillation} satisfies $\mathcal{L}_{\text{IDLM}}(\theta) \geq 0$ and achieves its minimum (zero) if and only if the model distribution $p_{\theta}(x_0)$ matches the target distribution $p^*(x_0)$.
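
A toy analogue may help build intuition for why a loss of this kind can be zero exactly when the two distributions match. The sketch below is not the paper's objective: it assumes a noise-free categorical setting and a plain cross-entropy loss, and the names `cross_entropy` and `toy_inverse_gap` are hypothetical. It compares the loss of a fixed "teacher" predictor on student samples with the best loss any predictor could reach on those samples. By Gibbs' inequality the difference is the KL divergence, which is non-negative and vanishes only when the student distribution equals the teacher's.

```python
import math

def cross_entropy(p, q):
    """Expected loss E_{x ~ p}[-log q(x)] for categorical distributions p and q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def toy_inverse_gap(p_student, p_teacher):
    """Teacher loss on student samples minus the best achievable loss.

    Assumption: this is a simplified stand-in, not the IDLM loss itself.
    The inner minimum over predictors is attained at q = p_student
    (Gibbs' inequality), so the gap equals KL(p_student || p_teacher).
    """
    teacher_loss = cross_entropy(p_student, p_teacher)
    best_fake_loss = cross_entropy(p_student, p_student)  # entropy of the student
    return teacher_loss - best_fake_loss

p_target = [0.5, 0.3, 0.2]                                   # toy target distribution
gap_matched = toy_inverse_gap(p_target, p_target)            # student equals target
gap_mismatched = toy_inverse_gap([0.2, 0.3, 0.5], p_target)  # student differs

print(gap_matched, gap_mismatched)
```

In this toy setting the "best achievable loss" plays the role that the auxiliary fake model plays in a nested optimization: it is fitted to the student's own samples, and the gap to the teacher's loss is what the student would minimize.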

Figures (5)

  • Figure 1: An overview of Diffusion Language Models (DLMs) (top) and the proposed Inverse-distilled Diffusion Language Models (IDLM) (bottom). DLMs train a denoising network $f^*$ using samples from the target data distribution $p^*(x_0)$. In contrast, our approach leverages inverse distillation: given $f^*$, it learns a model $p_{\theta}(x_0)$ that approximates the target data distribution $p^*(x_0)$ used to train $f^*$. While standard DLMs exhibit strong generative performance, they typically require a large number of sampling steps. IDLM instead trains a few-step generator that maintains generative quality while substantially accelerating inference.
  • Figure 2: Overview of the Inverse-distilled Diffusion Language Model (IDLM) Framework (Section \ref{sec:idlm}). The objective is to distill a pretrained Diffusion Language Model $f^*$ (blue), referred to as the teacher, into a few-step generator $G_{\theta}$ (red), referred to as the student. To achieve this, we extend the concept of Inverse Distillation, formulating the training procedure as a nested optimization problem. Specifically, an auxiliary diffusion model $\widehat{f}$ (green), termed the fake model, is first optimized with respect to the loss $\mathcal{L}_{\text{discr.}}(\widehat{f}, x_0)$ (see equation \ref{eq:update fake in practice}). Subsequently, the generator parameters are updated using our proposed IDLM objective, $\mathcal{L}_{\text{IDLM}}(\theta)$ (see equation \ref{eq:update generator in practice}). To further enhance generation quality, we incorporate a multistep distillation strategy (gray box) into the training pipeline (Section \ref{sec:technical aspects}). The complete procedure is summarized in Algorithm \ref{alg:idlm}.
  • Figure 3: IDLM-SEDD comparison with SEDD. Top: IDLM-SEDD (Ours) matches SEDD generation quality and diversity with far fewer steps, reducing the number of steps from $1024 \rightarrow 256$ ($4\times$). Bottom: GenPPL vs. sampling steps shows that IDLM-SEDD consistently outperforms the original SEDD model across all sampling steps, without significantly sacrificing entropy.
  • Figure 4: IDLM-MDLM comparison with MDLM and SDTT. Top: IDLM-MDLM (Ours) matches MDLM/SDTT generation quality and diversity with far fewer steps, reducing sampling steps for MDLM from $1024 \rightarrow 16$ ($64\times$). Bottom: GenPPL vs. sampling steps shows that IDLM-MDLM performs best in the low-step regime with high entropy.
  • Figure 5: IDLM-Duo/IDLM-DCD comparison with Duo and Duo-DCD. Top: Under both Ancestral ($^{\mathrm{a}}$) and Greedy-Tail ($^{\mathrm{g}}$) sampling, our distilled models maintain GenPPL and entropy while requiring significantly fewer sampling steps. Compared to the original Duo model, IDLM-Duo (Ours) reduces the steps from $1024$ to $16$ ($64\times$), and IDLM-DCD (Ours) further reduces them to $8$ steps ($128\times$) under Ancestral sampling and to $4$ steps ($256\times$) under Greedy-Tail sampling. Bottom: GenPPL vs. sampling steps shows that IDLM-DCD achieves the lowest GenPPL across all sampling steps under both samplers, while maintaining comparable entropy. IDLM-Duo also improves upon Duo and achieves metrics comparable to Duo-DCD under Ancestral sampling, but exhibits a modest GenPPL gap relative to Duo-DCD under Greedy-Tail sampling in the low-step regime.
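
The reduction factors quoted in the captions of Figures 3 to 5 follow directly from the ratio of teacher steps to student steps. The short check below recomputes them from the step counts given in those captions. The dictionary layout and the helper name `reduction_factor` are illustrative assumptions; the numbers are the ones stated above.

```python
# Step counts quoted in the captions of Figures 3-5: (teacher steps, student steps).
step_counts = {
    "IDLM-SEDD": (1024, 256),
    "IDLM-MDLM": (1024, 16),
    "IDLM-Duo": (1024, 16),
    "IDLM-DCD (Ancestral)": (1024, 8),
    "IDLM-DCD (Greedy-Tail)": (1024, 4),
}

def reduction_factor(teacher_steps, student_steps):
    """Return how many times fewer sampling steps the student uses."""
    return teacher_steps // student_steps

factors = {name: reduction_factor(t, s) for name, (t, s) in step_counts.items()}

for name, factor in factors.items():
    print(f"{name}: {factor}x fewer sampling steps")
```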

Theorems & Definitions (7)

  • Theorem 3.1: Unique solution
  • Theorem 3.1: Equivalence to the SEDD Loss
  • proof
  • Proposition 3.2
  • proof
  • Theorem 3.3
  • proof