IDLM: Inverse-distilled Diffusion Language Models

David Li; Nikita Gushchin; Dmitry Abulkhanov; Eric Moulines; Ivan Oseledets; Maxim Panov; Alexander Korotin

TL;DR

This work extends Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting and introduces gradient-stable relaxations to support effective training.

Abstract

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's entropy and generative perplexity.
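The two quality metrics named in the abstract can be sketched as follows. This is a minimal illustration under common conventions, not the paper's evaluation code: it assumes generative perplexity is the exponentiated mean negative log-likelihood of generated tokens under some external evaluator language model, and that entropy is the Shannon entropy of token frequencies within a generated sequence. The function names and the list-based inputs are hypothetical.

```python
import math
from collections import Counter


def generative_perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    The log-probabilities are assumed to come from an external evaluator
    language model scoring the generated text (an assumption, not a detail
    stated in the source).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def token_entropy(token_ids):
    """Shannon entropy (in nats) of the token frequencies in one sequence."""
    if not token_ids:
        raise ValueError("need at least one token")
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical evaluator scores and token ids, for illustration only.
    print(generative_perplexity([-1.2, -0.7, -2.3, -0.4]))
    print(token_entropy([5, 9, 5, 3, 9, 5]))
```

Reporting both together guards against a degenerate generator: repetitive text can score a low perplexity while its token entropy collapses.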

Paper Structure (26 sections, 4 theorems, 66 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Preliminaries
Diffusion Language Models
Distillation of Diffusion Models
Inverse-distilled Diffusion Language Models
Theoretical extension of Inverse Distillation
Practical extension of Inverse Distillation
Technical Aspects
Experiments
Discussion and Conclusion
Limitations and Future Work
Related Works
Acceleration of Discrete Diffusion Language Models
Continuous Relaxations and Flow Matching
Distillation of Diffusion Models
...and 11 more sections

Key Result

Theorem 3.1

For the SEDD, MDLM, and Duo (in the limit as $\tau \to 0^{+}$) objectives, the IDLM loss satisfies $\mathcal{L}_{\text{IDLM}}(\theta) \geq 0$ and achieves its minimum (zero) if and only if the model distribution matches the target distribution, $p_{\theta}(x_0) = p^*(x_0)$.
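One way to read this result is through the generic inverse-distillation template sketched below. The notation $\mathcal{L}_{\text{DLM}}(f, p)$, standing for a DLM training loss of a denoiser $f$ on data drawn from $p$, is an assumption introduced here for illustration; the paper's own equation may take a different form.

```latex
\mathcal{L}_{\text{IDLM}}(\theta)
  = \mathcal{L}_{\text{DLM}}(f^*, p_{\theta})
  - \min_{f} \mathcal{L}_{\text{DLM}}(f, p_{\theta})
  \;\geq\; 0,
\qquad
\mathcal{L}_{\text{IDLM}}(\theta) = 0
  \iff
  f^* \in \arg\min_{f} \mathcal{L}_{\text{DLM}}(f, p_{\theta}).
```

Under this template the gap is non-negative because $f^*$ is one admissible candidate in the inner minimization, and it vanishes exactly when the teacher is also optimal for data produced by $p_{\theta}$. A uniqueness result is then what excludes any $p_{\theta} \neq p^*(x_0)$ from satisfying that condition.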

Figures (5)

Figure 1: An overview of Diffusion Language Models (DLMs) (top) and the proposed Inverse-distilled Diffusion Language Models (IDLM) (bottom). DLMs train a denoising network $f^*$ using samples from the target data distribution $p^*(x_0)$. In contrast, our approach leverages inverse distillation: given $f^*$, it learns a model $p_{\theta}(x_0)$ that approximates the target data distribution $p^*(x_0)$ used to train $f^*$. While standard DLMs exhibit strong generative performance, they typically require a large number of sampling steps. IDLM instead trains a few-step generator that maintains generative quality while substantially accelerating inference.
Figure 2: Overview of the Inverse-distilled Diffusion Language Model (IDLM) framework (see the section Inverse-distilled Diffusion Language Models). The objective is to distill a pretrained Diffusion Language Model $f^*$ (blue), referred to as the teacher, into a few-step generator $G_{\theta}$ (red), referred to as the student. To achieve this, we extend the concept of Inverse Distillation, formulating the training procedure as a nested optimization problem. Specifically, an auxiliary diffusion model $\widehat{f}$ (green), termed the fake model, is first optimized with respect to the loss $\mathcal{L}_{\text{discr.}}(\widehat{f}, x_0)$ (see the paper's fake-model update equation). Subsequently, the generator parameters are updated using our proposed IDLM objective, $\mathcal{L}_{\text{IDLM}}(\theta)$ (see the paper's generator update equation). To further enhance generation quality, we incorporate a multistep distillation strategy (gray box) into the training pipeline (see the section Technical Aspects). The complete procedure is summarized in the paper's IDLM algorithm.
Figure 3: IDLM-SEDD comparison with SEDD. Top: IDLM-SEDD (Ours) matches SEDD generation quality and diversity with far fewer steps, reducing the number of steps from $1024$ to $256$ ($4\times$). Bottom: GenPPL vs. sampling steps shows that IDLM-SEDD consistently outperforms the original SEDD model across all sampling steps, without significantly sacrificing entropy.
Figure 4: IDLM-MDLM comparison with MDLM and SDTT. Top: IDLM-MDLM (Ours) matches MDLM/SDTT generation quality and diversity with far fewer steps, reducing sampling steps for MDLM from $1024$ to $16$ ($64\times$). Bottom: GenPPL vs. sampling steps shows that IDLM-MDLM performs best in the low-step regime with high entropy.
Figure 5: IDLM-Duo/IDLM-DCD comparison with Duo and Duo-DCD. Top: Under both Ancestral ($^{\mathrm{a}}$) and Greedy-Tail ($^{\mathrm{g}}$) sampling, our distilled models maintain GenPPL and entropy while requiring significantly fewer sampling steps. Compared to the original Duo model, IDLM-Duo (Ours) reduces the steps from $1024$ to $16$ ($64\times$), and IDLM-DCD (Ours) further reduces them to $8$ steps ($128\times$) under Ancestral sampling and to $4$ steps ($256\times$) under Greedy-Tail sampling. Bottom: GenPPL vs. sampling steps shows that IDLM-DCD achieves the lowest GenPPL across all sampling steps under both samplers, while maintaining comparable entropy. IDLM-Duo also improves upon Duo and achieves metrics comparable to Duo-DCD under Ancestral sampling, but exhibits a modest GenPPL gap relative to Duo-DCD under Greedy-Tail sampling in the low-step regime.
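The nested procedure described in the Figure 2 caption (fit a fake model to the current generator, then update the generator against the frozen teacher) can be illustrated on a deliberately tiny stand-in problem. The sketch below is not the paper's algorithm. It assumes a categorical "generator" with softmax logits, replaces the diffusion loss with plain cross-entropy (whose minimizer is the data distribution itself), uses exact expectations instead of sampling, and involves no discrete backpropagation or relaxations. All names and hyperparameters are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_inverse_distillation(teacher, outer_steps=3000, fake_steps=10,
                                   lr_fake=0.5, lr_gen=0.2):
    """Alternate fake-model and generator updates on a categorical toy.

    teacher: probabilities of the frozen teacher (plays the role of f*).
    Returns the generator distribution after training.
    """
    k = len(teacher)
    theta = [0.0] * k  # generator logits (student)
    phi = [0.0] * k    # fake-model logits
    log_teacher = [math.log(p) for p in teacher]

    for _ in range(outer_steps):
        p_gen = softmax(theta)

        # Inner problem: fit the fake model to the current generator by
        # gradient descent on cross-entropy; the gradient w.r.t. the fake
        # logits is (f_hat - p_gen).
        for _ in range(fake_steps):
            f_hat = softmax(phi)
            phi = [phi[j] - lr_fake * (f_hat[j] - p_gen[j]) for j in range(k)]

        # Outer problem: with teacher and fake model frozen, descend on
        # loss(teacher, p_gen) - loss(fake, p_gen)
        #   = sum_j p_gen[j] * (log f_hat[j] - log teacher[j]).
        f_hat = softmax(phi)
        cost = [math.log(f_hat[j]) - log_teacher[j] for j in range(k)]
        mean_cost = sum(p_gen[j] * cost[j] for j in range(k))
        theta = [theta[j] - lr_gen * p_gen[j] * (cost[j] - mean_cost)
                 for j in range(k)]

    return softmax(theta)


if __name__ == "__main__":
    print(train_toy_inverse_distillation([0.5, 0.3, 0.2]))
```

In this toy the gap between the two losses equals the KL divergence from the generator to the teacher once the fake model has caught up, so it is zero only when the generator reproduces the teacher's distribution. The actual IDLM setting differs in that the generator emits discrete token sequences, which is where the gradient-stable relaxations mentioned in the abstract come in.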

Theorems & Definitions (7)

Theorem 3.1: Unique solution
Theorem 3.1: Equivalence to the SEDD Loss
proof
Proposition 3.2
proof
Theorem 3.3
proof
