Diffusion Language Models Generation Can Be Halted Early

Sofia Maria Lo Cicero Vaina, Nikita Balagansky, Daniil Gavrilov

TL;DR

This work investigates speeding up diffusion-based language generation by applying adaptive early exiting to diffusion language models (DLMs). By estimating how complete a generation is and halting once it has converged, the approach fits more steps into a given time budget and reduces sampling time by up to 40% without compromising sample quality on several DLMs (DDLM, SSD, Plaid). The study analyzes multiple exit criteria (entropy, KL divergence, and token-switch patience) and shows that their effectiveness is model-dependent: DDLM shows the strongest gains, while Plaid is less responsive to adaptive exiting. The findings highlight the value of analyzing the generation dynamics of DLMs and offer practical guidance for accelerating diffusion-based text generation while informing future model design and evaluation.

Abstract

Diffusion language models (DLMs) are a promising avenue for text generation due to their practical properties for tractable controllable generation. They also have the advantage of not having to predict text autoregressively. However, despite these notable features, DLMs have not yet reached the performance levels of their autoregressive counterparts. One way to reduce the performance gap between these two types of language models is to speed up DLM generation. In this work, we therefore propose a novel methodology to address this issue. It enables the execution of more generation steps within a given time frame, leading to higher-quality outputs. Specifically, our methods estimate the completeness of a DLM's text generation and allow adaptive halting of the generation process. We evaluate our methods on Plaid, SSD, and CDCD DLMs and create a cohesive perspective on their generation workflows. Finally, we confirm that our methods allow halting these models and decrease the generation time by $10$-$40$\% without a drop in the quality of model samples.
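The adaptive halting described above can be illustrated with a small sketch. The TL;DR names three exit criteria (entropy, KL divergence, and token-switch patience); the Python below is one plausible reading of them, not the authors' implementation. The function names, the averaging over positions, the direction of the KL term, and the thresholds are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence

# trajectory[t][i] is the predicted token distribution at diffusion step t
# for sequence position i (hypothetical layout, chosen for this sketch).
Trajectory = List[List[List[float]]]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); q is clamped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def argmax_tokens(dists: Sequence[Sequence[float]]) -> List[int]:
    """Most likely token index at every position."""
    return [max(range(len(d)), key=d.__getitem__) for d in dists]


def early_exit_step(trajectory: Trajectory, criterion: str, threshold: float) -> Optional[int]:
    """Return the first step at which the criterion fires, or None if it never does.

    Assumed semantics (not confirmed by the source):
      "entropy":  mean per-position entropy drops below `threshold`
      "kl":       mean KL between current and previous step drops below `threshold`
      "patience": argmax tokens stay unchanged for `threshold` consecutive steps
    """
    if criterion not in ("entropy", "kl", "patience"):
        raise ValueError(f"unknown criterion: {criterion}")
    unchanged = 0
    for t, dists in enumerate(trajectory):
        if criterion == "entropy":
            if sum(entropy(d) for d in dists) / len(dists) < threshold:
                return t
        elif t > 0:
            prev = trajectory[t - 1]
            if criterion == "kl":
                score = sum(kl_divergence(d, q) for d, q in zip(dists, prev)) / len(dists)
                if score < threshold:
                    return t
            else:  # patience
                same = argmax_tokens(dists) == argmax_tokens(prev)
                unchanged = unchanged + 1 if same else 0
                if unchanged >= threshold:
                    return t
    return None
```

In a real sampler the check would run inside the generation loop, so that the loop breaks as soon as the criterion fires instead of scanning a stored trajectory; the stored-trajectory form is used here only to keep the sketch self-contained.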

Paper Structure (27 sections, 8 figures, 7 tables, 3 algorithms)

Figures (8)

  • Figure 1: (a) The number of token switches and (b) the entropy of $p({\bm{x}}|{\bm{X}}(t), t)$. Color represents the training step, while the x-axis is the diffusion generation step. The trained model reaches the minimum entropy value before the generation process ends, and the resulting samples remain unchanged. This result indicates the possibility of performing an Early Exit from DLM generation without losing the quality of samples. See Section \ref{section:emerging} for more details.
  • Figure 2: (a) The L2 norm of embeddings $||\hat{{\bm{X}}}_0||_2(t)$, (b) the L2 norm of embeddings $||{\bm{X}}||_2(t)$, (c) $\cos$ of the angle between the score estimate $\hat{{\bm{S}}}$ and the final score at the end of generation, and (d) $\cos$ of the angle between the embedding ${\bm{X}}$ and the final embedding at the end of generation. Color represents the training step, while the x-axis is the diffusion generation step. Beyond step 100, the score angle stops changing, suggesting that the model has determined the optimal direction for improving the embedding midway through generation. See Section \ref{section:emerging} for more details.
  • Figure 3: The L2 norm of embeddings $||{\bm{X}}||_2(t)$ during the generation process for different initial scales of $||{\bm{X}}||_2$ for DDLM. Color represents the initial noise scale, while the x-axis is the diffusion generation step. A lower initial noise scale allows $||{\bm{X}}||_2$ to reach its minimum faster, indicating that Early Exiting performance depends on the initial noise scale. See Section \ref{section:early_exit} for more details.
  • Figure 4: (a) Entropy, (b) unchanged step count, and (c) KL divergence, the quantities used by the different exit criteria, for DDLM, SSD, and Plaid. Generation is halted when the threshold values are met. DDLM reaches the threshold early on, while SSD does so later. The result indicates that DDLM could allow early stopping in text generation. SSD reaches a stopping point after 800 of the 1000 total steps. In contrast, Plaid's entropy decreases steadily while the other measures stay the same, hinting that it might not perform well with adaptive early stopping techniques (though it can still be halted at a fixed step). See Section \ref{section:method} for more details.
  • Figure 5: AR-NLL for the different exit criteria with (a) DDLM, (b) SSD, and (c) Plaid on 1k samples of the C4 validation set. DDLM can effectively use adaptive early exiting strategies after step $600$, with the KL criterion allowing an exit $50$ steps earlier than other criteria (including fixed step exit) without loss in quality. The SSD model benefits modestly, with early exits saving about $10$ steps compared to other criteria and occurring after the $850$-th step. Plaid does not benefit from adaptive exiting, with fixed criteria suggesting possible stops after step $900$ for computational efficiency. Overall, these approaches speed up text generation by up to $40$% for DDLM, $10$-$15$% for SSD, and $10$% for Plaid, enhancing generation speed or sample quality. Despite differences in exit strategies, sample diversity remains unaffected, as indicated in Figure \ref{fig:uniq}. See Section \ref{section:dots} for more details.
  • ...and 3 more figures
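
The speed-ups quoted in the Figure 5 caption follow from simple arithmetic on the exit steps. The sketch below reproduces that arithmetic under three assumptions that the source does not state for every model: each model runs 1000 generation steps in total (the Figure 4 caption gives this figure only for SSD), every step costs the same time, and evaluating the exit criterion adds negligible overhead. The function name `fraction_saved` is hypothetical.

```python
def fraction_saved(exit_step: int, total_steps: int = 1000) -> float:
    """Fraction of sampling time saved by halting at `exit_step`.

    Assumes uniform per-step cost and a free exit check (illustrative only).
    """
    if not 0 < exit_step <= total_steps:
        raise ValueError("exit_step must lie in (0, total_steps]")
    return (total_steps - exit_step) / total_steps


# Exit steps taken from the Figure 5 caption; total of 1000 steps is assumed.
exit_steps = {"DDLM": 600, "SSD": 850, "Plaid": 900}

for model, step in exit_steps.items():
    print(f"{model}: exit at step {step}, about {fraction_saved(step):.0%} saved")
```

Under these assumptions, exiting at steps 600, 850, and 900 corresponds to savings of 40%, 15%, and 10%, which matches the ranges reported in the caption.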