Diffusion Language Models Are Natively Length-Aware

Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger, Dirk Hovy

TL;DR

This work conjectures that the latent prompt representation contains sufficient information to estimate the required output length, and proposes a zero-shot mechanism that dynamically crops the context window before generation begins, leading to fewer diffusion steps and substantial computational savings.

Abstract

Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
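
To make the cropping idea concrete, here is a minimal sketch. The abstract does not say how the length estimate is read off the model, so everything below is an assumption for illustration rather than the authors' method: it supposes that one forward pass over the full masked canvas yields a per-position EoS probability, and that the response length is taken as the first position whose probability reaches a threshold `tau`. The names `estimate_length`, `crop_canvas`, `eos_probs`, and `mask_id` are hypothetical.

```python
from typing import List, Sequence


def estimate_length(eos_probs: Sequence[float], tau: float = 0.9) -> int:
    """Return how many response positions to keep.

    Assumed rule (not taken from the paper): keep everything up to and
    including the first position whose EoS probability is at least tau.
    If no position qualifies, keep the full canvas.
    """
    for i, p in enumerate(eos_probs):
        if p >= tau:
            return i + 1
    return len(eos_probs)


def crop_canvas(
    prompt_ids: List[int],
    eos_probs: Sequence[float],
    mask_id: int,
    tau: float = 0.9,
) -> List[int]:
    """Build the cropped input: the prompt followed by only as many
    masked response slots as the estimated length requires."""
    keep = estimate_length(eos_probs, tau)
    return list(prompt_ids) + [mask_id] * keep


# Toy example: a 4-slot canvas whose third slot is confidently EoS.
cropped = crop_canvas([5, 6], [0.0, 0.1, 0.95, 0.99], mask_id=-1)
```

Under these assumptions, denoising would then run on the cropped sequence (prompt plus the estimated number of response slots) instead of the prompt plus the full maximum-length canvas, which is where the compute savings described in the abstract would come from.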

Paper Structure (36 sections, 4 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Predicted Length Distributions. Our SmartCrop ($\tau = 0.9$) method successfully predicts task-specific output lengths across four benchmark datasets. The abrupt truncations observed in certain distributions correspond to context length constraints (refer to the experiments section for details).
  • Figure 2: Sensitivity of IfEval Performance to Context Length Perturbations. We analyze the robustness of SmartCrop ($\tau=0.9$) by shifting the predicted length $\hat{L}$ by a deviation factor $\delta \in [-50\%, +50\%]$. The blue curve shows the model performance (mean $\pm$ 95% CI) across these varying context lengths. The red line denotes the control baseline, where lengths are sampled from the empirical length distribution of other benchmarks. The green line represents the Full Context baseline performance. While the model is relatively robust to moderate under-estimation (negative $\delta$), generation quality degrades as superfluous padding is reintroduced (positive $\delta$), eventually converging toward the baseline.
  • Figure 3: Predicted Length Invariance (HumanEval). Left: kernel density estimate of predicted new tokens for $L_{\text{new}} \in \{512,1024,2048,4096\}$. Right: boxplots of the same values. The bulk of the distribution is comparatively stable across $L_{\text{new}}$, with the main visible difference being a stronger right truncation when $L_{\text{new}}=512$, which is expected when the required completion length approaches the canvas limit. Note: $L_{\text{new}}=256$ causes the predicted length distribution to be heavily truncated and uninformative compared to larger canvases.
  • Figure 4: Predicted Length Invariance (IfEval). Left: kernel density estimate of predicted new tokens for $L_{\text{new}} \in \{256,512,1024,2048,4096\}$. Right: boxplots of the same values. The central mass of the predicted-length distribution (roughly 50--150 new tokens) is broadly consistent across $L_{\text{new}}$, while larger canvases primarily increase the range of rare long-length outliers.
  • Figure 5: Predicted Length Invariance (LongFormQA). Left: kernel density estimate of predicted new tokens for $L_{\text{new}} \in \{256,512,1024,2048,4096\}$. Right: boxplots of the same values. The predicted length is close to invariant across $L_{\text{new}}$ for the typical range of outputs, with only small shifts in the median and dispersion. This supports the claim in the main text that, for LongFormQA, the model's inferred length prior is largely insensitive to the particular (potentially conservative) initial canvas size used for the first forward pass.
  • ...and 5 more figures
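
The length perturbation described in the Figure 2 caption can be sketched as follows. The caption only states that the predicted length $\hat{L}$ is shifted by a deviation factor $\delta \in [-50\%, +50\%]$; the multiplicative form $\hat{L}(1+\delta)$, the rounding, and the clipping to the range $[1, L_{\text{new}}]$ are assumptions made for this illustration, and `perturb_length` is a hypothetical helper name.

```python
def perturb_length(pred_len: int, delta: float, canvas_len: int) -> int:
    """Shift a predicted length by a relative deviation delta.

    Assumptions (not from the source): the shift is multiplicative,
    the result is rounded to the nearest integer, and it is clipped
    so that it stays between 1 and the full canvas length.
    """
    if not -0.5 <= delta <= 0.5:
        raise ValueError("delta is expected to lie in [-0.5, 0.5]")
    shifted = round(pred_len * (1.0 + delta))
    return max(1, min(canvas_len, shifted))


# Under-estimation, no shift, and over-estimation for a predicted length of 100.
lengths = [perturb_length(100, d, canvas_len=4096) for d in (-0.5, 0.0, 0.5)]
```

With this reading, negative $\delta$ shrinks the cropped window below the prediction, while positive $\delta$ reintroduces padding slots, up to the full canvas.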