Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

Woojin Kim, Jaeyoung Do

TL;DR

This work identifies update-forgetting as a key bottleneck in diffusion language models, where uniform, context-agnostic updates erase previously applied classifier-guided edits and degrade fluency. It introduces Token Timestep Allocation (TTA-Diffusion), an inference-time framework that assigns per-token timesteps to implement soft, semantic token ordering, thereby preserving edits while reducing unnecessary updates. The method includes both fixed and adaptive schedules, with adaptive allocation guided by classifier gradients and a smoothing mechanism to balance stability and fluency. Empirical results on detoxification and sentiment control show substantial gains in controllability and fluency with significantly fewer diffusion steps, and the approach generalizes across discrete, continuous, and progressive-step-reduction settings. Overall, TTA-Diffusion offers a principled, inference-time mechanism to enforce token-level ordering and improve efficiency in diffusion-based text generation.

Abstract

While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity while using less than one-fifth of the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
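
As a rough illustration of the per-token timestep idea described in the abstract, the sketch below builds three schedules: a shared (default) timestep, a linear schedule, and an adaptive schedule in which tokens with larger classifier-gradient magnitudes receive smaller timesteps. The function names, the direction of the linear schedule, and the normalization used in the adaptive case are all assumptions made for illustration; they are not taken from the paper, and the paper's smoothing mechanism is not modeled here.

```python
from typing import List


def uniform_schedule(t: int, num_tokens: int) -> List[int]:
    """Default behavior: every token shares the same timestep t."""
    return [t] * num_tokens


def linear_schedule(t: int, num_tokens: int, min_t: int = 0) -> List[int]:
    """Timesteps fall evenly from t (first token) to min_t (last token).

    The direction (first token noisiest) is an assumption for illustration.
    """
    if num_tokens == 1:
        return [t]
    step = (t - min_t) / (num_tokens - 1)
    return [round(t - step * i) for i in range(num_tokens)]


def adaptive_schedule(t: int, grad_norms: List[float], min_t: int = 0) -> List[int]:
    """Assign smaller timesteps to tokens with larger gradient magnitude.

    Hypothetical rule: t_i = min_t + (t - min_t) * (1 - g_i / max(g)).
    Tokens with the largest gradient are effectively frozen at min_t.
    """
    max_g = max(grad_norms)
    if max_g <= 0:
        # No usable signal: fall back to the shared timestep.
        return uniform_schedule(t, len(grad_norms))
    return [round(min_t + (t - min_t) * (1 - g / max_g)) for g in grad_norms]


if __name__ == "__main__":
    print(uniform_schedule(100, 4))
    print(linear_schedule(100, 5))
    print(adaptive_schedule(100, [0.0, 0.5, 1.0, 0.25]))
```

In this sketch a token with timestep `min_t` receives essentially no further noise, which is one simple way to read "critical tokens are frozen early"; a real sampler would feed these per-token timesteps to the denoiser in place of a single scalar timestep.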

Paper Structure

This paper contains 90 sections, 55 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Illustration of update-forgetting. Left: Classifier-guided semantic edits (e.g., love) can be inadvertently overwritten in later denoising steps (e.g., hate). Right: This example from our experiments shows positive sentiment tokens being reversed (clean$\rightarrow$dirty), undermining control; with allocation, the edits are preserved.
  • Figure 2: Fluctuation vs. perplexity across timesteps. At each timestep $t$, samples are grouped by fluctuation ratio, showing that higher fluctuation is consistently associated with higher perplexity.
  • Figure 3: Classifier confidence drop due to update-forgetting.
  • Figure 4: A comparison of token timestep allocation strategies. The left panel illustrates a default schedule in which all tokens share the same timestep. The middle panel depicts a linear schedule where timesteps decrease uniformly across tokens, allowing gradual denoising. The right panel demonstrates the adaptive schedule, in which critical tokens with high gradient values are assigned smaller timesteps, preserving important updates while refining less significant tokens.
  • Figure 5: Comparison of constant and adaptive allocation on (a) fluctuation ratio and (b) key-token change ratio.
  • ...and 3 more figures
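
The captions for Figures 2 and 5 refer to a fluctuation ratio and a key-token change ratio without defining them on this page. A minimal sketch, assuming the fluctuation ratio is the fraction of positions whose token changes between two consecutive denoising steps and the key-token change ratio is the same quantity restricted to a set of designated positions, might look like this. Both definitions and all function names are assumptions for illustration, not the paper's formal definitions.

```python
from typing import List, Sequence


def fluctuation_ratio(prev_tokens: Sequence[str], curr_tokens: Sequence[str]) -> float:
    """Fraction of positions whose token differs between two steps (assumed definition)."""
    if len(prev_tokens) != len(curr_tokens):
        raise ValueError("sequences must have the same length")
    if not prev_tokens:
        return 0.0
    changed = sum(1 for p, c in zip(prev_tokens, curr_tokens) if p != c)
    return changed / len(prev_tokens)


def key_token_change_ratio(
    prev_tokens: Sequence[str],
    curr_tokens: Sequence[str],
    key_positions: List[int],
) -> float:
    """Fraction of designated key positions whose token changed (assumed definition)."""
    if len(prev_tokens) != len(curr_tokens):
        raise ValueError("sequences must have the same length")
    if not key_positions:
        return 0.0
    changed = sum(1 for i in key_positions if prev_tokens[i] != curr_tokens[i])
    return changed / len(key_positions)


if __name__ == "__main__":
    before = ["I", "love", "this"]
    after = ["I", "hate", "this"]
    print(fluctuation_ratio(before, after))
    print(key_token_change_ratio(before, after, [1]))
```

The example mirrors the love/hate overwrite in the Figure 1 caption: one of three tokens changes overall, but the single edited (key) token is lost entirely, which is the pattern the paper calls update-forgetting.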

Theorems & Definitions (2)

  • Definition 1
  • Definition 2