Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation
Woojin Kim, Jaeyoung Do
TL;DR
This work identifies update forgetting as a key bottleneck in diffusion language models: uniform, context-agnostic updates erase previously applied classifier-guided edits and degrade fluency. It introduces Token Timestep Allocation (TTA-Diffusion), an inference-time framework that assigns per-token timesteps to implement soft, semantic token ordering, thereby preserving edits while reducing unnecessary updates. The method includes both fixed and adaptive schedules, with adaptive allocation guided by classifier gradients and a smoothing mechanism to balance stability and fluency. Empirical results on detoxification and sentiment control show substantial gains in controllability and fluency with significantly fewer diffusion steps, and the approach generalizes across discrete, continuous, and progressive-step-reduction settings. Overall, TTA-Diffusion offers a principled, inference-time mechanism to enforce token-level ordering and improve efficiency in diffusion-based text generation.
Abstract
While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity while using less than one-fifth of the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
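As a rough illustration of what a per-token timestep schedule could look like, the following minimal sketch maps a per-token score to a per-token timestep. It is not the authors' implementation. The function name `allocate_token_timesteps`, the `scores` input (imagined here as something like a normalized classifier-gradient magnitude or uncertainty estimate), the linear score-to-timestep mapping, and the moving-average smoothing are all assumptions made for illustration. Under these assumptions, tokens with low scores receive a timestep near zero (effectively frozen), while tokens with high scores keep a timestep near the global one and continue to be refined.

```python
def allocate_token_timesteps(global_t, scores, window=1):
    """Hypothetical sketch: map per-token scores to per-token timesteps.

    global_t: current global diffusion timestep (non-negative int).
    scores:   one number per token; higher means "still needs refinement".
              (Assumed input, e.g. derived from a task signal.)
    window:   half-width of a moving average that smooths neighboring tokens.
    Returns a list of ints in [0, global_t], one per token.
    """
    if global_t < 0 or window < 0:
        raise ValueError("global_t and window must be non-negative")
    if not scores:
        return []

    # Min-max normalize scores to [0, 1]; if all scores are equal,
    # fall back to a uniform schedule (every token gets global_t).
    lo, hi = min(scores), max(scores)
    span = hi - lo
    norm = [(s - lo) / span if span > 0 else 1.0 for s in scores]

    # Moving-average smoothing so adjacent tokens get similar timesteps.
    n = len(norm)
    smoothed = []
    for i in range(n):
        left = max(0, i - window)
        right = min(n, i + window + 1)
        smoothed.append(sum(norm[left:right]) / (right - left))

    # Linear mapping (an assumption): low score -> near 0 (frozen),
    # high score -> near global_t (continued refinement).
    return [round(global_t * v) for v in smoothed]


if __name__ == "__main__":
    # One token flagged as needing refinement, the rest treated as settled.
    print(allocate_token_timesteps(100, [0.0, 0.0, 1.0, 0.0, 0.0], window=0))
```

With `window=0` the schedule is sharp, so only the flagged token keeps a nonzero timestep. A larger `window` spreads part of that timestep to its neighbors, which is one simple way to trade stability against local fluency in this sketch.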
