Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen

TL;DR

This work reveals temporal oscillation in diffusion language models, where correct intermediate outputs are overwritten during later denoising steps. It introduces two tools to exploit temporal dynamics: Temporal Self-Consistency Voting, a training-free test-time decoding strategy that aggregates predictions across denoising steps, and Temporal Consistency Reinforcement, a post-training RL approach that uses Temporal Semantic Entropy (TSE) as a self-supervised reward. The methods yield consistent gains across multiple math benchmarks, most notably on Countdown: an average improvement of 24.7% with the negative TSE reward alone and an absolute gain of 25.3% when it is combined with an accuracy reward. By treating intermediate denoising steps as signal rather than noise, the paper demonstrates a practical path to more reliable diffusion-based text generation and motivates further exploration of temporal stability in dLLMs.

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten in later ones. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) Temporal Consistency Reinforcement, a post-training method that uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe an average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
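
To make the voting idea concrete, the following is a minimal sketch of weighted voting over the answers parsed at each denoising step. It is an illustration under stated assumptions, not the paper's implementation: the function name `temporal_vote`, the use of exact string match to group answers, and the weight form exp(alpha * (t/T - 1)), which favors later steps, are all hypothetical choices made here.

```python
import math
from collections import defaultdict


def temporal_vote(step_answers, alpha=1.0):
    """Return the answer with the largest weighted vote across steps.

    step_answers: answers parsed at each denoising step, in step order;
                  None marks a step where no answer could be parsed.
    alpha:        assumed weighting strength; 0 gives uniform voting,
                  larger values concentrate weight on the last steps.
    """
    total_steps = len(step_answers)
    scores = defaultdict(float)
    for t, answer in enumerate(step_answers, start=1):
        if answer is None:
            continue
        # Assumed exponential weighting: the final step gets weight 1,
        # earlier steps get exponentially smaller weights.
        scores[answer] += math.exp(alpha * (t / total_steps - 1.0))
    if not scores:
        return None
    return max(scores, key=scores.get)


# A trajectory in which a correct intermediate answer is later overwritten.
trajectory = [None, "25", "25", "25", "2"]
voted = temporal_vote(trajectory, alpha=1.0)
final_only = trajectory[-1]
```

In this toy trajectory the vote recovers the intermediate answer "25", whereas taking only the last step returns "2". With a very large `alpha` the vote approaches final-step decoding, which is why the weighting strength is worth tuning.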

Paper Structure

This paper contains 35 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Illustration of temporal oscillation during sampling. (a) Across four datasets, a significant gap is observed between the final answer's pass rate (denoted as $\operatorname{Pass}@1$) and the ever-pass rate at any intermediate step (denoted as $\operatorname{EverPass}@1 \mid t$). This gap reveals the phenomenon we refer to as temporal oscillation, where correct intermediate answers are sometimes overwritten as the generation proceeds. (b) Example of temporal oscillation: For a given math problem, the model initially gives the correct answer, 25, at an intermediate step (e.g., step 55), aligning with the ground truth. However, by the final step, this correct answer is replaced with an incorrect one: 2.
  • Figure 2: Patterns of accuracy evolution over diffusion sampling steps. Responses of length 128 are generated with 64 steps using LLaDA-8B-Instruct. Left: Accuracy generally rises with more steps across datasets; SVAMP starts high, while harder ones like Countdown start low but improve steadily. Middle/Right: We compare the final pass rate, $\operatorname{Pass}@1$, with cumulative $\operatorname{EverPass}@1 \mid t$ over steps. A clear gap persists between them, shown by the green shaded area.
  • Figure 3: Patterns of entropy evolution over diffusion sampling steps. Responses are generated with length 128 using 64 diffusion steps from the LLaDA-8B-Instruct model. Left: Average token-level entropy decreases steadily during sampling. GSM8K shows lower entropy than Countdown, aligning with its higher accuracy. Middle and Right: Both Intermediate-Correct and Always-Incorrect questions exhibit higher overall entropy compared to Finally-Correct ones. On GSM8K, Intermediate-Correct questions display lower entropy in the early steps than Always-Incorrect, indicating initial confidence, whereas on Countdown the entropy trend is less stable.
  • Figure 4: Temporal semantic entropy across four benchmarks. This metric measures the uncertainty in the semantic content of answers across decoding steps. Statistically, correctly answered questions exhibit lower entropy.
  • Figure 5: (a) Ablations on $\alpha$ value selection in temporal voting with exponential weighting. (b) Negative temporal semantic entropy reward curve during reinforcement fine-tuning.
  • ...and 5 more figures
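
The quantities named in the captions above can be sketched in a few lines. This is a simplified illustration, not the paper's code: the helper names are hypothetical, semantic equivalence is approximated by exact match of the parsed answer strings, and returning 0.0 when no answer can be parsed is a choice made here for convenience.

```python
import math
from collections import Counter


def final_pass(step_answers, truth):
    """Pass@1 for one question: is the final-step answer correct?"""
    return bool(step_answers) and step_answers[-1] == truth


def ever_pass(step_answers, truth, t=None):
    """EverPass@1 | t for one question: was any answer up to step t correct?

    t counts steps from 1; None means all steps are considered.
    """
    upto = step_answers if t is None else step_answers[:t]
    return any(answer == truth for answer in upto)


def temporal_semantic_entropy(step_answers):
    """Entropy of the answer distribution across denoising steps.

    Answers are grouped by exact match (an assumed stand-in for semantic
    clustering). A trajectory that settles on one answer scores 0.
    """
    answers = [a for a in step_answers if a is not None]
    if not answers:
        return 0.0
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


stable = ["25", "25", "25", "25"]
oscillating = ["25", "2", "25", "2"]
tse_stable = temporal_semantic_entropy(stable)
tse_oscillating = temporal_semantic_entropy(oscillating)
```

A question contributes to the gap between the two pass rates when `ever_pass` is true but `final_pass` is false. Under these assumptions, negating `temporal_semantic_entropy` gives a reward that is highest for trajectories whose answers do not change across steps.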