Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones

Ranfei Chen, Ming Chen, Kaifei Wang

TL;DR

This work identifies non-uniform, dynamic zones of confusion in diffusion language model reasoning and shows that uniform gradient budgeting across denoising steps is suboptimal. It introduces ATPO, a lightweight adaptive step-selection framework that uses batch-averaged Rate of Entropy Change (RoEC) and Confidence-Margin (CM) signals to allocate gradients to high-leverage steps without modifying the RL objective or compute budget. Across multiple reasoning benchmarks, ATPO improves final accuracy and training stability, outperforms several trajectory-based baselines, and is compatible with existing RL methods. By leveraging trajectory dynamics, ATPO provides a principled and practical path toward more reliable and efficient RL for diffusion-based language models.

Abstract

Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
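
To make the step-level metrics concrete, the following is a minimal sketch of how entropy, CM, RoEC, and a hybrid step-selection rule could be computed. The exact definitions are not given in this overview, so everything here is an assumption for illustration: entropy is taken as mean token entropy per step, CM as the mean gap between the top-1 and top-2 probabilities, RoEC as the absolute entropy change from the previous step, and the hybrid score as the sum of min-max-normalized, batch-averaged RoEC and negated CM. The function names (`step_entropy`, `step_margin`, `select_steps`) are hypothetical and not taken from the paper.

```python
import math

def step_entropy(step_probs):
    """Mean token entropy (nats) over the positions of one denoising step."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in step_probs]
    return sum(ents) / len(ents)

def step_margin(step_probs):
    """Mean top-1 minus top-2 probability over positions; low means uncertain."""
    margins = []
    for dist in step_probs:
        top1, top2 = sorted(dist, reverse=True)[:2]
        margins.append(top1 - top2)
    return sum(margins) / len(margins)

def batch_mean(per_sample):
    """Average per-step values across the batch (list of lists -> list)."""
    n = len(per_sample)
    return [sum(col) / n for col in zip(*per_sample)]

def minmax(xs):
    """Rescale values to [0, 1]; all zeros if the values are constant."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def select_steps(batch, k):
    """Return the k step indices with the highest assumed hybrid score.

    batch[i][t][j] is the predicted distribution of sample i at denoising
    step t for token position j.
    """
    entropy = [[step_entropy(s) for s in traj] for traj in batch]
    # Assumed RoEC: absolute entropy change from the previous step (0 at step 0).
    roec = [[0.0] + [abs(b - a) for a, b in zip(h, h[1:])] for h in entropy]
    margin = [[step_margin(s) for s in traj] for traj in batch]
    roec_avg = batch_mean(roec)
    # Negate the margin so that a small margin gives a high uncertainty score.
    uncert_avg = [-m for m in batch_mean(margin)]
    score = [r + u for r, u in zip(minmax(roec_avg), minmax(uncert_avg))]
    ranked = sorted(range(len(score)), key=lambda t: score[t], reverse=True)
    return sorted(ranked[:k])

if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # confident prediction
    uniform = [0.25] * 4                # maximally uncertain prediction
    traj = [[peaked] * 3 for _ in range(6)]
    traj[3] = [uniform] * 3             # a single confused step
    print(select_steps([traj, traj], k=2))
```

In this toy batch only step 3 is uncertain, so the sketch ranks it first. The step right after it also scores above the stable steps because entropy changes sharply there. In an actual training loop the selected indices would determine which denoising steps receive policy-gradient updates, with the RL objective itself left unchanged.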

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Static uncertainty metrics during denoising on GSM8K. The top subplot shows the average Entropy-based Uncertainty, and the bottom subplot shows the average Confidence Margin (CM), both plotted over diffusion steps. Color distinguishes correct (blue) vs. incorrect (orange) samples, and line style marks an early phase of training (after 100 updates; solid) vs. a late phase of training (after 2000 updates; dashed). These results correspond to the "Static Uncertainty Metrics" analysis in the main text.
  • Figure 2: Comparison of entropy curves under different step selection strategies. Uniform step selection (green) results in a smooth but potentially less informative trajectory. Strategies based on Entropy-based Uncertainty and CM are more sensitive to the sharp changes in the early-to-mid diffusion steps, allowing the model to focus on these critical phases.
  • Figure 3: Checkpoint accuracy over training steps under four step selection strategies: uniform subsampling, ATPO with RoEC-only step selection, ATPO with CM-only step selection, and ATPO with hybrid RoEC+CM step selection. The hybrid policy yields both the highest final accuracy and the smoothest training curve.