Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones
Ranfei Chen, Ming Chen, Kaifei Wang
TL;DR
This work identifies non-uniform, dynamic zones of confusion in the reasoning trajectories of diffusion language models and shows that budgeting gradients uniformly across denoising steps is suboptimal. It introduces ATPO (Adaptive Trajectory Policy Optimization), a lightweight adaptive step-selection framework that uses batch-averaged Rate of Entropy Change (RoEC) and Confidence-Margin (CM) signals to allocate gradients to high-leverage steps, without modifying the reinforcement learning (RL) objective or the compute budget. Empirical results across multiple reasoning benchmarks show that ATPO improves final accuracy and training stability, outperforms several trajectory-based baselines, and is compatible with existing RL methods. By leveraging trajectory dynamics, ATPO provides a principled and practical path toward more reliable and efficient RL for diffusion-based language models.
Abstract
Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning (RL) increasingly used for downstream alignment. Existing trajectory-based RL methods allocate policy gradients uniformly across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These metrics reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
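
To make the step-level metrics and the hybrid selection rule concrete, the following Python sketch scores denoising steps and picks the k highest-scoring ones. The abstract does not give the exact formulas, so everything here is an assumption for illustration: entropy is taken as the mean Shannon entropy over token positions, CM as the mean gap between the top-1 and top-2 probabilities, RoEC as the absolute change in entropy between consecutive steps, and the hybrid rule as a weighted sum of min-max normalized, batch-averaged signals. The function names and the `alpha` weight are hypothetical and are not taken from the paper.

```python
import math


def step_entropy(probs):
    """Mean Shannon entropy (in nats) over token positions at one step.

    probs: list of per-position probability distributions (lists of floats).
    """
    total = 0.0
    for dist in probs:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(probs)


def step_confidence_margin(probs):
    """Mean gap between the two largest probabilities (assumed CM definition).

    A small margin means the model is torn between candidates.
    Each distribution must have at least two entries.
    """
    total = 0.0
    for dist in probs:
        top1, top2 = sorted(dist, reverse=True)[:2]
        total += top1 - top2
    return total / len(probs)


def rate_of_entropy_change(entropies):
    """Absolute first difference of entropy across steps (assumed RoEC).

    The first step has no predecessor, so it is assigned 0.
    """
    return [0.0] + [abs(b - a) for a, b in zip(entropies, entropies[1:])]


def min_max(values):
    """Rescale values to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def batch_mean(per_trajectory):
    """Average a per-step signal over the trajectories in a batch."""
    n = len(per_trajectory)
    return [sum(column) / n for column in zip(*per_trajectory)]


def select_steps(trajectories, k, alpha=0.5):
    """Return the indices of the k steps with the highest hybrid score.

    trajectories: batch of trajectories; each trajectory is a list of steps,
    and each step is a list of per-position probability distributions.
    alpha: hypothetical weight on RoEC versus CM-based uncertainty.
    """
    roec_per_traj = []
    cm_unc_per_traj = []
    for traj in trajectories:
        entropies = [step_entropy(step) for step in traj]
        roec_per_traj.append(rate_of_entropy_change(entropies))
        # Convert the margin into an uncertainty: low margin -> high value.
        cm_unc_per_traj.append([1.0 - step_confidence_margin(s) for s in traj])

    roec = min_max(batch_mean(roec_per_traj))
    cm_unc = min_max(batch_mean(cm_unc_per_traj))
    scores = [alpha * r + (1.0 - alpha) * c for r, c in zip(roec, cm_unc)]

    ranked = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return sorted(ranked[:k])


# Toy batch: one trajectory, five steps, one token position, vocabulary of 3.
confident = [[0.98, 0.01, 0.01]]
confused = [[0.40, 0.35, 0.25]]
toy_batch = [[confident, confident, confused, confident, confident]]
selected = select_steps(toy_batch, k=2)
```

In this toy batch the confused step (index 2) scores highest on both signals, and the step after it (index 3) scores high on RoEC alone because entropy drops sharply there. With the same number of selected steps as a uniform schedule would use, the gradient budget stays fixed and only its placement changes, which is the property the abstract attributes to ATPO.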
