Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones

Ranfei Chen; Ming Chen; Kaifei Wang

TL;DR

This work identifies non-uniform, dynamic zones of confusion in diffusion language model reasoning and shows that uniform gradient budgeting across denoising steps is suboptimal. It introduces Adaptive Trajectory Policy Optimization (ATPO), a lightweight adaptive step-selection framework that uses batch-averaged Rate of Entropy Change (RoEC) and Confidence-Margin (CM) signals to allocate gradients to high-leverage steps, without modifying the RL objective or compute budget. Empirical results across multiple reasoning benchmarks show that ATPO improves final accuracy and training stability, outperforms several trajectory-based baselines, and is compatible with existing RL methods. By exploiting trajectory dynamics, ATPO offers a principled and practical path toward more reliable and efficient RL for diffusion-based language models.

Abstract

Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
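The step-selection idea described above can be illustrated with a minimal sketch. The abstract does not give formulas, so everything here is an assumption made for illustration: CM is taken as the top-1 minus top-2 token probability, RoEC as the absolute change in batch-averaged entropy between consecutive denoising steps, the hybrid score as a weighted sum of min-max normalized RoEC and (1 - CM), and selection as a plain top-k. The function names and the `alpha` and `k` parameters are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_margin(probs: Sequence[float]) -> float:
    """Assumed CM definition: top-1 minus top-2 probability (small = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def batch_mean(per_trajectory: Sequence[Sequence[float]]) -> List[float]:
    """Average a [trajectory][step] table over trajectories, giving one value per step."""
    n = len(per_trajectory)
    return [sum(column) / n for column in zip(*per_trajectory)]


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_steps(
    entropy: Sequence[Sequence[float]],
    margin: Sequence[Sequence[float]],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Pick the k denoising steps with the highest hybrid RoEC+CM score.

    entropy[b][t] and margin[b][t] are step-level values for trajectory b at
    step t. The scoring rule is an illustrative assumption, not the paper's.
    """
    h = batch_mean(entropy)
    cm = batch_mean(margin)
    # Assumed RoEC: absolute entropy change from the previous step (0 at step 0).
    roec = [0.0] + [abs(h[t] - h[t - 1]) for t in range(1, len(h))]
    instability = min_max(roec)
    uncertainty = min_max([1.0 - m for m in cm])
    score = [alpha * r + (1.0 - alpha) * u for r, u in zip(instability, uncertainty)]
    ranked = sorted(range(len(score)), key=lambda t: score[t], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Two toy trajectories with an uncertainty spike around step 2.
    entropy = [[0.1, 0.1, 0.9, 0.2, 0.1], [0.1, 0.1, 0.7, 0.2, 0.1]]
    margin = [[0.9, 0.9, 0.1, 0.5, 0.9], [0.9, 0.9, 0.3, 0.5, 0.9]]
    print(select_steps(entropy, margin, k=2))
```

On this toy input the spike at step 2 and the sharp entropy drop at step 3 score highest, so those two steps would receive the gradient budget while the stable steps are skipped. The number of selected steps stays fixed at `k`, which mirrors the claim that the compute budget is unchanged.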
