Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models

Shaoan Xie, Lingjing Kong, Xiangchen Song, Xinshuai Dong, Guangyi Chen, Eric P. Xing, Kun Zhang

TL;DR

This work tackles the challenge of training diffusion LLMs to perform complex, multi-step reasoning. It introduces a hierarchical framework that treats reasoning as staged, localized constraints linked by latent variables, and proposes Step-Aware Policy Optimization (SAPO) to align the MdLLM denoising process with this structure. A novel step-based reward, built on GRPO, encourages incremental progress along the reasoning hierarchy, mitigating unstructured refinement where steps contribute little to the solution. Empirical results across multiple reasoning benchmarks show improved alignment between intermediate reasoning and final answers, stronger benchmark performance, and better generalization, with additional insights into reward learnability and efficiency. The approach offers a principled direction for making diffusion-based reasoning both more accurate and more interpretable, with practical implications for faster, structured generation in complex tasks.

Abstract

Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation, yet training them for complex reasoning remains a key challenge. Current reinforcement learning approaches often rely on sparse, outcome-based rewards, which can reinforce flawed reasoning paths that lead to coincidentally correct answers. We argue that this stems from a fundamental mismatch with the natural structure of reasoning. We first propose a theoretical framework that formalizes complex problem solving as a hierarchical selection process, where an intractable global constraint is decomposed into a series of simpler, localized logical steps. This framework provides a principled foundation for algorithm design, including theoretical insights into the identifiability of this latent reasoning structure. Motivated by this theory, we identify unstructured refinement -- a failure mode where a model's iterative steps do not contribute meaningfully to the solution -- as a core deficiency in existing methods. We then introduce Step-Aware Policy Optimization (SAPO), a novel RL algorithm that aligns the dLLM's denoising process with the latent reasoning hierarchy. By using a process-based reward function that encourages incremental progress, SAPO guides the model to learn structured, coherent reasoning paths. Our empirical results show that this principled approach significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
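The TL;DR notes that the step-based reward is built on GRPO, which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization and illustrates the abstract's point about sparse outcome rewards: two rollouts with the same final answer receive the same advantage regardless of how they reasoned. The additive `combine_rewards` helper, its `weight` parameter, and the toy reward values are assumptions made for illustration only; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def combine_rewards(outcome, step, weight=0.5):
    """Hypothetical additive mix of outcome and step-level rewards (assumption)."""
    return [o + weight * s for o, s in zip(outcome, step)]


# Four rollouts for one prompt: the first two reach the correct answer.
outcome_rewards = [1.0, 1.0, 0.0, 0.0]
# Toy step-level signal: only the first rollout made useful intermediate progress.
step_rewards = [1.0, 0.0, 0.0, 0.0]

outcome_only = group_relative_advantages(outcome_rewards)
with_steps = group_relative_advantages(combine_rewards(outcome_rewards, step_rewards))
```

With outcome rewards alone, the first two rollouts are indistinguishable. Once a step-level term is mixed in, the rollout whose intermediate steps helped receives the larger advantage.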

Paper Structure

This paper contains 35 sections, 5 theorems, 19 equations, 11 figures, 3 tables.

Key Result

Theorem 3.1

A learned model can recover the true latent reasoning steps $(\mathbf{S}_2, \dots, \mathbf{S}_L)$ of a reasoning process if it matches the true process's observable outputs and satisfies key structural conditions. Specifically, if the model favors the simplest explanation (i.e., sparsity constraint

Figures (11)

  • Figure 1: The problem of unstructured refinement. A standard MdLLM trained with an outcome-only reward produces a correct answer but fills its reasoning trace with meaningless, repetitive tokens. This indicates the iterative process is not contributing meaningfully to the solution.
  • Figure 2: Decomposition of reasoning complexity via hierarchical selection. The arrows represent the bottom-up selection process where lower-level variables determine higher-level concepts (Eq. \ref{eq:selection_mechanism}). (Left) A direct selection model requires a single, high-complexity function where all elements of the response $\mathbf{R}$ jointly determine the validity of the question $\mathbf{Q}$. (Right) Our hierarchical model decomposes this into simple, localized selection functions with intermediate selection variables $\mathbf{S}$, greatly reducing complexity.
  • Figure 2: Generalization ability comparison. The trained models are evaluated on unseen datasets: the reasoning benchmark SVAMP (Patel et al., 2021) and the commonsense benchmark ARC (Clark et al., 2018).
  • Figure 3: Illustration of the proposed step-aware reward. To encourage intermediate generations to contribute meaningfully to the final outcome, we generate new rollouts from randomly selected steps $t_1, t_2$ and estimate their contribution by the difference in outcome rewards. A larger difference indicates a higher contribution toward the final correct answer.
  • Figure 4: Comparison of generated responses across models. LLaDA (Nie et al., 2025) and diffu-GRPO (Zhao et al., 2025) both produce incorrect answers to the evaluation question. LLaDA’s response includes a brief but partially meaningful reasoning step toward the end, whereas diffu-GRPO continues generating verbose sentences that contribute little to the final prediction. In contrast, our model provides a structured reasoning process and successfully arrives at the correct answer. This highlights that optimizing solely for accuracy-based rewards may lead to sub-optimal outcomes, as such rewards overlook the quality and coherence of reasoning within the response.
  • ...and 6 more figures
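
The step-aware reward described in the Figure 3 caption can be sketched as follows: resume generation from two intermediate steps, estimate the mean outcome reward from each, and take the difference as the contribution of the steps in between. This is a minimal sketch under stated assumptions, not the paper's implementation. The names `rollout_fn`, `estimate_value`, `step_contribution`, `toy_rollout`, and `flat_rollout` are hypothetical, the rollouts are simulated with a random number generator rather than a real denoiser, and the convention that `t1` precedes `t2` in the denoising trajectory is assumed.

```python
import random


def estimate_value(rollout_fn, step, n_rollouts, rng):
    """Monte Carlo estimate of the mean outcome reward when resuming from `step`."""
    return sum(rollout_fn(step, rng) for _ in range(n_rollouts)) / n_rollouts


def step_contribution(rollout_fn, t1, t2, n_rollouts=64, seed=0):
    """Difference in estimated outcome reward between two intermediate steps.

    Assumes t1 comes before t2; a larger value means the steps between
    them moved the generation closer to a correct final answer.
    """
    rng = random.Random(seed)
    v1 = estimate_value(rollout_fn, t1, n_rollouts, rng)
    v2 = estimate_value(rollout_fn, t2, n_rollouts, rng)
    return v2 - v1


def toy_rollout(step, rng, total_steps=10):
    """Toy stand-in for a denoiser: success chance grows with the resume step."""
    return 1.0 if rng.random() < step / total_steps else 0.0


def flat_rollout(step, rng):
    """Toy unstructured refinement: the resume step does not affect the outcome."""
    return 1.0


structured = step_contribution(toy_rollout, t1=0, t2=10)
unstructured = step_contribution(flat_rollout, t1=0, t2=10)
```

In this toy setup the structured trajectory earns the maximum contribution, while the flat one earns zero even though every rollout ends with a correct answer, mirroring the unstructured refinement failure mode shown in Figure 1.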

Theorems & Definitions (8)

  • Theorem 3.1: Informal: Recovering the Latent Reasoning Process
  • Lemma B.1: Single-level Subspace Identifiability
  • proof
  • Lemma B.3: Pair-wise Identification (Zheng et al., 2025)
  • Lemma B.5: Single-level Component-wise Identifiability
  • proof
  • Theorem B.6: Identifiability of the Reasoning Hierarchy
  • proof