Diffusion-State Policy Optimization for Masked Diffusion Language Models
Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
TL;DR
This paper addresses coarse credit assignment in terminal-reward policy optimization for masked diffusion language models (MDLMs). It introduces Diffusion-State Policy Optimization (DiSPO), a plug-in layer that branches from the same intermediate denoising state by resampling fillings from cached logits and updating only the newly filled tokens, which gives finer-grained credit assignment without extra diffusion rollouts. The authors formalize a fixed-state objective, derive a policy-gradient estimator, and show that DiSPO can be combined with terminal-feedback objectives to form a mixed objective with coherent gradients. Empirically, DiSPO yields consistent accuracy gains for LLaDA-8B-Instruct on math and planning benchmarks (e.g., Sudoku, Countdown, GSM8K, MATH500) under matched rollout compute and optimizer updates, with variance-reduction benefits from token-local updates and same-state averaging. The results indicate that intermediate-state credit assignment can meaningfully improve MDLM performance on reasoning and planning tasks, suggesting broader applicability to diffusion-based NLP models.
Abstract
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo.
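The branching step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the MASK sentinel, the mean-reward baseline, and the toy reward are all assumptions made for illustration, and the abstract does not specify the exact estimator or its normalization. The sketch takes one intermediate masked state, resamples several one-shot fillings of the masked positions from cached per-position logits, scores each completion, and builds a policy-gradient surrogate over the newly filled tokens only.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked position


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def branch_from_state(state, cached_logits, reward_fn, k, rng):
    """Resample k one-shot fillings of the masked positions of `state`.

    `cached_logits` maps each masked position to logits saved during the
    original rollout, so no further denoising passes are run here.
    Returns the masked positions and a list of
    (completion, log-prob of newly filled tokens, reward) tuples.
    """
    masked = [i for i, tok in enumerate(state) if tok == MASK]
    probs = {i: softmax(cached_logits[i]) for i in masked}
    branches = []
    for _ in range(k):
        filled = list(state)
        logp = 0.0
        for i in masked:
            tok = rng.choices(range(len(probs[i])), weights=probs[i])[0]
            filled[i] = tok
            # Only newly filled tokens contribute; already-unmasked
            # tokens are shared by all branches and carry no gradient.
            logp += math.log(probs[i][tok])
        branches.append((filled, logp, reward_fn(filled)))
    return masked, branches


def same_state_advantages(rewards):
    """Center rewards on the mean over branches from the same state
    (an assumed baseline choice)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def surrogate_loss(branches):
    """REINFORCE-style surrogate: -mean(advantage * log-prob).

    In a real trainer the log-probs would be differentiable outputs of
    the current policy; here they are plain floats.
    """
    advs = same_state_advantages([r for _, _, r in branches])
    total = sum(a * lp for a, (_, lp, _) in zip(advs, branches))
    return -total / len(branches)
```

As a usage example, a state such as `[5, MASK, 7, MASK]` with cached logits for positions 1 and 3 can be passed to `branch_from_state` together with any scalar reward function and a seeded `random.Random`. Every branch keeps tokens 5 and 7 fixed and differs only at the two masked positions, which is what makes the comparison between branches a same-state comparison.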
