Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen

TL;DR

This work addresses the reinforcement learning inefficiency in masked diffusion LLMs by exploiting their inpainting capability to guide exploration. The authors introduce IGPO, which injects partial ground-truth reasoning traces only in zero-advantage scenarios to restore informative gradients while preserving mostly on-policy generation. A two-stage training recipe—Length-Aligned SFT with rewritten concise traces followed by RL with IGPO—yields substantial gains on GSM8K, Math500, and AMC, achieving state-of-the-art results among full-attention masked dLLMs. Extensive ablations show partial inpainting and entropy-based filtering stabilize learning and that trace rewriting strengthens initialization for RL. The results imply that architectural properties of diffusion LLMs can be leveraged to improve sample efficiency and performance in complex mathematical reasoning tasks.

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
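The zero-advantage failure mode mentioned in the abstract can be illustrated with a small sketch. It assumes the common group-relative formulation in which each reward is centered by the group mean and scaled by the group standard deviation; the function name `group_advantages` and the `eps` constant are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center by the group mean, scale by the group std.

    Assumption: this is the standard GRPO-style normalization; the exact
    variant used in the paper may differ (e.g. no std scaling).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled response is wrong: all rewards are identical, so every
# advantage is zero and the group contributes no policy gradient.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# One correct response (e.g. obtained via inpainting) restores reward
# variance, so advantages become non-zero.
mixed = group_advantages([0.0, 0.0, 1.0, 0.0])

print(all_wrong)
print(mixed)
```

In the all-wrong group, each term `r - mu` is exactly zero, which is why the samples are wasted; a single correct response in the group is enough to give the correct sample a positive advantage and the incorrect ones a negative advantage.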

Paper Structure

This paper contains 31 sections, 8 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Unlike autoregressive LLMs, diffusion LLMs can be conditioned on future reasoning hints during generation through inpainting via bidirectional attention, enabling guided exploration toward correct solutions. (b) Applying inpainting-guided exploration in policy optimization outperforms standard Group Relative Policy Optimization (GRPO) sampling and reduces the occurrence of all-wrong groups. (c) Our full training recipe combining Length-Aligned supervised fine-tuning on concise reasoning traces with IGPO achieves SoTA performance among full-attention masked dLLMs across three mathematical benchmarks.
  • Figure 2: Overview of IGPO: When all sampled responses yield identical incorrect rewards (zero-advantage scenario), we perform hint-guided inpainting by generating additional responses using ground truth reasoning chunks as injected hints. Ground truth traces $y^*$ are segmented into variable-length chunks, and selected chunks are injected as fixed hints during generation while the model generates the remaining tokens. We then replace a fraction of the original incorrect responses with correct responses generated through inpainting, creating reward variance that enables non-zero advantages for effective policy gradient updates.
  • Figure 3: RL training curves of IGPO versus normal GRPO sampling. (a) Starting from LLaDA-8B-Instruct. (b) Starting from the length-aligned SFT checkpoint. IGPO exhibits superior and more stable training performance under both initialization checkpoints. Results are averaged over 3 random seeds across three mathematical reasoning benchmarks (GSM8K, MATH500, and AMC), with standard errors shown as shaded regions.
  • Figure 4: Impact of hint injection ratio on performance across 3 datasets, averaged over 3 seeds with standard error shown as shaded areas. We compare partial hint injection ($\eta \sim \mathcal{U}[0.2, 0.6]$) versus full hint injection ($\eta = 1.0$). Partial hint injection consistently outperforms full hint injection, demonstrating the benefits of self-generated reasoning. Both hint-guided inpainting variants outperform the baseline without any hint injection.
  • Figure 5: Impact of entropy clipping threshold on hint tokens. Performance comparison across different entropy clipping thresholds $\tau$ applied to hint token positions in IGPO, where $\tau=0.2$ represents learning from only the top 20% highest-entropy hint token positions, while $\tau=1.0$ indicates learning from all hint token positions without filtering. These results are obtained on GSM8K with a temperature of 0.1 and a generation length of 256.
  • ...and 2 more figures
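The hint-injection and entropy-filtering steps described in the captions of Figures 2, 4, and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MASK` placeholder, the chunk length range, the choice to keep each chunk at its original position, and all function names are hypothetical, and the per-position entropies are supplied by the caller rather than computed from a model.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for the dLLM mask token


def segment(trace, min_len=2, max_len=5, rng=random):
    """Split a ground-truth trace into variable-length chunks (start, tokens)."""
    chunks, i = [], 0
    while i < len(trace):
        n = rng.randint(min_len, max_len)
        chunks.append((i, trace[i:i + n]))
        i += n
    return chunks


def build_inpainting_template(trace, gen_len, eta, rng=random):
    """Start from a fully masked response and fix a subset of chunks as hints.

    Each chunk is kept with probability eta (the hint injection ratio);
    all other positions stay masked for the model to fill in.
    Assumption: hints are placed at their original positions in the trace.
    """
    template = [MASK] * gen_len
    hint_positions = []
    for start, chunk in segment(trace, rng=rng):
        if rng.random() < eta:
            for offset, tok in enumerate(chunk):
                pos = start + offset
                if pos < gen_len:
                    template[pos] = tok
                    hint_positions.append(pos)
    return template, hint_positions


def top_entropy_positions(hint_positions, entropies, tau):
    """Keep the top tau fraction of hint positions ranked by entropy."""
    if not hint_positions:
        return []
    k = max(1, int(round(tau * len(hint_positions))))
    ranked = sorted(hint_positions, key=lambda p: entropies[p], reverse=True)
    return sorted(ranked[:k])


rng = random.Random(0)
trace = "x = 3 ; y = x + 4 ; y = 7".split()
eta = rng.uniform(0.2, 0.6)  # partial hint injection, as in Figure 4
template, hints = build_inpainting_template(trace, gen_len=16, eta=eta, rng=rng)
print(template)
print(hints)
```

In this sketch, $\eta = 1.0$ fixes the entire trace (full hint injection), while sampling $\eta$ from $\mathcal{U}[0.2, 0.6]$ leaves most positions masked, so the model still generates the majority of the reasoning itself. Setting `tau=1.0` in `top_entropy_positions` keeps every hint position, matching the unfiltered setting described for Figure 5.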