Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye

TL;DR

This work tackles the difficulty of reward-based learning in discrete diffusion models for multimodal generation. It introduces MaskGRPO, a modality-aware extension of Group Relative Policy Optimization that combines a clarified DDM foundation with a low-variance importance estimator and rollout adaptations tailored to language and vision. The approach delivers substantial RL gains on language reasoning benchmarks and improves text–image alignment and visual fidelity, approaching the performance of leading diffusion systems in discrete settings. By demonstrating stable, efficient policy updates and strong multimodal generation, MaskGRPO paves the way for practical reward-driven learning in discretized visual diffusion and multimodal RL.

Abstract

Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which confounds reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical one for discretized visual diffusion.
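For readers unfamiliar with GRPO, the generic recipe that the abstract builds on can be sketched as follows. This is a minimal illustration of standard group-relative advantages and a PPO-style clipped importance ratio, not MaskGRPO's own estimator. The function names, the population standard deviation, and the default `clip_range` are assumptions made for the example. The per-token log-probabilities are taken as given inputs here, and obtaining them is precisely what the abstract describes as difficult in a DDM.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of one group of completions for the same prompt.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """PPO/GRPO-style clipped objective for a single token (to be maximized).

    The importance ratio pi_new / pi_old is computed from log-probabilities
    and clipped to [1 - clip_range, 1 + clip_range].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Four completions for one prompt, with binary rewards.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(clipped_surrogate(math.log(0.6), math.log(0.3), advantages[0]))
```

Because the advantages are normalized within the group, a group whose completions all receive the same reward contributes zero advantage and therefore no gradient signal.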

Paper Structure

This paper contains 39 sections, 27 equations, 7 figures, 4 tables, 5 algorithms.

Figures (7)

  • Figure 1: Left: MaskGRPO consistently improves the base model, with significant RL gains across text and image generation tasks. Right: an intuitive demonstration of our method, which integrates modality-specific innovations in importance estimation and sampling.
  • Figure 2: A demonstration of reversing (re-masking) methods. We set the mask ratio to $r=0.6$. Random reversing (right) applies masks to all tokens with equal probability, while AR-like reversing (left) adopts a fading-out strategy. See Appendix \ref{append-samples} for complete showcases.
  • Figure 3: A comparison of sampled results. With identical sampling parameters on MMaDA (equipped with an 8192-vocab visual tokenizer (xieShowoOneSingle2024)), images sampled by our emerge method (below) demonstrate better texture and expressiveness.
  • Figure 4: Qualitative comparison. Results are generated with identical sampling parameters and shown in {original, w/ RL} pairs. MaskGRPO substantially improves the aesthetic quality of generated images in terms of artistic style, photographic detail, and overall atmosphere. We strongly recommend that readers view more portrait samples in Fig. \ref{fig:portraits}.
  • Figure 5: Ablation studies. a: ablation on timestep truncation in language tasks. b: ablation on reverse methods in language tasks. c: ablation on timestep truncation in vision tasks. d: ablation on clip range in vision tasks. See text for a detailed explanation.
  • ...and 2 more figures
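
The two reversing strategies named in the Figure 2 caption can be illustrated with a small sketch. Random reversing follows the caption directly: every token is masked with the same probability $r$. The caption does not specify the fading-out schedule of AR-like reversing, so the linear ramp below is purely an assumption for illustration: early positions stay mostly visible, late positions are mostly masked, and the average mask probability still equals $r$. All function names are hypothetical.

```python
import random


def random_reverse(num_tokens, ratio, rng):
    """Mask every position independently with the same probability `ratio`."""
    return [rng.random() < ratio for _ in range(num_tokens)]


def ar_like_mask_probs(num_tokens, ratio):
    """Hypothetical fading-out schedule (an assumption, not the paper's rule).

    The mask probability rises linearly with position and averages `ratio`.
    The slope is chosen so that every probability stays within [0, 1].
    """
    slope = 2.0 * min(ratio, 1.0 - ratio)
    probs = []
    for i in range(num_tokens):
        pos = (i + 0.5) / num_tokens  # relative position in (0, 1)
        probs.append(ratio + slope * (pos - 0.5))
    return probs


def ar_like_reverse(num_tokens, ratio, rng):
    """Sample a mask in which later tokens are more likely to be masked."""
    probs = ar_like_mask_probs(num_tokens, ratio)
    return [rng.random() < p for p in probs]


if __name__ == "__main__":
    rng = random.Random(0)
    r = 0.6  # mask ratio used in Figure 2
    for name, fn in [("random", random_reverse), ("AR-like", ar_like_reverse)]:
        mask = fn(32, r, rng)
        print(f"{name:8s}", "".join("M" if m else "." for m in mask))
```

Printing the masks side by side makes the contrast visible: the random mask is spread uniformly over the sequence, while the AR-like mask concentrates toward the end and leaves a mostly intact prefix.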