Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye
TL;DR
This work tackles the difficulty of reward-based learning in discrete diffusion models for multimodal generation. It introduces MaskGRPO, a modality-aware extension of Group Relative Policy Optimization that combines a clarified DDM foundation with a low-variance importance estimator and rollout adaptations tailored to language and vision. The approach delivers substantial RL gains on language reasoning benchmarks and improves text–image alignment and visual fidelity, approaching the performance of leading diffusion systems in discrete settings. By demonstrating stable, efficient policy updates and strong multimodal generation, MaskGRPO paves the way for practical reward-driven learning in discretized visual diffusion and multimodal RL.
Abstract
Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which hampers reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuations for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and as the first practical one for discretized visual diffusion.
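For readers unfamiliar with GRPO, the following is a minimal sketch of the generic group-relative advantage and the clipped importance-ratio objective that it builds on. It is not the MaskGRPO estimator: the function names are hypothetical, and the log-probabilities `logp_new` and `logp_old` are simply assumed to be available, whereas obtaining a usable estimate of them for a DDM is precisely the difficulty the abstract describes.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value function is needed.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single completion.

    logp_new / logp_old are assumed log-probabilities of the completion
    under the current and the rollout policy (illustrative inputs only).
    """
    ratio = math.exp(logp_new - logp_old)  # importance ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Four completions of one prompt, scored by a binary reward.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(clipped_surrogate(math.log(2.0), 0.0, advantages[0]))
```

In this sketch, completions scoring above the group mean receive positive advantages and the rest negative ones, and the clipping limits how far a single update can move the policy away from the one that produced the rollouts.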
