UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Jie Liu; Zilyu Ye; Linxiao Yuan; Shenhan Zhu; Yu Gao; Jie Wu; Kunchang Li; Xionghui Wang; Xiaonan Nie; Weilin Huang; Wanli Ouyang

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
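The two quantities named in the abstract, group-relative advantages from sparse terminal rewards and an MSE penalty on velocity fields, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of the population standard deviation, the `eps` constant, and the flattened-list representation of velocities are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts of the same prompt.

    Assumption: the advantage is (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_reference):
    """Mean squared error between policy and reference velocity predictions.

    Assumption: both velocities are given as flat lists of floats of equal length.
    """
    if len(v_policy) != len(v_reference):
        raise ValueError("velocity vectors must have the same length")
    return mean((p - q) ** 2 for p, q in zip(v_policy, v_reference))


# Hypothetical group of four rollouts with one terminal reward each.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Hypothetical two-dimensional velocities from the policy and a frozen reference.
penalty = velocity_mse_penalty([1.0, 2.0], [1.0, 0.0])
```

In this sketch each rollout receives a single scalar advantage, which would then be shared by every text token and every denoising step of that rollout, while the penalty term is computed per denoising step and added to the loss with a coefficient that is not specified here.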

Paper Structure (35 sections, 8 equations, 8 figures, 3 tables)

Introduction
Related Work
RL for LLMs
RL for Diffusion and Flow Matching Models
Unified Multimodal Understanding and Generation Models
Concurrent Work
Preliminary
Text GRPO
Flow GRPO
SDE Sampling.
Mitigating Reward Hacking via RatioNorm.
Method
Multimodal Generation as a Markov Decision Process
UniGRPO Framework
Eliminating Classifier-Free Guidance.
...and 20 more sections

Figures (8)

Figure 1: Overview of UniGRPO. By formulating interleaved generation as a joint MDP, this illustration demonstrates how UniGRPO jointly optimizes discrete language actions ($y_k$) in the LLM's next-token prediction, and continuous visual actions ($x_{t_k-\Delta t}$) in flow matching. Both policies are updated using group-relative advantages derived from sparse terminal rewards.
Figure 2: T2I qualitative comparison.
Figure 3: Training and Validation reward curves of UniGRPO on the Finetuned Bagel base model at a resolution of 1024. The x-axis represents the gradient update steps.
Figure 4: Ablation Study on CFG. Removing CFG during training yields comparable or superior performance, showing that CFG is unnecessary for RL-based training. Note that CFG is applied at evaluation for all settings. Furthermore, these results are not directly comparable to the curves in Figure 3, as this ablation uses the original Bagel as the base model at a resolution of 512.
Figure 5: Ablation Study on Regularization Strategies. From left to right: training reward, validation reward, and images generated under three different regularization strategies. Without regularization, the validation reward drops after an initial increase, leading to unnatural, oversaturated textures in the generated images. For KL divergence on the latents, the significant drop in training reward indicates that a sufficiently large KL coefficient has been used, yet grid-like artifacts still emerge as early as step 250, prompting us to terminate this run early. In contrast, directly applying MSE regularization on the velocity field ensures stable training dynamics and produces high-fidelity images with realistic textures.
...and 3 more figures
