Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

Qiang Lyu, Zicong Chen, Chongxiao Wang, Haolin Shi, Shibo Gao, Ran Piao, Youwei Zeng, Jianlou Si, Fei Ding, Jing Li, Chun Pong Lau, Weiqiang Wang

TL;DR

Multi-GRPO tackles two core GRPO limitations in text-to-image alignment: poor credit assignment for early denoising steps and reward mixing across multiple objectives. It introduces tree-based trajectories to provide richer, descendant-based estimates for early actions, coupled with reward-based grouping to compute per-reward advantages before aggregation. On PickScore-25k, it improves single-objective alignment; on OCR-Color-10 it achieves balanced improvements across text fidelity, color accuracy, and image quality. The approach yields more stable gradients and better multi-objective trade-offs, with a curated benchmark for visual-text rendering tasks and public code release.

Abstract

Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) Shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are applied uniformly across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) Reward-mixing: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color), which have mismatched scales and variances, lead to unstable gradients and conflicting updates. To address these issues, we propose Multi-GRPO, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms temporal groups, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce reward-based grouping to compute advantages for each reward function independently before aggregation, disentangling conflicting signals. To facilitate evaluation of multi-objective alignment, we curate OCR-Color-10, a visual text rendering dataset with explicit color constraints. Across the single-reward PickScore-25k and multi-objective OCR-Color-10 benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at https://github.com/fikry102/Multi-GRPO.
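
The difference between reward mixing and reward-based grouping can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy reward values, the equal weights, and the use of a plain (unscaled) weighted sum for aggregation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def mixed_advantages(rewards, weights):
    """Baseline: combine rewards with fixed weights first, then normalize once."""
    mixed = [sum(w * r for w, r in zip(weights, sample)) for sample in rewards]
    return group_normalize(mixed)

def grouped_advantages(rewards, weights):
    """Reward-based grouping: normalize each reward on its own, then aggregate.

    Assumption: aggregation is a plain weighted sum of per-reward advantages.
    """
    num_rewards = len(weights)
    per_reward = [
        group_normalize([sample[m] for sample in rewards])
        for m in range(num_rewards)
    ]
    return [
        sum(weights[m] * per_reward[m][i] for m in range(num_rewards))
        for i in range(len(rewards))
    ]

# Hypothetical rewards for 4 images of one prompt:
# (OCR-like score in [0, 1], image-quality score on a larger scale).
rewards = [(1.0, 20.0), (0.0, 22.0), (1.0, 21.0), (0.0, 23.0)]
weights = [0.5, 0.5]

mixed = mixed_advantages(rewards, weights)
grouped = grouped_advantages(rewards, weights)

best_mixed = max(range(len(rewards)), key=lambda i: mixed[i])
best_grouped = max(range(len(rewards)), key=lambda i: grouped[i])
print(best_mixed, best_grouped)
```

In this toy data the larger-scale quality score dominates the mixed reward, so the mixed baseline favors an image with a wrong text rendering, while per-reward normalization puts both rewards on a comparable scale before they are combined.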

Paper Structure

This paper contains 41 sections, 21 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: (Left) Early branching produces richer diversity than delayed branching, indicating that early denoising steps are more critical for exploration. Motivated by this observation, our tree-based trajectories concentrate branching in the high-entropy early steps. (Right) Illustration of the reward-mixing problem and our reward-based grouping solution on the curated multi-objective OCR-Color-10 dataset.
  • Figure 2: Overview of Multi-GRPO. We introduce two orthogonal grouping mechanisms to address the limitations of standard GRPO. (Left) Tree-Based Trajectories by branching at early steps: To solve the shared credit assignment problem, we replace independent rollouts with tree-structured rollouts. Early-step actions are evaluated based on a diverse set of descendant leaves, yielding more accurate estimates for critical early decisions. (Right) Reward-Based Grouping: To solve the reward-mixing problem in multi-objective optimization, we normalize advantages for each reward function independently before aggregation. This disentangles conflicting signals and prevents certain rewards from dominating the learning process. $n\in \{1,\cdots,N_j\}, m\in \{1,\cdots,M\}$. $N_j$ denotes the number of nodes at step $j$.
  • Figure 3: Illustration of Tree-Based Trajectories. (Left) Standard GRPO: It uses independent rollouts. The derived trajectory-level advantages are applied uniformly across timesteps, leading to poor credit assignment for early steps. (Right) Our method (Tree-Based Trajectories): We construct a tree by branching at selected early steps (e.g., $b_k,b_{k+1}$). This allows an early-step state to be evaluated by a diverse set of descendant leaves, providing a more reliable Monte Carlo estimate of its value.
  • Figure 4: Illustration of Reward-Based Grouping. To address the reward-mixing problem in multi-objective optimization, we avoid mixing rewards before normalization. Instead, each reward type (e.g., OCR, Color) is normalized independently against its own group statistics (mean and std of "Group $m$") to compute a disentangled advantage. The final advantage $\hat{A}^i$ is then formed as a scaled weighted sum of these individual advantages $\hat{A}^i_m$.
  • Figure 5: Ablation on the OCR-Color-10 dataset. Relative to the Flow-GRPO baseline, Tree-Based Trajectories show a generally upward trend across all three rewards, although the gain in $R_{\text{pick}}$ remains modest. Adding Reward-Based Grouping on top of Tree-Based Trajectories further boosts PickScore and leads to a more balanced improvement overall, forming the full Multi-GRPO approach.
  • ...and 10 more figures
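
The descendant-leaf evaluation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes a complete tree with leaves listed in depth-first order, values each node by the mean terminal reward of its descendant leaves, and normalizes across all nodes at the same branching step. The function name `level_advantages` and the toy rewards are hypothetical.

```python
from statistics import mean, pstdev

def level_advantages(leaf_rewards, branch_factors, eps=1e-6):
    """Per-level advantages for a complete rollout tree.

    leaf_rewards: terminal rewards of the leaves in depth-first order.
    branch_factors: number of children spawned at each branching step.
    Returns one list of advantages per branching step, one entry per node.
    """
    expected_leaves = 1
    for k in branch_factors:
        expected_leaves *= k
    if len(leaf_rewards) != expected_leaves:
        raise ValueError("leaf count must equal the product of branch factors")

    advantages = []
    num_nodes = 1
    for k in branch_factors:
        num_nodes *= k
        block = len(leaf_rewards) // num_nodes  # leaves under each node
        # Monte Carlo value of a node: mean reward of its descendant leaves.
        values = [
            mean(leaf_rewards[n * block:(n + 1) * block])
            for n in range(num_nodes)
        ]
        mu, sigma = mean(values), pstdev(values)
        advantages.append([(v - mu) / (sigma + eps) for v in values])
    return advantages

# Toy tree: branch into 2 at the first step, then each node branches into 2.
adv = level_advantages([0.2, 0.4, 0.8, 1.0], branch_factors=[2, 2])
print(adv[0])  # advantages of the 2 early-step nodes
print(adv[1])  # advantages of the 4 leaves
```

Each early-step node is scored from two leaves rather than one, which is the sense in which branching gives early actions a less noisy estimate than a single independent rollout.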