Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

Qiang Lyu; Zicong Chen; Chongxiao Wang; Haolin Shi; Shibo Gao; Ran Piao; Youwei Zeng; Jianlou Si; Fei Ding; Jing Li; Chun Pong Lau; Weiqiang Wang

Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

Qiang Lyu, Zicong Chen, Chongxiao Wang, Haolin Shi, Shibo Gao, Ran Piao, Youwei Zeng, Jianlou Si, Fei Ding, Jing Li, Chun Pong Lau, Weiqiang Wang

TL;DR

Multi-GRPO tackles two core GRPO limitations in text-to-image alignment: poor credit assignment for early denoising steps and reward mixing across multiple objectives. It introduces tree-based trajectories to provide richer, descendant-based estimates for early actions, coupled with reward-based grouping to compute per-reward advantages before aggregation. On PickScore-25k, it improves single-objective alignment; on OCR-Color-10 it achieves balanced improvements across text fidelity, color accuracy, and image quality. The approach yields more stable gradients and better multi-objective trade-offs, with a curated benchmark for visual-text rendering tasks and public code release.

Abstract

Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) \textit{Shared credit assignment}: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) \textit{Reward-mixing}: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color)--which have mismatched scales and variances--lead to unstable gradients and conflicting updates. To address these issues, we propose \textbf{Multi-GRPO}, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms \emph{temporal groups}, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce \emph{reward-based grouping} to compute advantages for each reward function \textit{independently} before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate \textit{OCR-Color-10}, a visual text rendering dataset with explicit color constraints. Across the single-reward \textit{PickScore-25k} and multi-objective \textit{OCR-Color-10} benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at \href{https://github.com/fikry102/Multi-GRPO}{https://github.com/fikry102/Multi-GRPO}.

Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

TL;DR

Abstract

Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (15)