RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao
TL;DR
RubricRL tackles reward design for text-to-image reinforcement learning by generating a per-prompt rubric of interpretable criteria and scoring each criterion with a multimodal grader. The reward is the mean of the binary judgments, $R_{rubric}(I,p,\mathcal{C}(p)) = \frac{1}{M}\sum_{i=1}^M y_i$, where $I$ is the generated image, $p$ is the prompt, and each $y_i\in\{0,1\}$ is the grader's judgment on criterion $i$ of the $M$ criteria in $\mathcal{C}(p)$. This reward is combined with Group Relative Policy Optimization (GRPO) and a dynamic rollout strategy. Within each rollout group $g$, updates are guided by the normalized advantage $A_i = \frac{R_i - \bar{R}_g}{\sqrt{\frac{1}{|g|-1}\sum_{j\in g}(R_j - \bar{R}_g)^2}}$, where $\bar{R}_g$ is the mean reward of the group; $N'$ rollouts are oversampled and $N$ of them are selected through a hybrid of top-$K$ and random sampling. The Rubric Generation Model constructs the criteria set $\mathcal{C}(p)$ (e.g., object count, attributes, OCR fidelity, realism) and adapts it to each prompt, enabling prompt-adaptive, decomposable supervision. Empirical results on an autoregressive T2I model show that RubricRL improves prompt faithfulness, visual detail, and realism and generalizes better than prior multi-reward or unified-scalar methods, and analyses confirm robustness to grader choice and rollout configurations. Overall, RubricRL provides an interpretable, extensible, and architecture-agnostic framework for RL-based alignment of text-to-image generation with human preferences, with practical impact on controllable and auditable visual synthesis systems.
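The three quantities in this summary (rubric reward, group-normalized advantage, and rollout selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `eps` stabilizer in the denominator, and the reading of the hybrid strategy as "keep the $K$ highest-reward rollouts, then fill the remaining $N-K$ slots uniformly at random from the rest" are all assumptions made for illustration; the grader call that produces each $y_i$ is omitted.

```python
import random
import statistics
from typing import List, Optional, Sequence


def rubric_reward(criterion_scores: Sequence[int]) -> float:
    """R_rubric: mean of the binary per-criterion judgments y_i in {0, 1}."""
    if not criterion_scores:
        raise ValueError("rubric must contain at least one criterion")
    if any(y not in (0, 1) for y in criterion_scores):
        raise ValueError("each criterion score must be 0 or 1")
    return sum(criterion_scores) / len(criterion_scores)


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """A_i = (R_i - mean) / std over one rollout group.

    Uses the sample standard deviation (|g| - 1 in the denominator), as in
    the formula above. `eps` is an assumption added here so that a group
    with identical rewards yields zero advantages instead of dividing by 0.
    """
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # needs at least two rewards
    return [(r - mean) / (std + eps) for r in rewards]


def select_rollouts(
    rewards: Sequence[float],
    n: int,
    k: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick n of the N' oversampled rollouts (assumed hybrid rule).

    Returns indices: the k highest-reward rollouts first, followed by
    n - k rollouts drawn uniformly without replacement from the remainder.
    """
    if not 0 <= k <= n <= len(rewards):
        raise ValueError("require 0 <= k <= n <= number of rollouts")
    rng = rng or random.Random()
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    top_k = order[:k]
    random_rest = rng.sample(order[k:], n - k)
    return top_k + random_rest
```

For example, a rubric with four criteria of which three are satisfied gives `rubric_reward([1, 1, 0, 1]) == 0.75`. In a training loop, the rewards of the selected rollouts would then be passed to `group_advantages` to obtain the per-rollout weights for the policy update.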
Abstract
Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt: a decomposable checklist of fine-grained visual criteria, such as object correctness, attribute accuracy, OCR fidelity, and realism, tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
