RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao

TL;DR

RubricRL addresses reward design for text-to-image reinforcement learning. For each prompt $p$, a Rubric Generation Model constructs a rubric $\mathcal{C}(p)$ of $M$ interpretable criteria (e.g., object count, attributes, OCR fidelity, realism), and a multimodal grader scores the generated image $I$ against each criterion. The resulting reward is $R_{rubric}(I,p,\mathcal{C}(p)) = \frac{1}{M}\sum_{i=1}^M y_i$, where each $y_i\in\{0,1\}$. This reward is integrated with Group Relative Policy Optimization (GRPO), whose updates over each rollout group $g$ are guided by the advantage $A_i = \frac{R_i - \bar{R}_g}{\sqrt{\frac{1}{|g|-1}\sum_{j\in g}(R_j - \bar{R}_g)^2}}$, with $\bar{R}_g$ the mean reward of the group. A dynamic rollout strategy oversamples $N'$ rollouts and selects $N$ of them via a hybrid of top-$K$ and random sampling. Because the criteria in $\mathcal{C}(p)$ adapt to each prompt, the supervision is prompt-adaptive and decomposable. Experiments on an autoregressive text-to-image (T2I) model show that RubricRL improves prompt faithfulness, visual detail, and realism, and generalizes better than prior multi-reward or unified-scalar methods; further analyses confirm robustness to grader choice and rollout configuration. Overall, RubricRL provides an interpretable, extensible, and architecture-agnostic framework for RL-based alignment of text-to-image generation with human preferences, with practical impact on controllable and auditable visual synthesis systems.
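The three quantities in the TL;DR (the rubric reward, the group-normalized advantage, and the hybrid rollout selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `eps` stabilizer in the denominator, and computing advantages over the selected $N$ rollouts rather than all $N'$ are assumptions made for illustration. The grader is assumed to have already produced the binary verdicts $y_i$.

```python
import random
import statistics
from typing import List, Sequence


def rubric_reward(verdicts: Sequence[int]) -> float:
    """Mean of the binary per-criterion verdicts y_i in {0, 1}."""
    if not verdicts:
        raise ValueError("rubric must contain at least one criterion")
    return sum(verdicts) / len(verdicts)


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one rollout group: (R_i - mean) / sample std.

    statistics.stdev uses the 1/(|g|-1) estimator, matching the formula.
    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_rollouts(
    rewards: Sequence[float], n: int, k: int, rng: random.Random
) -> List[int]:
    """Pick n of the N' oversampled rollouts: the top-k by reward,
    plus (n - k) drawn uniformly at random from the remainder.
    Returns indices into `rewards`.
    """
    if not 0 <= k <= n <= len(rewards):
        raise ValueError("need 0 <= k <= n <= number of rollouts")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return order[:k] + rng.sample(order[k:], n - k)


# Hypothetical grader verdicts for N' = 6 rollouts of one prompt (M = 4 criteria).
verdicts_per_rollout = [
    [1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1],
    [0, 1, 0, 1], [0, 0, 0, 0], [1, 1, 1, 0],
]
rewards = [rubric_reward(v) for v in verdicts_per_rollout]
kept = select_rollouts(rewards, n=4, k=2, rng=random.Random(0))
advantages = group_advantages([rewards[i] for i in kept])
```

Keeping a random share of the rollouts alongside the top-$K$ leaves lower-reward samples in the group. If every selected rollout had the same reward, the standard deviation in the denominator would be zero and the advantages would carry no signal.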

Abstract

Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
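The abstract mentions a prompt-adaptive weighting mechanism and user control over which aspects are rewarded, but does not spell out the weighting scheme. One simple possibility, shown here purely as an assumption, is a normalized weighted average of the binary verdicts. The criterion names, the weights, and the function `weighted_rubric_reward` are hypothetical.

```python
from typing import Dict


def weighted_rubric_reward(
    verdicts: Dict[str, int], weights: Dict[str, float]
) -> float:
    """Normalized weighted average of binary verdicts (assumed scheme).

    A weight of 0 drops a criterion; uniform weights give the plain mean.
    """
    total = sum(weights[name] for name in verdicts)
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[name] * y for name, y in verdicts.items()) / total


# Hypothetical rubric for a prompt that asks for rendered text in the image.
verdicts = {"object_count": 1, "attribute_color": 0, "ocr_text": 1, "realism": 1}
uniform = {name: 1.0 for name in verdicts}
ocr_heavy = {**uniform, "ocr_text": 2.0}  # the user emphasizes text fidelity

plain = weighted_rubric_reward(verdicts, uniform)
emphasized = weighted_rubric_reward(verdicts, ocr_heavy)
```

Because each criterion keeps its own verdict and weight, a low reward can be traced to the specific criteria that failed. This is the interpretability the abstract contrasts with a black-box scalar.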

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Visual examples of RubricRL on two language backbones. With interpretable, user-controllable criteria, RubricRL improves the ability of SFT models to generate high-quality images.
  • Figure 2: Comparison of RubricRL with prior autoregressive (AR) reward formulations. (a) Multi-reward pipelines combine CLIP, OCR, and realism metrics but require fragile weight tuning and often miss fine-grained attributes. (b) Unified scalar models collapse diverse objectives into a single learned score, simplifying optimization but reducing interpretability and adaptability. (c) RubricRL replaces both with a decomposable, prompt-adaptive rubric—an explicit checklist of visual criteria (counting, attributes, OCR/text fidelity, realism). Each criterion is scored independently and integrated into GRPO to provide interpretable, variance-aware supervision that improves detail, prompt faithfulness, and debuggability.
  • Figure 3: Overview of the proposed method. We propose a simple, general rubric generation pipeline and rubric-based reward model for unified text-to-image generation.
  • Figure 4: Failure cases of o4-mini when grading counting on GenEval. The model misjudges instance counts under ambiguity.
  • Figure 5: Qualitative comparison of RubricRL and baseline models on prompts from DPG. RubricRL produces images that are both more aesthetically pleasing and better aligned with the prompt. Bold text highlights key elements that RubricRL captures successfully, while baseline models often fail to generate these details accurately.
  • ...and 2 more figures