MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard

TL;DR

MIRO addresses the inefficiencies and data discarding of post-hoc alignment by introducing multi-reward conditioning directly in pretraining for text-to-image generation. It learns a conditional distribution $p_\theta(x|c,\mathbf{s})$ by augmenting the dataset with multi-reward annotations, training with a multi-reward flow-matching objective, and enabling reward-guided inference. Empirically, MIRO achieves state-of-the-art GenEval and user-preference scores, converges up to $19\times$ faster, and attains substantial compute efficiency (e.g., $>370\times$ fewer FLOPs than a large baseline) while improving compositional alignment. The approach also supports synthetic captions and test-time scaling to further enhance trade-offs between aesthetics and alignment, demonstrating robust cross-metric generalization and practical deployment potential for controllable, reward-aware generation.

Abstract

Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. Discarding informative data in this way, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

Paper Structure

This paper contains 25 sections, 3 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Images from our MIRO Synth model on PartiPrompt (Yu et al., 2022).
  • Figure 2: MIRO training pipeline. Top: dataset scoring with multiple rewards $r_1,\ldots,r_N$ produces a scores vector $\hat{\mathbf{s}}$. Bottom: during training, the model conditions on $\hat{\mathbf{s}}$ and a noisy input $x_t=(1-t)x+t\,\epsilon$ to learn to denoise toward high-reward regions.
  • Figure 3: MIRO inference overview (single model). The previous step $x_t$ and caption $c$ are fed to one MIRO model while conditioning on two reward histograms: $\hat{\mathbf{s}}^{+}$ (top) and $\hat{\mathbf{s}}^{-}$ (bottom), producing $v_\theta(x_t,c,\hat{\mathbf{s}}^{+})$ and $v_\theta(x_t,c,\hat{\mathbf{s}}^{-})$. The guidance direction $\Delta v = v_\theta(x_t,c,\hat{\mathbf{s}}^{+}) - v_\theta(x_t,c,\hat{\mathbf{s}}^{-})$ is scaled by $\omega$ and added to the high-reward output to obtain the guided velocity $\hat{v}_\theta$.
  • Figure 4: Comparison of the MIRO model against eight other specialist/baseline models. Each radar plot shows MIRO versus a comparison model across six metrics.
  • Figure 5: Training curves showing reward evolution during training. $\times$ Baseline, $\diamond$ MIRO.
  • ...and 11 more figures
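
The captions of Figures 2 and 3 describe three small computations: scoring a sample with several rewards, forming the noisy input $x_t=(1-t)x+t\,\epsilon$, and combining two conditional velocities as $\hat{v}_\theta = v_\theta(x_t,c,\hat{\mathbf{s}}^{+}) + \omega\,\Delta v$. The following is a minimal pure-Python sketch of those steps, not the authors' implementation. The function names, the toy reward functions, and `toy_velocity_model` are hypothetical stand-ins introduced for illustration. The velocity target $\epsilon - x$ is the time derivative of the interpolation above and is assumed here as the usual flow-matching regression target; the captions do not state it explicitly.

```python
import random


def score_sample(image, caption, reward_fns):
    """Apply each reward r_1..r_N to one (image, caption) pair -> score vector."""
    return [r(image, caption) for r in reward_fns]


def interpolate(x, eps, t):
    """Noisy input x_t = (1 - t) * x + t * eps (Figure 2 caption)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def velocity_target(x, eps):
    """d/dt of the interpolation: eps - x (assumed regression target)."""
    return [ei - xi for xi, ei in zip(x, eps)]


def guided_velocity(v_pos, v_neg, omega):
    """v_hat = v_pos + omega * (v_pos - v_neg) (Figure 3 caption)."""
    return [vp + omega * (vp - vn) for vp, vn in zip(v_pos, v_neg)]


def toy_velocity_model(x_t, caption, scores):
    """Hypothetical stand-in for v_theta(x_t, c, s); NOT the MIRO network.

    It ignores the caption and shifts x_t by the mean score, only so that
    the two conditional outputs differ.
    """
    mean_score = sum(scores) / len(scores)
    return [xi - mean_score for xi in x_t]


# Toy "image" as a flat list of floats, plus Gaussian noise.
rng = random.Random(0)
x = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x]

# Dataset scoring with two made-up reward functions.
toy_rewards = [lambda img, cap: float(len(cap) > 0), lambda img, cap: 0.5]
s_hat = score_sample(x, "a cat", toy_rewards)

# Training-side quantities: noisy input and its regression target.
x_t = interpolate(x, eps, t=0.3)
target = velocity_target(x, eps)

# Inference-side guidance: same model, high vs. low reward conditioning.
s_pos, s_neg = [1.0, 1.0], [0.0, 0.0]
v_pos = toy_velocity_model(x_t, "a cat", s_pos)
v_neg = toy_velocity_model(x_t, "a cat", s_neg)
v_hat = guided_velocity(v_pos, v_neg, omega=2.0)
```

With $\omega = 0$ the guided output reduces to the high-reward prediction alone; larger $\omega$ pushes the result further along the direction that separates high-reward from low-reward conditioning.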