The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

TL;DR

Adv-GRPO introduces an adversarial reinforcement-learning framework for text-to-image generation that replaces static scalar rewards with a co-trained reward discriminator guided by high-quality reference images and, optionally, visual foundation-model rewards. The generator is optimized via GRPO, while the reward model learns to distinguish reference images from generated samples, mitigating reward hacking and bias. The approach is extended to dense visual signals by attaching a head to frozen visual backbones (e.g., DINO) to produce global and local rewards, enabling distribution transfer and style customization. Starting from SD3, Adv-GRPO achieves better human-perceived image quality and aesthetics than both Flow-GRPO and the original SD3 while maintaining competitive benchmark scores. It is also data-efficient, working with few reference images, and supports RL-based style transfer. Code and models have been released.

Abstract

A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
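
The alternating update described in the abstract can be illustrated with a minimal sketch. It is not the released implementation. It assumes a binary cross-entropy discriminator loss, a reward equal to the discriminator's sigmoid output, and the standard group-normalized GRPO advantage; the function names and the logit values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_loss(ref_logits, gen_logits, eps=1e-12):
    """Binary cross-entropy (an assumed loss form): reference images are
    positives (label 1), generated samples are negatives (label 0)."""
    pos = [-math.log(sigmoid(z) + eps) for z in ref_logits]
    neg = [-math.log(1.0 - sigmoid(z) + eps) for z in gen_logits]
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))


def adversarial_rewards(gen_logits):
    """Assumed reward: the discriminator's probability that a generated
    image looks like a reference image."""
    return [sigmoid(z) for z in gen_logits]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within the group of
    samples generated for one prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical discriminator logits for one prompt.
ref_logits = [2.0, 1.5, 2.5]        # reference images (positives)
gen_logits = [-1.0, 0.5, -0.2, 1.0]  # group of generated samples (negatives)

# Step 1: the reward model is updated by minimizing this loss.
d_loss = discriminator_loss(ref_logits, gen_logits)

# Step 2: the generator is updated with GRPO using these advantages.
rewards = adversarial_rewards(gen_logits)
advantages = group_relative_advantages(rewards)
```

Because the discriminator keeps training on fresh generated samples, a sample that exploits a weakness in the current reward is later seen as a negative, which is the intuition behind the claim that the reward model can largely avoid being hacked.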

Paper Structure

This paper contains 22 sections, 14 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Overview of our approach. Our method Adv-GRPO improves text-to-image (T2I) generation in three ways: 1) Alleviate Reward Hacking, achieving higher perceptual quality while maintaining comparable benchmark performance (e.g., PickScore, OCR), as shown in the top-left human evaluation panel; 2) Visual Foundation Model as Reward, leveraging visual foundation models (e.g., DINO) for rich visual priors, leading to overall improvements as shown in the top-middle human evaluation panel; 3) RL-based Distribution Transfer, enabling style customization by aligning generations with reference domains.
  • Figure 2: Human evaluation comparing Flow-GRPO and SD3 under PickScore and OCR rewards.
  • Figure 3: Pipeline of Adv-GRPO. The generator is optimized using the GRPO loss, while the discriminator is trained to distinguish between generated samples and reference images, treated as negative and positive samples, respectively. The discriminator serves as a reward model to provide feedback for the generator.
  • Figure 4: Human evaluation under PickScore- and OCR-based rewards. Our method Adv-GRPO improves image quality and aesthetics with the PickScore reward in a), and all metrics with the OCR reward in b). Compared with the original model (SD3), the PickScore reward trades image quality for aesthetic improvements in c), and the OCR reward trades aesthetics for text alignment in d).
  • Figure 5: Visualizations under PickScore (Left) and OCR (Right) rewards. Our method Adv-GRPO alleviates reward hacking for both.
  • ...and 13 more figures