AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou

TL;DR

This work addresses the mismatch between maximum likelihood training and perceptual/human-centric quality in autoregressive image generation by introducing AR-GRPO, which applies Group Relative Policy Optimization to fine-tune AR image models with multi-objective rewards. The method integrates three kinds of reward, semantic alignment (CLIP+HPSv2), image quality (MANIQA), and realism (Qwen VL), into a GRPO-based RL loop that samples groups of outputs and updates the policy with group-relative advantages under KL regularization. Experiments on class-conditional (C2I) and text-conditional (T2I) generation show consistent improvements across objective metrics and human preferences, while also revealing a trade-off between quality and diversity due to reduced policy entropy. The results demonstrate the viability of RL-based optimization for autoregressive image generation and highlight potential for controllable, high-quality synthesis across different model sizes and resolutions.
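The group-sampling step described above can be illustrated with a minimal sketch. It assumes the three reward dimensions are combined by a simple weighted sum and that advantages are obtained by standardizing rewards within each group, as in the usual GRPO formulation; the summary does not state the exact combination rule. The weights, score values, and function names (`combine_rewards`, `group_relative_advantages`) are hypothetical and are not taken from the paper or its code release.

```python
import math
from typing import Dict, List


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension reward scores (weights are hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for a group of four images sampled from the same condition.
weights = {"semantic": 1.0, "quality": 1.0, "realism": 1.0}
group_scores = [
    {"semantic": 0.30, "quality": 0.60, "realism": 0.70},
    {"semantic": 0.25, "quality": 0.40, "realism": 0.50},
    {"semantic": 0.35, "quality": 0.70, "realism": 0.90},
    {"semantic": 0.20, "quality": 0.50, "realism": 0.40},
]

rewards = [combine_rewards(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Because the advantages are centered within the group, samples that score above the group mean are reinforced and those below it are discouraged, with no separate learned value function. The KL term mentioned in the summary is not shown here; it would enter the training loss as a penalty against a reference policy.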

Abstract

Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the outputs of vanilla autoregressive models using carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source code and models are available at: https://github.com/Kwai-Klear/AR-GRPO.

Paper Structure

This paper contains 17 sections, 7 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Autoregressive image generation enhanced by reinforcement learning based on LlamaGen (Sun et al., 2024). We show samples from our class-conditional (top row) and text-conditional (bottom row) image generation models.
  • Figure 2: Visual results for class-conditional image generation following RL training under various configurations: model sizes "B", "L", and "XL", and resolutions of 256$\times$256 and 384$\times$384.
  • Figure 3: Policy entropy comparison between baseline and RL-trained models under various architectural configurations in class-conditional image generation scenarios. "B", "L", and "XL" refer to the model size, and "256" and "384" denote the image resolution.
  • Figure 4: Policy entropy curve throughout the training process on the text-conditional image generation task.
  • Figure 5: Overall reward curves as a function of training steps during reinforcement learning for C2I (left) and T2I (right) image generation models.
  • ...and 10 more figures
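
Figures 3 and 4 track policy entropy, the quantity behind the quality-diversity trade-off noted in the TL;DR. As a minimal sketch of what is being measured, the snippet below computes the Shannon entropy of a next-token distribution from raw logits and averages it over generation steps. The logits are toy values over a four-entry vocabulary, and the function names are hypothetical; the summary does not say how the paper computes or aggregates entropy.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(logits_per_step: List[List[float]]) -> float:
    """Average per-token entropy over all generation steps of one sample."""
    return sum(token_entropy(step) for step in logits_per_step) / len(logits_per_step)


# Toy example: a flat distribution versus a sharply peaked one.
uniform_steps = [[0.0, 0.0, 0.0, 0.0]] * 3
peaked_steps = [[10.0, 0.0, 0.0, 0.0]] * 3

print(mean_policy_entropy(uniform_steps))
print(mean_policy_entropy(peaked_steps))
```

A policy that concentrates probability on a few image tokens has lower entropy and therefore samples less varied outputs, which is consistent with the reduced diversity the summary attributes to RL training.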