Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng

TL;DR

This work systematically compares GRPO and DPO for autoregressive image generation, treating generation as a CoT-like reasoning process in a multimodal setting. It finds that DPO delivers stronger in-domain (ID) performance, whereas GRPO generalizes better out-of-domain (OOD), and that a reward model's own generalization ability plays a key role in how well the RL-trained model generalizes. The study also analyzes how reward-model design and three scaling strategies (sampling size, in-domain data diversity, and iterative training) affect ID and OOD performance differently, offering concrete guidance for future RL-based image generation research. Overall, the results delineate trade-offs between data efficiency and generalization, and they point to reward-model robustness as a crucial lever for multimodal CoT reasoning.

Abstract

Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
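To make the contrast between the two algorithms concrete, the sketch below computes the per-pair DPO loss and the group-relative advantages used by GRPO, following the commonly cited formulations of each method. It is illustrative only and is not taken from the authors' released code: the function names, the `beta` and `eps` defaults, and the scalar log-probabilities and rewards are all assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. beta (assumed value) scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in an overflow-safe form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group.

    rewards holds reward-model scores for several images sampled from the same
    prompt. eps (assumed value) guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Policy prefers the chosen image more strongly than the reference does.
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
    # Four samples for one prompt, scored by a (hypothetical) reward model.
    print(grpo_advantages([0.2, 0.9, 0.5, 0.4]))
```

The structural difference is visible in the inputs: DPO consumes fixed preference pairs and a reference model, so it can train off-policy on pre-collected data, while GRPO needs only a group of freshly sampled outputs and their rewards, which ties it to on-policy sampling and to the reward model's scores at training time.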

Paper Structure

This paper contains 21 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Investigation for GRPO and DPO in Autoregressive Image Generation. We analyze the advantages of GRPO and DPO in both in-domain and out-of-domain scenarios (Top-left), the effect of different reward models (Top-right), and the influence of scaling strategies (Bottom), providing unique insights to this field.
  • Figure 2: Visualization Results of In-Domain vs. Out-of-Domain Performance Comparison.
  • Figure 3: (a) The Impact of Different Reward Models' Intrinsic Generalization Capability. We evaluate the generalization performance of GRPO, DPO, and the intrinsic generalization performance (represented by best-of-4 strategy) of three reward models. (b-e) Effects of Three Scaling Strategies. We examine the effects of various scaling strategies, including sampling size, in-domain data diversity, and iterative training, on both in-domain and out-of-domain performance.
  • Figure 4: Visualization Results of the Impact of Different Reward Models.
  • Figure 5: Visualization Results of Insights from Investigating Scaling Strategies.
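The caption of Figure 3 mentions a best-of-4 strategy as a proxy for a reward model's intrinsic generalization. The caption does not spell out the procedure, so the following is a minimal sketch of a generic best-of-N selection under assumed interfaces: `generate` and `reward` are hypothetical callables standing in for an image generator and a reward model, and the toy versions in the usage section exist only to make the example runnable.

```python
import random
from typing import Any, Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], Any],  # hypothetical generator interface
    reward: Callable[[str, Any], float],            # hypothetical reward-model interface
    n: int = 4,
    seed: int = 0,
) -> Tuple[Any, float]:
    """Sample n candidates for a prompt and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins: a "candidate" is just a random number and the reward is its value.
def toy_generate(prompt: str, rng: random.Random) -> float:
    return rng.random()


def toy_reward(prompt: str, candidate: float) -> float:
    return candidate


if __name__ == "__main__":
    image, score = best_of_n("a red cube on a blue sphere", toy_generate, toy_reward, n=4)
    print(image, score)
```

Because selection depends only on the reward model's ranking, the quality of the best-of-N pick on unseen prompts reflects how well the reward model itself generalizes, independent of any RL training.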