Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang, Jiarui Jin, Yuan Lu, Hanqian Wu, Cunjian Chen

TL;DR

CompGen is a compositional curriculum reinforcement learning framework that addresses compositional weakness in existing T2I models. It uses scene graphs to define a difficulty criterion for compositional ability and pairs that criterion with an adaptive Markov Chain Monte Carlo graph sampling algorithm.

Abstract

Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of curriculum training data that progressively optimizes T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
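
To make the three scheduling strategies and the GRPO step concrete, here is a minimal sketch in Python. It is not the authors' implementation. The function names, the integer difficulty levels, the linear progression over training steps, and the `sigma` parameter are all assumptions made for illustration; the abstract does not specify how the Gaussian schedule is parameterized. The advantage function follows the commonly described GRPO form, where each reward is standardized within its sampled group.

```python
import math
import random
from statistics import mean, pstdev


def easy_to_hard(step, total_steps, num_levels):
    """Assumed schedule: walk linearly from level 0 to the hardest level."""
    frac = step / max(total_steps - 1, 1)
    return min(int(frac * num_levels), num_levels - 1)


def gaussian_level(step, total_steps, num_levels, sigma=1.0, rng=random):
    """Assumed schedule: sample a level from a Gaussian-shaped weighting
    whose center drifts from the easiest to the hardest level."""
    center = (step / max(total_steps - 1, 1)) * (num_levels - 1)
    weights = [
        math.exp(-((level - center) ** 2) / (2 * sigma ** 2))
        for level in range(num_levels)
    ]
    return rng.choices(range(num_levels), weights=weights, k=1)[0]


def random_level(num_levels, rng=random):
    """Baseline schedule: pick a difficulty level uniformly at random."""
    return rng.randrange(num_levels)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.
    Population standard deviation is used here; that choice is an assumption."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]
```

In this sketch, a training loop would call one of the schedule functions at each step to choose a difficulty level, synthesize prompts at that level, score a group of generated images, and pass the group's rewards to `group_relative_advantages`.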

Paper Structure

This paper contains 32 sections, 5 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overall performance of our CompGen, indicating that CompGen achieves state-of-the-art performance among models of the same scale.
  • Figure 2: Overview of our CompGen framework, which is incentivized to construct a curriculum through end-to-end reinforcement learning without requiring ground-truth images.
  • Figure 3: An illustrated example of scene graph corresponding to a specific difficulty level.
  • Figure 4: Qualitative comparison of our CompGen with other strong text-to-image generation models (SD1.5, SD2.1, SDXL, and Lumina-Next). Within each prompt, we color the elements for which at least one model makes an error: the object in blue, the attribute in brown, the relationship in green, and the count in purple. Icons denote Object, Attribute, Relationship, and Count, with separate marks indicating correct generation and errors. Additional examples appear in the appendix.
  • Figure 5: Scaling trend of CompGen with different curriculum scheduling strategies.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Definition 1: Scene Graph Formulated Difficulty
  • Definition 2: Synthetic Curriculum-based RL
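
The two definitions are listed here without their formal statements, so the following Python sketch only illustrates the general idea of sampling scene graphs at a chosen difficulty. Everything in it is an assumption: a scene graph is reduced to a tuple of counts (objects, attributes, relations), difficulty is taken to be the sum of those counts, and `MAX_COUNT`, `tau`, and the target density are hypothetical. The sampler is a plain, non-adaptive Metropolis random walk with a symmetric proposal, not the paper's adaptive algorithm.

```python
import math
import random

MAX_COUNT = 6  # hypothetical cap on each component of the graph


def difficulty(state):
    """Assumed difficulty: objects + attributes + relations."""
    n_obj, n_attr, n_rel = state
    return n_obj + n_attr + n_rel


def is_valid(state):
    """At least one object; a relation needs at least two objects."""
    n_obj, n_attr, n_rel = state
    if not 1 <= n_obj <= MAX_COUNT:
        return False
    if not (0 <= n_attr <= MAX_COUNT and 0 <= n_rel <= MAX_COUNT):
        return False
    return n_rel == 0 or n_obj >= 2


def log_target(state, target, tau):
    """Unnormalized log-density peaked at the target difficulty."""
    return -((difficulty(state) - target) ** 2) / (2 * tau ** 2)


def metropolis_sample(target, steps=500, tau=0.5, seed=0):
    """Random-walk Metropolis over count tuples."""
    rng = random.Random(seed)
    state = (1, 0, 0)
    for _ in range(steps):
        # Symmetric proposal: change one component by +1 or -1.
        idx = rng.randrange(3)
        delta = rng.choice((-1, 1))
        proposal = list(state)
        proposal[idx] += delta
        proposal = tuple(proposal)
        if not is_valid(proposal):
            continue  # invalid proposals are rejected; the chain stays put
        log_ratio = log_target(proposal, target, tau) - log_target(state, target, tau)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state = proposal
    return state
```

A call such as `metropolis_sample(target=6)` returns a count tuple whose difficulty tends to lie near 6; a separate step, not shown, would fill the counts with concrete objects, attributes, and relations and render the graph as a text prompt.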