IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, Bin Cui

TL;DR

IterComp addresses the fragmented strengths of diffusion models in compositional text-to-image generation by assembling a model gallery and learning composition-aware rewards. Using three targeted metrics (attribute binding, spatial relationships, and non-spatial relationships), it trains specialized reward models and applies an iterative feedback loop that progressively refines both the reward models and the base diffusion model. The approach achieves state-of-the-art compositionality and realism on benchmark tasks, shows strong qualitative and quantitative gains, and generalizes well to other models. Its closed-loop framework offers a scalable path toward holistic compositional generation in diffusion models.

Abstract

Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate their three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical proof demonstrates the effectiveness and extensive experiments show our significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp
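
The step from ranked gallery outputs to image-rank pairs can be illustrated with a minimal sketch. It assumes that, for each prompt, the six gallery images are ranked from best to worst and that every unordered pair of images yields one preference pair; this text does not spell out that construction, although it is consistent with the 3,500 prompts and 52,500 pairs reported in the dataset statistics. The model names, the function names, and the Bradley-Terry style pairwise loss are hypothetical illustrations of a common reward-model setup, not the authors' confirmed implementation.

```python
import math
from itertools import combinations


def rank_pairs(ranked_images):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each
    pair is always the higher-ranked image.
    """
    return list(combinations(ranked_images, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss, -log(sigmoid(s_w - s_l)).

    Assumed here as a typical reward-model objective; the source
    does not state the exact loss.
    """
    return math.log1p(math.exp(-(score_preferred - score_rejected)))


# Hypothetical placeholder names for the six gallery models,
# listed from best to worst for a single prompt.
gallery = ["model_a", "model_b", "model_c", "model_d", "model_e", "model_f"]

pairs = rank_pairs(gallery)
pairs_per_prompt = len(pairs)           # C(6, 2) pairs for one prompt
total_pairs = 3500 * pairs_per_prompt   # scaled to the reported prompt count

print(pairs_per_prompt, total_pairs)
```

Under these assumptions, a reward model that scores the preferred image higher than the rejected one receives a smaller loss, which is the signal a composition-aware reward model would be trained on.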

Paper Structure

This paper contains 40 sections, 2 theorems, 19 equations, 14 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

The unified optimization framework of iterative feedback learning can be formulated as:

Figures (14)

  • Figure 1: Motivation of IterComp. We select three types of compositional generation methods. The results show that different models exhibit distinct strengths across various aspects of compositional generation. The figure labeled fig:dataset further demonstrates these distinct strengths quantitatively.
  • Figure 1: Statistics on the composition-aware model preference dataset. The dataset consists of 3,500 text prompts, 27,500 images, and 52,500 image-rank pairs.
  • Figure 2: Overview of IterComp. We collect composition-aware model preferences from multiple models and employ an iterative feedback learning approach to enable the progressive self-refinement of both the base diffusion model and reward models.
  • Figure 3: The proportion of each model ranked first.
  • Figure 4: Qualitative comparison between our IterComp and three types of compositional generation methods: text-controlled, LLM-controlled, and layout-controlled approaches. IterComp is the first reward-controlled method for compositional generation, utilizing an iterative feedback learning framework to enhance the compositionality of generated images. Colored text denotes the advantages of IterComp in generated images.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Lemma 1
  • Theorem 1
  • Proof of Lemma 1
  • Proof of Theorem 1