HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

TL;DR

This work tackles the difficulty diffusion models face when following complex hierarchical prompts. It introduces Chain of Synthesis, which combines an LLM-driven parse-and-rewrite step with iterative in-context assembly to build images step by step. Training is guided by a hierarchical RL framework whose Decaying Stochasticity Schedule concentrates exploration in the early stages of generation. The approach is complemented by a three-part reward and the HiCoPrompt benchmark, which together support rigorous evaluation of compositional accuracy and concept coverage. Empirical results show that HiCoGen outperforms existing diffusion and subject-driven models in both quantitative metrics and qualitative fidelity, with improved handling of multi-object scenes and relations.

Abstract

Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
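The iterative loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_decompose` stands in for the LLM parser, `toy_generate` stands in for the in-context generative model, and `Canvas` is a hypothetical placeholder for an image that simply records which semantic units have been drawn so far.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Canvas:
    """Hypothetical stand-in for an image: lists the units drawn so far."""
    units: List[str] = field(default_factory=list)


def toy_decompose(prompt: str) -> List[str]:
    """Assumed stand-in for the LLM parser: split the prompt on ';'."""
    return [unit.strip() for unit in prompt.split(";") if unit.strip()]


def toy_generate(unit: str, context: Optional[Canvas]) -> Canvas:
    """Assumed stand-in for the generator: extend the previous canvas."""
    previous = context.units if context is not None else []
    return Canvas(units=previous + [unit])


def chain_of_synthesis(
    prompt: str,
    decompose: Callable[[str], List[str]] = toy_decompose,
    generate: Callable[[str, Optional[Canvas]], Canvas] = toy_generate,
) -> Canvas:
    """Synthesize units one at a time, feeding each result into the next step."""
    canvas: Optional[Canvas] = None
    for unit in decompose(prompt):
        # The image from the previous step is the visual context for this one.
        canvas = generate(unit, canvas)
    return canvas if canvas is not None else Canvas()


result = chain_of_synthesis(
    "a wooden table; a red vase on the table; a cat beside the vase"
)
print(result.units)
```

Passing the two stages in as callables keeps the loop independent of any particular LLM or diffusion backend; a real system would replace both stand-ins.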

Paper Structure

This paper contains 23 sections, 1 theorem, 17 equations, 6 figures, 6 tables.

Key Result

Theorem 1

Consider the reverse SDE (Eq. \ref{eq:reverse_sde}) with a fixed budget $\int_0^T \eta(t)^2 \, dt = C > 0$. Under Assumptions 1 and 2 below, the schedule $\eta(t)$ that maximizes the final sample diversity $\operatorname{Tr}(\operatorname{Cov}(\mathbf{z}_0))$ is a monotonically decreasing function of time $t$ (from $T$ to

Figures (6)

  • Figure 1: The motivation of HiCoGen. The semantic gap between text and images widens as the complexity of the text increases, particularly for prompts with hierarchical relationships. While a single T2I model performs well in generating individual objects, it suffers from concept omission and confusion when processing complex prompts. HiCoGen applies a Chain of Synthesis to complex text to preserve its semantic content.
  • Figure 2: The overall pipeline of our proposed HiCoGen framework. When facing a complex hierarchical compositional prompt, HiCoGen applies the Chain of Synthesis to progressively construct the image part-by-part and employs in-context generative models to assemble the different components into the final image. This ensures all the concepts in the text domain are present in the image domain.
  • Figure 3: Illustration of our proposed HiCoPrompt dataset. It features multiple clearly defined hierarchical relationships, as well as thoroughly described concrete subjects and attributes.
  • Figure 4: Visual results of HiCoGen. HiCoGen greatly mitigates concept omission and confusion in image generation. When handling prompts with clear hierarchical relationships and multiple complex subjects, it significantly outperforms other T2I models.
  • Figure 5: Similarity between samples in diffusion GRPO during training. Samples obtained from the baseline are highly similar, indicating low sample diversity.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1: Optimal Stochasticity Allocation for Diversity Maximization
  • Proof: Proof of Theorem 1
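
The budget constraint in Theorem 1 can be made concrete with a small numerical sketch. The exponential shape, the `decay_rate` parameter, and the function names below are illustrative assumptions, not the schedule proposed in the paper. The code only shows how a schedule that is largest at the start of generation ($t = T$) can be rescaled so that the discretized $\int_0^T \eta(t)^2 \, dt$ equals a chosen budget $C$; it does not reproduce the diversity result itself.

```python
import math
from typing import List


def decaying_schedule(
    num_steps: int, total_time: float, budget: float, decay_rate: float
) -> List[float]:
    """Return eta values in generation order (first entry corresponds to t = T).

    The exponential shape is an assumption made for illustration only.
    """
    dt = total_time / num_steps
    # Unnormalized shape: largest at the first generation step, then decaying.
    shape = [math.exp(-decay_rate * i / num_steps) for i in range(num_steps)]
    # Rescale so the Riemann sum of eta^2 * dt matches the budget C.
    energy = sum(s * s for s in shape) * dt
    scale = math.sqrt(budget / energy)
    return [scale * s for s in shape]


def spent_budget(etas: List[float], total_time: float) -> float:
    """Riemann-sum approximation of the integral of eta(t)^2 over [0, T]."""
    dt = total_time / len(etas)
    return sum(e * e for e in etas) * dt


etas = decaying_schedule(num_steps=10, total_time=1.0, budget=2.0, decay_rate=3.0)
constant = decaying_schedule(num_steps=10, total_time=1.0, budget=2.0, decay_rate=0.0)

print("decaying:", [round(e, 3) for e in etas])
print("constant:", [round(e, 3) for e in constant])
print("budgets: ", spent_budget(etas, 1.0), spent_budget(constant, 1.0))
```

Setting `decay_rate` to zero recovers a constant schedule with the same budget, which gives a like-for-like baseline when comparing how noise is distributed across steps.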