Progressive Text-to-Image Diffusion with Soft Latent Direction

YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang

TL;DR

This paper tackles the challenge of generating and editing images containing multiple entities under complex relational constraints in text-to-image generation. It introduces a progressive SRF framework that uses a large language model to decompose long prompts into structured synthesis/editing/erasing directives and applies soft latent direction to guide diffusion across steps. Through Stimulus, Response, and Fusion, the method steers cross-attention and fuses latent representations to insert, modify, or erase objects while preserving layout continuity, achieving higher fidelity than existing baselines. The approach enables interactive, stepwise control over multi-entity generation, setting a new benchmark for relational text-to-image synthesis and editing, albeit with limitations on prompts that are not easily decomposable.

Abstract

In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations, namely insertion, editing, and erasing, we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.
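
The progressive control flow described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Directive` fields, the operation names as strings, and the use of a plain dictionary (entity name to bounding box) as a stand-in for the image. No LLM or diffusion model is involved, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) layout box


@dataclass
class Directive:
    """One short command, as an LLM might emit it (hypothetical schema)."""
    op: str                           # "synthesis", "editing", or "erasing"
    entity: str                       # entity the command refers to
    box: Optional[Box] = None         # target region, used by synthesis
    new_entity: Optional[str] = None  # replacement description, used by editing


def apply_directive(scene: Dict[str, Box], d: Directive) -> Dict[str, Box]:
    """Return a new scene with one directive applied; the input is not mutated."""
    scene = dict(scene)
    if d.op == "synthesis":
        if d.box is None:
            raise ValueError("synthesis needs a target box")
        scene[d.entity] = d.box
    elif d.op == "editing":
        box = scene.pop(d.entity)                 # keep the region, change the content
        scene[d.new_entity or d.entity] = box
    elif d.op == "erasing":
        scene.pop(d.entity)
    else:
        raise ValueError(f"unknown operation: {d.op}")
    return scene


def run_progressive(directives: List[Directive]) -> List[Dict[str, Box]]:
    """Apply directives one at a time, keeping every intermediate scene."""
    scene: Dict[str, Box] = {}
    history = []
    for d in directives:
        scene = apply_directive(scene, d)
        history.append(scene)
    return history


directives = [
    Directive("synthesis", "cat", box=(0, 0, 4, 4)),
    Directive("synthesis", "dog", box=(4, 0, 8, 4)),
    Directive("editing", "cat", new_entity="tiger"),
    Directive("erasing", "dog"),
]
history = run_progressive(directives)
```

Keeping each intermediate scene in `history` mirrors the stepwise nature of the approach: every step has an inspectable result that a user could adjust before the next directive runs.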

Paper Structure (14 sections, 4 equations, 8 figures, 3 tables)

This paper contains 14 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Existing text-to-image synthesis approaches struggle with textual prompts involving multiple entities and specified relational directions. We propose to decompose the protracted prompt into a set of short commands, including synthesis, editing and erasing operations, using a Large Language Model (LLM) and progressively generate the image. Our strategy enhances both controllability and fidelity and allows for interactive modifications through user intervention at each generation step.
  • Figure 2: We employ a fine-tuned GPT model to deconstruct a comprehensive text into structured prompts, each classified under synthesis, editing, and erasing operations.
  • Figure 3: For the synthesis operation, we generate the layout indicated in the prompt from a frozen GPT-4 model, which subsequently yields the new bounding box coordinates for object insertion.
  • Figure 4: Overview of our unified framework, which emphasizes progressive synthesis, editing, and erasing. In each progressive step, a random latent $z_t$ is directed through the cross-attention map in inverse diffusion. Specifically, we design a soft stimulus loss that evaluates the positional difference between entity attention and the target mask region, leading to a gradient for updating the latent $z_{t-1}^{*}$ as a latent response. Subsequently, another forward diffusion pass is applied to denoise $z^*_{t}$, yielding $z^{*}_{t-1}$. In the latent fusion phase, we transform the previous $i$-th image into a latent code $z^{bg}_{t-1}$ using DDIM inversion. The blending of $z^{*}_{t-1}$ with $z^{bg}_{t-1}$ incorporates a dynamically evolving mask, which starts with a layout box and gradually shifts to cross-attention. Finally, $z^{*}_{t-1}$ undergoes multiple reverse diffusion steps and results in the $(i+1)$-th image.
  • Figure 5: Visual results generated by Stable Diffusion and Stimulus & Response. Stable Diffusion shows noticeable problems in positional generation (top), semantic and attribute coupling (middle), and object omission (bottom), while ours delivers precise outcomes.
  • ...and 3 more figures
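
The stimulus and fusion steps outlined in the Figure 4 caption can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's formulation: the loss form (share of attention mass outside the target region), the linear schedule that moves the mask from the layout box to the attention map, and all function names are hypothetical. The gradient-based latent update and the diffusion model itself are omitted.

```python
import numpy as np


def stimulus_loss(attn: np.ndarray, target_mask: np.ndarray) -> float:
    """Assumed loss: fraction of an entity's attention mass outside the target region."""
    total = attn.sum()
    if total == 0:
        return 1.0  # no attention anywhere counts as fully misplaced
    return float(1.0 - (attn * target_mask).sum() / total)


def evolving_mask(box_mask: np.ndarray, attn: np.ndarray, progress: float) -> np.ndarray:
    """Assumed schedule: linear blend from the layout box (progress=0)
    to the peak-normalized attention map (progress=1)."""
    progress = float(np.clip(progress, 0.0, 1.0))
    peak = attn.max()
    attn_norm = attn / peak if peak > 0 else np.zeros_like(attn)
    return (1.0 - progress) * box_mask + progress * attn_norm


def fuse_latents(z_new: np.ndarray, z_bg: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blend the newly generated latent into the background latent.
    A mask of shape (H, W) broadcasts over latents of shape (C, H, W)."""
    return mask * z_new + (1.0 - mask) * z_bg


rng = np.random.default_rng(0)
box_mask = np.zeros((8, 8))
box_mask[2:6, 2:6] = 1.0                 # layout box for the new entity
attn = np.zeros((8, 8))
attn[3:5, 3:5] = 0.25                    # attention concentrated inside the box
z_new = rng.normal(size=(4, 8, 8))       # stands in for z*_{t-1}
z_bg = rng.normal(size=(4, 8, 8))        # stands in for z^{bg}_{t-1}

loss = stimulus_loss(attn, box_mask)
mask = evolving_mask(box_mask, attn, progress=0.0)
z_fused = fuse_latents(z_new, z_bg, mask)
```

With `progress=0.0` the mask equals the layout box, so the fused latent takes the new content inside the box and the background everywhere else. Raising `progress` toward 1 lets the mask follow the shape of the attention map instead of the rectangle.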