LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou

TL;DR

LoCo is a training-free approach for layout-to-image synthesis that produces high-quality images aligned with both textual prompts and layout instructions. It introduces a Localized Attention Constraint (LAC) that leverages the semantic affinity between pixels in self-attention maps.

Abstract

Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.
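
The idea of using pixel affinities in self-attention maps to obtain a fuller object representation can be illustrated with a small sketch. This is not the paper's formulation, which is not given in this summary. It assumes one plausible reading: a token's cross-attention map is propagated through the self-attention matrix and then min-max rescaled. The function name `refine_cross_attention` and the toy 4-pixel inputs are hypothetical.

```python
import numpy as np

def refine_cross_attention(self_attn, cross_attn):
    """Spread a token's cross-attention along self-attention affinities.

    self_attn:  (N, N) array, row i holds pixel i's affinity to every pixel.
    cross_attn: (N,) array, one token's cross-attention over N pixels.
    Returns an (N,) map rescaled to roughly [0, 1].
    Assumption: propagation by a plain matrix product, chosen for illustration.
    """
    # Each pixel collects the cross-attention of the pixels it is affine to.
    refined = self_attn @ cross_attn
    # Min-max rescaling so maps from different tokens are comparable.
    span = refined.max() - refined.min()
    return (refined - refined.min()) / (span + 1e-8)

# Toy example: pixels 0 and 1 form one object, pixels 2 and 3 the background.
self_attn = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
# The token fires only on pixel 0, the salient part of the object.
cross_attn = np.array([1.0, 0.0, 0.0, 0.0])
refined = refine_cross_attention(self_attn, cross_attn)
```

In this toy case the non-salient object pixel (index 1) receives a high response after refinement, while the background pixels stay near zero.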

Paper Structure (16 sections, 11 equations, 7 figures, 5 tables)

This paper contains 16 sections, 11 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: (a) Accurate Spatial Control. Existing training-free layout-to-image synthesis (LIS) approaches struggle to generate high-quality images that adhere to the given layout instructions. In contrast, LoCo is able to provide accurate spatial control. (b) Plug-and-play. LoCo can be integrated into fully-supervised LIS methods, e.g., GLIGEN (Li et al., 2023), serving as a plug-and-play booster to enhance their performance.
  • Figure 2: Overview of LoCo. LoCo consists of three steps: (a) Attention Aggregation, (b) Localized Attention Constraint, and (c) Padding Token Constraint. At timestep $t$, we pass the latent feature $\mathbf{z}_{t}$ through the noise predictor to extract the cross-attention maps $\mathbf{A}^{c}$ and the self-attention map $\mathbf{A}^{s}$. For the $i$-th desired object, we obtain a refined cross-attention map $\mathbf{A}^{r}_{i}$ via Self-Attention Enhancement to represent the object's appearance accurately. The proposed constraints, i.e., $\mathcal{L}_{LAC}$ and $\mathcal{L}_{PTC}$, are then applied to encourage alignment between the attention maps and the layout instructions. Consequently, the latent feature $\mathbf{z}_{t}$ is updated with the gradient $\nabla\mathcal{L}_{LoCo}$ to obtain $\hat{\mathbf{z}}_{t}$ for denoising.
  • Figure 3: (a) Visualization of Self-Attention Enhancement (SAE). SAE highlights the non-salient parts of the corresponding objects. Therefore, $\mathbf{A}^{r}_{i}$ serves as a precise representation of the desired object. (b) Cross-attention maps of padding tokens. The examples show that the padding tokens, i.e., start-of-text tokens ([SoT]) and end-of-text tokens ([EoT]), also carry substantial semantic and layout information.
  • Figure 4: Synthesized images with various conditioning inputs, e.g., different locations and desired objects. LoCo is able to handle various spatial layouts and novel scenes while maintaining high image synthesis capability and precise concept coverage.
  • Figure 5: (a) Visual comparisons with previous methods. We show visual comparisons between LoCo and several training-free layout-to-image methods. The layout instructions are annotated on the images with dashed boxes. Our results faithfully adhere to both textual and layout conditions, outperforming prior approaches in terms of spatial control and image quality. (b) Performance boost on a fully-supervised layout-to-image method. LoCo significantly enhances the performance of GLIGEN (Li et al., 2023) in generating multiple small objects. Please zoom in for a better view.
  • ...and 2 more figures
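
The latent update described in the Figure 2 caption, where $\mathbf{z}_{t}$ is moved along the gradient of a layout loss before denoising, can be sketched on a toy problem. The sketch below is a stand-in under loud assumptions, not the paper's $\mathcal{L}_{LoCo}$: the attention map is taken to be a softmax of the latent itself (instead of coming from a noise predictor), the loss is $(1 - s)^2$ with $s$ the attention mass inside the target box, and the names `layout_loss_and_grad`, `guide_latent`, `step_size`, and `num_steps` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layout_loss_and_grad(z, box_mask):
    """Return the loss (1 - s)^2 and its gradient with respect to z.

    s is the attention mass inside the box.
    Toy assumption: the attention map is softmax(z) over flattened pixels.
    """
    attn = softmax(z)
    inside = float((attn * box_mask).sum())
    loss = (1.0 - inside) ** 2
    # For a softmax map, d(inside)/dz_k = attn_k * (mask_k - inside).
    grad = -2.0 * (1.0 - inside) * attn * (box_mask - inside)
    return loss, grad

def guide_latent(z, box_mask, step_size=0.5, num_steps=100):
    """Apply plain gradient-descent updates to a copy of the latent."""
    z = z.copy()
    for _ in range(num_steps):
        _, grad = layout_loss_and_grad(z, box_mask)
        z = z - step_size * grad
    return z

rng = np.random.default_rng(0)
z_t = rng.normal(size=16)        # flattened 4x4 toy latent
box_mask = np.zeros(16)
box_mask[[5, 6, 9, 10]] = 1.0    # centre 2x2 box of the 4x4 grid
loss_before, _ = layout_loss_and_grad(z_t, box_mask)
z_hat = guide_latent(z_t, box_mask)
loss_after, _ = layout_loss_and_grad(z_hat, box_mask)
```

In a real diffusion pipeline the gradient would be obtained by automatic differentiation through the noise predictor's attention layers; the analytic gradient here only keeps the example dependency-free.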