Training-free Composite Scene Generation for Layout-to-Image Synthesis

Jiaqi Liu, Tao Huang, Chang Xu

TL;DR

This work addresses the challenge of generating images with precise spatial layouts from diffusion models without requiring additional training data. It introduces a training-free framework that uses selective sampling to refine intra-token loss, and adds inter-token and self-attention constraints, complemented by attention redistribution during forward diffusion, all guided by bounding-box layouts. The approach achieves superior object localization and semantic fidelity compared to existing training-free methods, demonstrating strong gains in AP$_{50}$, AP, and CLIP scores and compatibility with GLIGEN. These contributions significantly reduce data acquisition costs for layout-aware image synthesis and enhance the applicability of diffusion-based layout-to-image generation in multi-object scenes.

Abstract

Recent breakthroughs in text-to-image diffusion models have significantly advanced the generation of high-fidelity, photo-realistic images from textual descriptions. Yet, these models often struggle with interpreting spatial arrangements from text, hindering their ability to produce images with precise spatial configurations. To bridge this gap, layout-to-image generation has emerged as a promising direction. However, training-based approaches are limited by the need for extensively annotated datasets, leading to high data acquisition costs and a constrained conceptual scope. Conversely, training-free methods face challenges in accurately locating and generating semantically similar objects within complex compositions. This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase. By refining intra-token loss with selective sampling and enhancing the diffusion process with attention redistribution, we propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships. Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity. Code is available at https://github.com/Papple-F/csg.git.

Paper Structure (24 sections, 11 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Composite scene generation facilitates the blending of various foreground and background elements into an image based on layout details.
  • Figure 2: In the workflow of Composite Scene Generation (CSG), at each refinement stage, we capture both self- and cross-attentions within a UNet structure. For self-attention, we aggregate self-attentions within each mask area $\bm{m}_i$ and calculate $\mathcal{L}_{self}$, which determines if pixel-level interaction is mostly constrained within the target area. For cross-attention, we first obtain $\mathcal{L}_{intra}$, a proportional measure of in-box and out-box cross-attentions for each attending token. Next, we assess the cross-attentions for all attending tokens within the same box area to obtain $\mathcal{L}_{inter}$, determining if the cross-attention of the current token is dominant within its own region. After a finite number of refinement steps, the latent is updated through the gradient of all three loss components. To further enhance the refinement process, we implement attention redistribution between each refinement stage.
  • Figure 3: Visual ablation studies on the components of the proposed method.
  • Figure 4: Visual comparison with concurrent training-free methods, including MultiDiffusion [bar2023multidiffusion], BoxDiff [xie2023boxdiff], and Layout-control [chen2024training]. Layout information is sampled from COCO [lin2014microsoft] with 3 distinct objects.
  • Figure 5: Limitations: synthesized images can be globally incoherent and can show incorrect attribute binding.
  • ...and 4 more figures
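
The three loss terms named in the Figure 2 caption can be illustrated with a minimal sketch. The exact formulas are not given in this text, so everything below is an assumption made for illustration: the function names, the flat-list representation of attention maps, the 0/1 masks, and the simple "one minus in-region share" form of each loss are hypothetical and are not the authors' definitions. The sketch only shows the kind of quantity each term measures: in-box versus out-box cross-attention for one token, dominance of one token among all tokens attending to its box, and how much self-attention from masked pixels stays inside the mask.

```python
# Illustrative sketch only. The loss forms below are assumptions, not the
# paper's equations. Attention maps are flat lists over pixels; masks are
# 0/1 lists of the same length.

def intra_token_loss(attn, mask):
    """Assumed form: 1 - (in-box attention mass / total mass) for one token."""
    total = sum(attn)
    inside = sum(a * m for a, m in zip(attn, mask))
    return 1.0 - inside / total if total > 0 else 1.0


def inter_token_loss(attn_by_token, token, mask):
    """Assumed form: 1 - share of in-box attention held by `token`
    among all attending tokens."""
    own = sum(a * m for a, m in zip(attn_by_token[token], mask))
    everyone = sum(
        a * m for attn in attn_by_token.values() for a, m in zip(attn, mask)
    )
    return 1.0 - own / everyone if everyone > 0 else 1.0


def self_attention_loss(self_attn, mask):
    """Assumed form: mean fraction of self-attention that leaves the mask,
    averaged over the pixels (rows) inside the mask."""
    leaks = []
    for row, in_mask in zip(self_attn, mask):
        if not in_mask:
            continue  # only pixels inside the target area contribute
        total = sum(row)
        inside = sum(a * m for a, m in zip(row, mask))
        leaks.append(1.0 - inside / total if total > 0 else 0.0)
    return sum(leaks) / len(leaks) if leaks else 0.0


# Toy example: 4 pixels, the box for "dog" covers the first two.
mask_dog = [1, 1, 0, 0]
cross = {
    "dog": [0.5, 0.25, 0.125, 0.125],
    "cat": [0.25, 0.0, 0.25, 0.5],
}
self_attn = [
    [0.5, 0.5, 0.0, 0.0],      # pixel 0 attends only inside the box
    [0.25, 0.25, 0.25, 0.25],  # pixel 1 leaks half its attention outside
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

l_intra = intra_token_loss(cross["dog"], mask_dog)
l_inter = inter_token_loss(cross, "dog", mask_dog)
l_self = self_attention_loss(self_attn, mask_dog)
total_loss = l_intra + l_inter + l_self  # unweighted sum, also an assumption
```

In the workflow described in the caption, the latent would be updated using the gradient of the combined loss, which requires an automatic-differentiation framework; this pure-Python sketch only evaluates the loss values and does not perform that update.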