
Obtaining Favorable Layouts for Multiple Object Generation

Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum

TL;DR

The paper tackles the challenge of generating images with multiple specified subjects using diffusion-based text-to-image models, where traditional methods often neglect or blend subjects. It introduces a three-phase framework that (i) excites and separates per-subject cross-attention maps in the initial diffusion steps, (ii) derives and rearranges per-subject masks in the latent space, and (iii) enforces alignment of attention maps to fixed masks during later steps. The method leverages novel losses on cross-attention maps and a latent-space reallocation strategy to produce more faithful layouts, validated by extensive quantitative and qualitative comparisons against strong baselines. The results show substantial improvements in multi-subject fidelity, with careful discussion of limitations such as increased latency and potential layout-induced tradeoffs.

Abstract

Large-scale text-to-image models that can generate high-quality and diverse images based on textual prompts have shown remarkable success. These models ultimately aim to create complex scenes, and addressing the challenge of multi-subject generation is a critical step towards this goal. However, existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects. When presented with a prompt containing more than one subject, these models may omit some subjects or merge them together. To address this challenge, we propose a novel approach based on a guiding principle: we allow the diffusion model to initially propose a layout, and then we rearrange the layout grid. This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us. We introduce new loss terms aimed at reducing XAM entropy for clearer spatial definition of subjects, reducing the overlap between XAMs, and ensuring that XAMs align with their respective masks. We contrast our approach with several alternative methods and show that it more faithfully captures the desired concepts across a variety of text prompts.
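The abstract names three loss terms on XAMs (entropy, overlap, and mask alignment) without giving their formulas in this section. The following is a minimal NumPy sketch of what such terms could look like, not the paper's definitions: the function names, the normalization of each map to sum to 1, the element-wise-minimum overlap measure, and the outside-mask mass used for alignment are all illustrative assumptions.

```python
import numpy as np


def normalize(xam, eps=1e-12):
    """Scale a non-negative attention map so that its entries sum to (about) 1."""
    xam = np.asarray(xam, dtype=np.float64)
    return xam / (xam.sum() + eps)


def entropy_loss(xam, eps=1e-12):
    """Shannon entropy of the normalized map (assumed form).

    Low when attention is concentrated on a few patches, high when it is spread out.
    """
    p = normalize(xam, eps)
    return float(-(p * np.log(p + eps)).sum())


def overlap_loss(xam_a, xam_b):
    """Sum of the element-wise minimum of two normalized maps (assumed form).

    Zero when the two subjects attend to disjoint patches.
    """
    return float(np.minimum(normalize(xam_a), normalize(xam_b)).sum())


def mask_alignment_loss(xam, mask):
    """Share of normalized attention mass that falls outside a binary mask (assumed form)."""
    p = normalize(xam)
    outside = 1.0 - np.asarray(mask, dtype=np.float64)
    return float((p * outside).sum())
```

Under these assumptions, a map with all of its mass on one patch has near-zero entropy, two maps peaked at different patches have zero overlap, and a map fully contained in its mask has zero alignment loss. In a real diffusion pipeline these terms would be computed on differentiable tensors so that their gradients can steer the latent.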

Paper Structure (14 sections, 9 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: This figure shows the generation outputs of our method and competing methods for multiple prompts, with varying numbers of subjects and objects.
  • Figure 2: Illustrated herein is the sequential evolution of XAMs throughout the generation (backward) process. Commencing on the left with $t=T$, the XAMs exhibit a high degree of spatial entropy, signifying an unorganized state. During Phase 1, spanning $t=(T,T-\tau)$, the process strategically consolidates patches pertaining to identical subjects while concurrently segregating the XAMs of distinct subjects. The resulting XAMs at $t=\tau$ manifest enhanced organization and concentrated focus, enabling a preliminary prediction of the subjects' potential generation loci. Phase 2 involves optimizing the spatial arrangement and generating the masks that will be used in Phase 3; the masks shown are after Gaussian smoothing. In Phase 3, the attention maps are subtly coerced to align with the predefined masks. The rightmost column depicts the remainder of the diffusion process, which is instrumental in mitigating artifacts induced by the optimization process.
  • Figure 3: The computation of the blocking masks $B_t^s$. The subject tokens $s_i$ are sorted from largest excitation to smallest. At every step, the mask $B_t^{s_i}$ accumulates the masked regions from its predecessors and adds a rectangle around the location of the maximal value in $\tilde{A}_t^{s_i} = (1-B_t^{s_{i-1}})\odot A_t^{s_i}$. Note that (h) will not be used.
  • Figure 4: This figure illustrates Phase 2 of the process. On the left, three masks represent the three subjects. In the middle column, we observe the initial masks, which estimate the patches in the attention maps contributing to each subject. On the right, we observe the final masks, after they have been shifted to their new locations. These final masks will subsequently guide the shifting of the attention maps $A_t^s$ towards their new locations.
  • Figure 5: This figure displays the outputs generated by our method in response to the prompts "A chicken and a duck with a ball at the beach" and "A dog, a cat, and a bear at the beach". The three rightmost images depict the attention maps at step 9, a pivotal moment in the generation process that significantly influences the layout of the generated image. This step was chosen to highlight its critical role in determining the spatial arrangement of the depicted entities. The visualization helps us analyze the importance of separating the attention maps.
  • ...and 2 more figures
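
The blocking-mask computation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: "excitation" is taken to be the maximum value of each XAM, the rectangle is a square of hypothetical half-width `half_size` clipped at the map border, the mask preceding the first subject is all zeros, and the function name `blocking_masks` is invented for this example.

```python
import numpy as np


def blocking_masks(xams, half_size=1):
    """Build cumulative blocking masks for a set of per-subject attention maps.

    xams: dict mapping a subject token to a 2D non-negative array (all the same shape).
    Returns (order, masks): the tokens sorted by decreasing peak value, and for each
    token the binary mask accumulated up to and including that token.
    """
    if not xams:
        return [], {}
    # Assumption: "excitation" of a subject is the peak value of its attention map.
    order = sorted(xams, key=lambda s: float(np.max(xams[s])), reverse=True)
    shape = np.asarray(xams[order[0]]).shape
    prev = np.zeros(shape, dtype=np.float64)  # nothing is blocked before the first subject
    masks = {}
    for s in order:
        # Suppress regions already claimed by stronger subjects: (1 - B_prev) * A_s.
        masked = (1.0 - prev) * np.asarray(xams[s], dtype=np.float64)
        r, c = np.unravel_index(int(np.argmax(masked)), shape)
        cur = prev.copy()
        # Add a rectangle around the peak; slicing clips it at the upper border.
        cur[max(r - half_size, 0): r + half_size + 1,
            max(c - half_size, 0): c + half_size + 1] = 1.0
        masks[s] = cur
        prev = cur
    return order, masks
```

For example, with two 6x6 maps where a hypothetical "cat" token peaks at (1, 1) and a weaker "dog" token has its largest value at the same patch and a second peak at (4, 4), the "dog" rectangle is placed around (4, 4), because the region around (1, 1) is already blocked by "cat".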