Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or

TL;DR

This work tackles the difficulty of multi-subject image generation in diffusion models by replacing externally imposed layouts with a noise-induced layout that is prompt-aligned and refined during denoising. A lightweight network predicts a soft-layout $S^t$ from diffusion features and guides the process via hard-layouts $M^t$, enforced by Decisive Guidance that combines cross-attention, intra-cluster variance, and temporal boundary losses. By deriving layouts from the initial noise and updating them iteratively, the method preserves the model's prior, reduces inter-subject leakage, and achieves diverse, prompt-faithful compositions across many subjects and attributes. The approach delivers strong text-image alignment and layout diversity while maintaining distributional diversity, albeit with higher computational cost and reliance on the pretrained model's multi-subject exposure. Overall, it enables robust, layout-free multi-subject generation that scales to complex prompts and personalization, advancing practical applications in image synthesis.
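To make the relationship between the soft-layout $S^t$ and the hard-layout $M^t$ concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the soft-layout is available as one feature vector per pixel, and that the hard-layout comes from plain k-means with one cluster per region; the summary only says the soft-layout is clustered, so the choice of k-means is an assumption. The names `kmeans_hard_layout` and `intra_cluster_variance` are hypothetical. The second function illustrates one plausible form of an intra-cluster variance term: the mean squared distance of each pixel's vector to its cluster mean.

```python
import random


def kmeans_hard_layout(soft, k, iters=10, seed=0):
    """Cluster per-pixel soft-layout vectors into k groups (plain k-means).

    soft: list of per-pixel feature vectors (a flattened toy soft-layout).
    Returns one integer cluster label per pixel, i.e. a toy hard-layout.
    Assumption: k-means stands in for whatever clustering the paper uses.
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(soft, k)]
    labels = [0] * len(soft)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
            for v in soft
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(soft, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def intra_cluster_variance(soft, labels, k):
    """Mean squared distance of each pixel's vector to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = [v for v, lab in zip(soft, labels) if lab == c]
        if not members:
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(v, mean)) for v in members
        )
    return total / len(soft)


# Toy soft-layout: six "pixels" forming two well-separated groups.
soft = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = kmeans_hard_layout(soft, k=2)
tight = intra_cluster_variance(soft, labels, k=2)
merged = intra_cluster_variance(soft, [0] * len(soft), k=1)
```

In this toy case, `tight` is much smaller than `merged`: forcing both groups into a single cluster inflates the variance. This mirrors the intuition that high variance inside a cluster signals more than one subject sharing a region.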

Abstract

Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.

Paper Structure

This paper contains 33 sections, 5 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Our method generates images with multiple subjects without requiring external layout inputs. By following the innate noise-induced layout encoded in the sampled initial noise, we preserve the model's prior and achieve diverse compositions. The second row shows the initial noise-induced layouts of the corresponding output images above. As can be seen, the initial layouts reflect the final composition of the generated images.
  • Figure 2: Our method steers the denoising process by applying iterative guidance (turquoise box) after each denoising step (orange regions). At denoising step $t$ (left orange box), we predict a soft-layout $S^t$ based on the diffusion model's features, and cluster it to form a hard-layout $M^t$ (purple box). This hard-layout is then used to control the layout of the next denoising step (right orange box). In the guidance stage, we optimize the latent image with the objective of aligning its updated soft-layout with the hard-layout $M^t$.
  • Figure 3: The figure illustrates the progression of the soft- and hard-layouts in three cases. The top row shows results from our full method. The middle row presents our method without guidance. The bottom row shows vanilla SDXL, where only the soft-layout extracted from the noisy latents is displayed. Below each image, we show the hard-layout obtained at the final timestep.
  • Figure 4: Without guidance, we observe two types of layout failures: (a) intra-cluster over-generation, where multiple subjects are assigned to a single cluster due to high variance in the soft-layout; and (b) inconsistent cluster borders across timesteps, leading to subject over-generation and leakage caused by oscillating boundaries.
  • Figure 5: Generated images across different seeds. Our method follows the noise-induced layouts to generate prompt-aligned images with diverse compositions.
  • ...and 8 more figures
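
The second failure mode in the Figure 4 caption, cluster borders that shift between timesteps, can be illustrated with a small diagnostic. This is a sketch under stated assumptions, not the paper's temporal boundary loss. It assumes hard-layouts are 2D grids of integer cluster labels and that label ids are indexed consistently across timesteps, neither of which this summary specifies. The helper names `boundary_pixels` and `label_change_rate` are hypothetical.

```python
def boundary_pixels(layout):
    """Return the (row, col) cells that touch a differently labeled 4-neighbor."""
    h, w = len(layout), len(layout[0])
    border = set()
    for r in range(h):
        for c in range(w):
            # Checking only the down and right neighbors visits each
            # adjacent pair of cells exactly once.
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < h and cc < w and layout[r][c] != layout[rr][cc]:
                    border.add((r, c))
                    border.add((rr, cc))
    return border


def label_change_rate(prev_layout, curr_layout):
    """Fraction of cells whose cluster label differs between two timesteps."""
    cells = [
        (p, c)
        for prev_row, curr_row in zip(prev_layout, curr_layout)
        for p, c in zip(prev_row, curr_row)
    ]
    return sum(p != c for p, c in cells) / len(cells)


# Toy hard-layouts at two consecutive timesteps: the border between
# cluster 0 and cluster 1 moves one column to the left.
prev_layout = [[0, 0, 1, 1],
               [0, 0, 1, 1]]
curr_layout = [[0, 1, 1, 1],
               [0, 1, 1, 1]]

rate = label_change_rate(prev_layout, curr_layout)
prev_border = boundary_pixels(prev_layout)
```

Here two of the eight cells change label, and both lie on the previous border. A rate that keeps fluctuating from one step to the next would correspond to the oscillating boundaries the caption describes.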