Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
TL;DR
This work addresses multi-subject image generation in diffusion models by replacing externally imposed layouts with a noise-induced layout that is aligned with the prompt and refined during denoising. A lightweight network predicts a soft-layout $S^t$ from diffusion features, and the process is guided via hard-layouts $M^t$, enforced by Decisive Guidance, which combines cross-attention, intra-cluster variance, and temporal boundary losses. Because the layout is derived from the initial noise and updated at every step, the method preserves the model's prior, reduces inter-subject leakage, and produces prompt-faithful compositions across many subjects and attributes. It delivers strong text-image alignment while keeping both layout diversity and the diversity of the model's original distribution, at the cost of higher computation and a reliance on the pretrained model's exposure to multi-subject images. Overall, it enables robust multi-subject generation without an externally supplied layout, and it scales to complex prompts and personalization.
Abstract
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
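To make the relation between a soft layout and a hard layout concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a hard layout assigns each pixel to the subject with the highest soft score (a per-pixel argmax), and that an intra-cluster variance term can be illustrated as the variance of a scalar feature inside each resulting region. The function names `hard_layout` and `intra_cluster_variance`, the argmax rule, and the scalar features are all assumptions made for illustration; the text above does not specify these details.

```python
from typing import Dict, List

Grid = List[List[float]]


def hard_layout(soft: List[Grid]) -> List[List[int]]:
    """Turn per-subject soft scores into one subject label per pixel.

    `soft[s][y][x]` is the soft score of subject `s` at pixel (y, x).
    Assumption: the hard layout is the per-pixel argmax over subjects;
    ties go to the lowest subject index.
    """
    k = len(soft)
    h, w = len(soft[0]), len(soft[0][0])
    return [
        [max(range(k), key=lambda s: soft[s][y][x]) for x in range(w)]
        for y in range(h)
    ]


def intra_cluster_variance(features: Grid, labels: List[List[int]]) -> float:
    """Average, over regions, of the feature variance inside each region.

    Assumption: one scalar feature per pixel stands in for the
    high-dimensional diffusion features a real model would provide.
    """
    groups: Dict[int, List[float]] = {}
    for feature_row, label_row in zip(features, labels):
        for value, label in zip(feature_row, label_row):
            groups.setdefault(label, []).append(value)

    variances = []
    for values in groups.values():
        mean = sum(values) / len(values)
        variances.append(sum((v - mean) ** 2 for v in values) / len(values))
    return sum(variances) / len(variances) if variances else 0.0


# Two subjects on a 2x2 grid: subject 0 dominates the left column,
# subject 1 dominates the right column.
soft_scores = [
    [[0.9, 0.2], [0.8, 0.4]],
    [[0.1, 0.8], [0.2, 0.6]],
]
labels = hard_layout(soft_scores)
uniform_features = [[1.0, 5.0], [1.0, 5.0]]
print(labels)
print(intra_cluster_variance(uniform_features, labels))
```

In this toy setting, features that are constant within each region give a variance of zero, while features that differ inside a region give a positive value. This is the sense in which a variance-style penalty would favor regions whose contents are internally consistent.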
