All Seeds Are Not Equal: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
TL;DR
This work shows that the initial random seed significantly influences compositional text-to-image generation in diffusion models, causing variability in object counts and spatial relations. It introduces Comp90, a dataset for systematic seed analysis, and a seed-mining pipeline that uses CogVLM2 to identify reliable seeds; images generated from those seeds then serve as fine-tuning data. By restricting fine-tuning to the attention Q/K projections and training on self-generated data from the top seeds, the authors report relative gains of 29.3% in numerical composition and 60.7% in spatial composition on Stable Diffusion (19.5% and 21.1% on PixArt-α), while maintaining image quality and diversity. The approach offers a practical, annotation-free way to improve compositional ability in models such as Stable Diffusion 2.1 and PixArt-α without requiring explicit layout inputs.
Abstract
Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
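The idea of mining reliable seeds can be illustrated with a minimal sketch. It assumes that each generated image has already been judged correct or incorrect for its prompt (for example by a vision-language model) and that the outcome is stored as a (seed, is_correct) pair. The function name rank_seeds, the record format, and the toy data are hypothetical and are not taken from the paper; the sketch shows only the ranking step, not image generation or fine-tuning.

```python
from collections import defaultdict


def rank_seeds(records, top_k):
    """Rank seeds by the fraction of prompts they rendered correctly.

    records: iterable of (seed, is_correct) pairs, one per generated image.
             This format is an assumption made for illustration.
    top_k:   number of most reliable seeds to return.
    Returns (top_seeds, accuracy), where accuracy maps seed -> fraction correct.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seed, is_correct in records:
        totals[seed] += 1
        hits[seed] += int(bool(is_correct))

    accuracy = {seed: hits[seed] / totals[seed] for seed in totals}
    # Sort by descending accuracy; break ties by seed value so the
    # result is deterministic.
    ranked = sorted(accuracy, key=lambda s: (-accuracy[s], s))
    return ranked[:top_k], accuracy


# Toy, made-up judgments: seed 0 is always correct, seed 2 never is.
toy_records = [
    (0, True), (0, True),
    (1, False), (1, True),
    (2, False), (2, False),
]
top_seeds, accuracy = rank_seeds(toy_records, top_k=2)
print(top_seeds, accuracy)
```

In a full pipeline, the seeds returned here would be the ones used to generate the fine-tuning images. How many seeds to keep, and whether to also discard incorrect images from those seeds, are design choices this sketch leaves open.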
