All Seeds Are Not Equal: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

TL;DR

This work shows that the initial random seed strongly influences compositional text-to-image generation in diffusion models, causing variability in object counts and spatial relations. It introduces Comp90, a dataset for systematic seed analysis, and a seed-mining pipeline that uses CogVLM2 to identify reliable seeds; images generated from these seeds then serve as training data for fine-tuning. By restricting fine-tuning to the attention Q/K projections and training on self-generated data from the top seeds, the authors report relative gains of 29.3% in numerical composition and 60.7% in spatial composition on Stable Diffusion, while maintaining image quality and diversity. The approach offers a practical, annotation-free way to improve compositional ability in models such as Stable Diffusion 2.1 and PixArt-α without requiring explicit layout inputs, which makes image synthesis for complex prompts more reliable.

Abstract

Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
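
The seed-mining idea described in the abstract can be illustrated with a minimal sketch. It assumes that each generated image has already been judged correct or incorrect by some checker (the TL;DR names CogVLM2 for this role; here the labels are simply given as booleans). The function names `rank_seeds` and `top_k_seeds` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rank_seeds(records):
    """Rank seeds by the fraction of their images judged correct.

    records: iterable of (seed, is_correct) pairs, one per generated image.
    Returns a list of (seed, accuracy) sorted from most to least reliable.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for seed, is_correct in records:
        total[seed] += 1
        correct[seed] += int(is_correct)
    accuracy = {seed: correct[seed] / total[seed] for seed in total}
    # Sort by accuracy (descending); break ties by seed id for determinism.
    return sorted(accuracy.items(), key=lambda item: (-item[1], item[0]))


def top_k_seeds(records, k):
    """Return the k seeds with the highest accuracy."""
    return [seed for seed, _ in rank_seeds(records)[:k]]


# Toy labels (hypothetical): four images per seed, with a made-up verdict each.
toy_records = [
    (8, True), (8, True), (8, True), (8, False),
    (3, False), (3, True), (3, False), (3, False),
    (5, True), (5, False), (5, True), (5, False),
]

if __name__ == "__main__":
    print(rank_seeds(toy_records))
    print(top_k_seeds(toy_records, 2))
```

In this sketch, images generated from the selected seeds would then form the fine-tuning set; how many seeds to keep is a design choice that the sketch leaves to the caller through `k`.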

Paper Structure

This paper contains 37 sections, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Example images generated by Stable Diffusion 2.1 and by our method. Existing text-to-image diffusion models are prone to mistakes in numeracy and spatial relations.
  • Figure 2: Initial seeds and the average attention maps of object tokens. We generate 64 images for each initial seed from $0$ to $7$, using Stable Diffusion 2.1 (left) and PixArt-$\alpha$ (right); each image visualizes one seed. For each seed, we show the average binarized cross-attention maps.
  • Figure 4: Averaged object attention masks of generated images with correct and incorrect object counts/positions. We generate 300 images for each of the four prompts with random seeds, using Stable Diffusion 2.1 (left) and PixArt-$\alpha$ (right). For each prompt, we compute the average of the binarized cross-attention maps. The rightmost plot in each panel visualizes the cross-attention maps of the 300 generated images using t-SNE (van der Maaten and Hinton, 2008), showing that the attention maps of correct images and incorrect images tend to form different clusters.
  • Figure 5: Overview of the proposed approach. We take spatial composition as an example to illustrate (a) our seed mining strategy. With reliable seeds (e.g., seed 8 in this case), we can (b) directly enhance the generation process to improve the compositional accuracy, or (c) fine-tune the model to achieve seed-independent enhancement.
  • Figure 6: Accuracy distributions of random seeds on different tasks. Each line depicts the performance of 100 seeds for the corresponding task, sorted by their performance. As can be seen, top-performing seeds significantly outperform the rest.
  • ...and 4 more figures
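
The "average binarized cross-attention map" mentioned in the captions of Figures 2 and 4 can be sketched as follows. This is only an illustration under stated assumptions: attention maps are represented as nested lists of floats, and a fixed threshold of 0.5 is used for binarization. The captions do not specify the binarization rule, so the threshold and the helper names `binarize` and `average_binarized` are hypothetical.

```python
def binarize(attn_map, threshold):
    """Turn a 2D attention map into a 0/1 mask (1 where value > threshold)."""
    return [[1 if value > threshold else 0 for value in row] for row in attn_map]


def average_binarized(maps, threshold=0.5):
    """Average the binarized masks of several same-sized attention maps.

    Each output cell is the fraction of images whose mask is 1 at that cell.
    The default threshold is an assumption, not a value from the paper.
    """
    if not maps:
        raise ValueError("need at least one attention map")
    height, width = len(maps[0]), len(maps[0][0])
    totals = [[0] * width for _ in range(height)]
    for attn_map in maps:
        mask = binarize(attn_map, threshold)
        for i in range(height):
            for j in range(width):
                totals[i][j] += mask[i][j]
    count = len(maps)
    return [[cell / count for cell in row] for row in totals]


# Two toy 2x2 maps (made-up values) standing in for per-image attention maps.
toy_maps = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.6], [0.1, 0.3]],
]

if __name__ == "__main__":
    print(average_binarized(toy_maps))
```

Cells close to 1 in the averaged map mark regions where the object token is attended to in most images, which is the kind of per-seed or per-prompt pattern the figures visualize.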