InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang

TL;DR

This work tackles the persistent misalignment between text prompts and diffusion-generated images by identifying invalid initial noise as a core cause. It introduces Initial Noise Optimization (InitNO), which partitions the initial latent space into valid and invalid regions using a cross-attention response score and a self-attention conflict score, and then optimizes the initial noise so that it lies in the valid region while staying aligned with the prior distribution. The method minimizes a joint loss $L_{joint} = \lambda_1 L_{CrossAttn} + \lambda_2 L_{SelfAttn} + \lambda_3 L_{KL}$, where $L_{KL}$ enforces compatibility with the standard Gaussian prior, and updates the distribution parameters with Adam to produce prompt-faithful images without retraining. Empirically, InitNO yields better qualitative and quantitative alignment with prompts, shows strong grounding on layout-to-image tasks, and remains plug-and-play for existing diffusion models, enabling training-free controllable generation for grounded T2I synthesis.
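
The joint objective above can be illustrated with a minimal sketch. It assumes the "distribution parameters" are a per-dimension mean `mu` and standard deviation `sigma` applied to a sampled noise vector, so that $L_{KL}$ has the standard closed form for a diagonal Gaussian against $\mathcal{N}(0, I)$. The function names, the default weights, and the placeholder attention-loss values are hypothetical and are not taken from the authors' code.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def joint_loss(l_cross, l_self, l_kl, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum lambda_1*L_CrossAttn + lambda_2*L_SelfAttn + lambda_3*L_KL.

    The default weights are placeholders, not values from the paper.
    """
    lam1, lam2, lam3 = lambdas
    return lam1 * l_cross + lam2 * l_self + lam3 * l_kl

def reparameterize(mu, sigma, eps):
    """Shift and scale a sampled noise vector eps (assumed parameterization)."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Toy usage: the attention losses would come from the diffusion model's
# attention maps; fixed numbers stand in for them here.
mu, sigma = [0.1, -0.2], [1.0, 0.9]
l_kl = kl_to_standard_normal(mu, sigma)
total = joint_loss(l_cross=0.4, l_self=0.2, l_kl=l_kl)
noise = reparameterize(mu, sigma, eps=[0.3, -0.7])
```

With `mu = 0` and `sigma = 1` the KL term is zero, so the penalty only grows as the optimized noise drifts away from the standard Gaussian prior that the diffusion model expects.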

Abstract

Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at https://github.com/xiefan-guo/initno.

Paper Structure (14 sections, 9 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: Example results synthesized by SD and ours.
  • Figure 2: InitNO. We examine how different random noise samples influence the generated results. Notably, when different noises are fed into SD under an identical text prompt, there is a marked discrepancy in how well the generated images align with the text. Unsuccessful cases are delineated by gray contours, while successful instances are indicated by yellow contours. This observation underscores the pivotal role of initial noise in determining the success of the generation process. Based on this insight, we divide the initial noise space into valid and invalid regions. Initial Noise Optimization (InitNO), indicated by the orange arrow, can guide any initial noise into the valid region, thereby synthesizing high-fidelity results (orange contours) that precisely correspond to the given prompt. Images at the same location use the same random seed.
  • Figure 3: Visualization of the attention maps.
  • Figure 4: Comparison of Attend-and-Excite with our method. Attend-and-Excite suffers from a trade-off between under-optimization and over-optimization. Under-optimization (lower path) leaves the latent confined to invalid regions, while over-optimization (upper path) risks deviating from the distribution of the diffusion model. Our approach (middle path) addresses this challenge by prioritizing noise optimization in the initial latent space, ensuring sufficient and appropriate optimization.
  • Figure 5: Effect of the scale factor of Attend-and-Excite. Given the same text prompt, we adjust the scale factor of Attend-and-Excite (Chefer et al., 2023); the original work uses 20. Images of varying quality are synthesized, and the red box indicates the highest-quality image. Images in the same row share an identical random seed.
  • ...and 8 more figures