InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang
TL;DR
This work tackles the persistent misalignment between text prompts and diffusion-generated images, identifying invalid initial noise as a core cause. It introduces Initial Noise Optimization (InitNO), which partitions the initial latent space into valid and invalid regions using a cross-attention response score and a self-attention conflict score, then optimizes the initial noise toward the valid region under a distribution-alignment objective. The method minimizes a joint loss $L_{joint} = \lambda_1 L_{CrossAttn} + \lambda_2 L_{SelfAttn} + \lambda_3 L_{KL}$, where $L_{KL}$ keeps the optimized noise compatible with the standard Gaussian prior, and updates the noise distribution parameters with Adam, producing prompt-faithful images without retraining. Empirically, InitNO improves prompt alignment both qualitatively and quantitatively, performs well on grounded layout-to-image tasks, and remains plug-and-play for existing diffusion models, enabling training-free controllable generation for grounded T2I synthesis.
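As a rough illustration of how the three terms of the joint loss combine, the sketch below computes the closed-form KL divergence between a diagonal Gaussian $N(\mu, \sigma^2)$ and the standard Gaussian prior, then forms the weighted sum. It is a minimal sketch, not the authors' implementation: the function names, the default weights, and the assumption that the noise distribution is parameterized by a per-dimension mean and standard deviation are all hypothetical, and the two attention losses are passed in as plain numbers because computing them requires attention maps from an actual diffusion model.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Standard closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2)).
    `mu` and `sigma` are equal-length sequences of floats (sigma > 0).
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def joint_loss(cross_attn_loss, self_attn_loss, mu, sigma,
               lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum lambda_1*L_CrossAttn + lambda_2*L_SelfAttn + lambda_3*L_KL.

    The attention losses are placeholders supplied by the caller, and the
    default weights are illustrative only (not values from the paper).
    """
    l1, l2, l3 = lambdas
    return (l1 * cross_attn_loss
            + l2 * self_attn_loss
            + l3 * kl_to_standard_normal(mu, sigma))


if __name__ == "__main__":
    # Noise parameters that already match the prior incur no KL penalty.
    print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
    # A shifted mean is penalized, pulling the parameters back to the prior.
    print(joint_loss(0.2, 0.1, mu=[1.0], sigma=[1.0]))
```

In a full pipeline, this scalar would be minimized with respect to the distribution parameters by an optimizer such as Adam, with the KL term discouraging the optimized noise from drifting away from the Gaussian prior that the diffusion model expects.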
Abstract
Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, achieving seamless alignment between the generated image and the provided prompt remains a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. For a given text prompt, not every random noise sample is effective in synthesizing a semantically faithful image. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at https://github.com/xiefan-guo/initno.
