Spatial-Aware Latent Initialization for Controllable Image Generation
Wenqiang Sun, Teng Li, Zehong Lin, Jun Zhang
TL;DR
This work tackles the difficulty of enforcing spatial layouts in text-to-image diffusion by introducing Spatial-Aware Latent Initialization (SALT), which uses the DDIM inversion latent $z_T^*$ of a reference image as the initialization so that spatial cues are preserved during denoising. By coupling this spatial-aware initialization with an attention-guided layout loss $L$ (with a regularization term) and a small number of optimization steps, SALT significantly improves layout adherence (IoU and mAP@0.5) on COCO while keeping image quality comparable to baseline methods. The approach is a plug-and-play addition to training-free layout guidance, requires no model retraining, and shows strong gains over prior zero-shot methods, particularly on multi-object layouts. In practice, SALT reduces the need for lengthy optimization and extends reliable layout control to open-vocabulary prompts, albeit with some CLIP-score trade-offs due to spatial sparsity constraints.
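For readers unfamiliar with the layout metric, the following minimal sketch shows the standard intersection-over-union (IoU) between two axis-aligned boxes, which is also the overlap criterion behind the 0.5 threshold in mAP@0.5. The function name `box_iou` and the `(x1, y1, x2, y2)` box convention are assumptions made for illustration; how boxes are detected in the generated images is outside this sketch and is not specified here.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; the box format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


# A layout box versus a box detected in the generated image.
layout_box = (0.0, 0.0, 2.0, 2.0)
detected_box = (1.0, 1.0, 3.0, 3.0)
iou = box_iou(layout_box, detected_box)
```

Under this definition, a detection would count as matching its layout box at the 0.5 threshold only when the returned value is at least 0.5.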
Abstract
Recently, text-to-image diffusion models have demonstrated an impressive ability to generate high-quality images conditioned on textual input. However, these models struggle to adhere accurately to textual instructions regarding spatial layout. While previous research has primarily focused on aligning cross-attention maps with layout conditions, it overlooks the impact of the initialization noise on layout guidance. To achieve better layout control, we propose leveraging a spatial-aware initialization noise during the denoising process. Specifically, we find that a reference image inverted with a finite number of inversion steps retains valuable spatial awareness of the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Because it modifies no module other than the initialization noise, our approach can be seamlessly integrated as a plug-and-play module within other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the publicly available Stable Diffusion model and the COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content.
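The inversion step mentioned above can be illustrated with a minimal sketch of deterministic DDIM inversion, assuming the standard DDIM update rule. Everything here is a stand-in for illustration: `ddim_invert`, `zero_eps_model`, and the linear `alphas_cumprod` schedule are hypothetical names and values, the noise predictor is a toy function rather than Stable Diffusion's U-Net, and this is not presented as the authors' implementation.

```python
import numpy as np


def ddim_invert(x0, eps_model, alphas_cumprod):
    """Deterministic DDIM inversion of a clean latent x0 to a noisy latent z_T.

    alphas_cumprod is a decreasing sequence of cumulative alphas over the
    chosen inversion timesteps, starting at 1.0 for the clean latent.
    """
    x = np.asarray(x0, dtype=np.float64)
    for i in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[i], alphas_cumprod[i + 1]
        eps = eps_model(x, i)  # predicted noise at the current step
        # Estimate the clean latent implied by the current noisy latent.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Move one step toward higher noise along the deterministic path.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x


def zero_eps_model(x, step):
    """Toy noise predictor (assumption): always predicts zero noise."""
    return np.zeros_like(x)


# Toy "reference latent" with an object-like bright block on the left half.
x0 = np.zeros((4, 8, 8))
x0[:, 2:6, 0:4] = 1.0

# Finite inversion: 10 steps with a final cumulative alpha well above zero.
alphas_cumprod = np.linspace(1.0, 0.2, 11)
z_T = ddim_invert(x0, zero_eps_model, alphas_cumprod)
```

With the toy zero-noise predictor, the inverted latent is just a scaled copy of the reference, which makes the retained layout easy to see; with a real noise predictor the latent looks noisy, and how much positional structure survives is the empirical observation the abstract reports. In a sketch like this, `z_T` would then replace the random Gaussian latent at the start of denoising.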
