
Spatial-Aware Latent Initialization for Controllable Image Generation

Wenqiang Sun, Teng Li, Zehong Lin, Jun Zhang

TL;DR

This work tackles the difficulty of enforcing spatial layouts in text-to-image diffusion by introducing Spatial-Aware Latent Initialization (SALT), which uses the DDIM inversion latent $z_T^*$ as the initialization to preserve spatial cues during denoising. By coupling this spatial-aware initialization with an attention-guided layout loss $L$ (with a regularization term) and a small number of optimization steps, SALT significantly improves layout adherence (IoU and mAP@0.5) on COCO while maintaining image quality comparable to baseline methods. The approach is a plug-and-play module for training-free layout guidance, requires no model retraining, and shows strong gains over prior zero-shot methods, particularly in multi-object layouts. In practice, SALT reduces the need for lengthy optimization and extends reliable layout control to open-vocabulary prompts, albeit with some CLIP-score trade-offs due to spatial sparsity constraints.
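As a reminder of what the IoU and mAP@0.5 figures measure, the sketch below computes the intersection-over-union of two axis-aligned boxes and applies the 0.5 threshold. The function names and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions; this summary does not say how the boxes in generated images are obtained (an object detector is assumed).

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def matches_layout(detected_box, target_box, threshold=0.5):
    """True when the detected box overlaps the target box by at least `threshold`."""
    return box_iou(detected_box, target_box) >= threshold
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` overlaps on a unit square out of a union of area 7, so it falls below the 0.5 threshold used by mAP@0.5.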

Abstract

Recently, text-to-image diffusion models have demonstrated an impressive ability to generate high-quality images conditioned on textual input. However, these models struggle to accurately adhere to textual instructions regarding spatial layout information. While previous research has primarily focused on aligning cross-attention maps with layout conditions, it overlooks the impact of the initialization noise on the layout guidance. To achieve better layout control, we propose leveraging a spatial-aware initialization noise during the denoising process. Specifically, we find that the inverted reference image with finite inversion steps contains valuable spatial awareness regarding the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Without modifying other modules except the initialization noise, our approach can be seamlessly integrated as a plug-and-play module within other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the available Stable Diffusion model and the COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content.

Paper Structure

This paper contains 15 sections, 6 equations, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: Our method requires a textual prompt and a bounding box set (the first and fourth columns from the left) as the layout condition. The token that has a position requirement is marked in red. For each caption and position pair, we generate images by leveraging our layout guidance approach. As shown in this figure, our method is capable of generating high-quality images with the desired layouts.
  • Figure 2: Layout Guidance Using Different Seeds. Given the same random noise, prompt, and bounding box, we generate images using Stable Diffusion (SD) (top row) and SD with the layout guidance method proposed in chen2023training (bottom row), respectively. From the same initialization noise $z_T$, when images generated by SD are close to the given layout condition, the layout guidance method achieves effective control and high-quality generation (fourth and fifth columns from the left). Otherwise, the layout performance and image quality may degrade (second and third columns from the left).
  • Figure 3: Sampling using DDIM inversion latent. We first invert original images (leftmost column) to obtain DDIM inversion latent $z_T^*$, and use the latent variable $z_T^*$ as the initialization noise to sample images. Given different captions, we clearly see that the layouts of synthetic images (first to third columns to the right) are highly consistent with the input images.
  • Figure 4: Cross-attention map visualization. By utilizing the DDIM inversion latent $z_T^*$ as the initialization noise (upper row) for image sampling, the cross-attention map of the target object "cat" shows a consistent tendency across sampling timesteps. In contrast, the cross-attention map changes significantly during sampling when using a random initialization noise (lower row).
  • Figure 5: Overview of the proposed framework. Given layout conditions, we customize a reference image by transforming open-vocabulary object masks to the specified shapes and placing them at the corresponding positions. This image is then processed by DDIM inversion to obtain a spatial-aware latent as the initialization noise. In the denoising process, we utilize the cross-attention map to optimize the latent for more accurate layout control.
  • ...and 8 more figures
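
The captions for Figures 3 and 5 describe inverting a reference image to a latent $z_T^*$ and reusing it as the initialization noise. A minimal sketch of deterministic DDIM inversion and sampling follows, under stated assumptions: `eps_model` is a hypothetical stand-in for the noise-prediction network, `alphas` is an illustrative schedule of cumulative alpha values, and text conditioning and the layout loss are omitted. It is not the authors' implementation. The arithmetic is written so that `z` can be a float or any array type that supports elementwise operations.

```python
import math


def ddim_step(z, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) from noise level alpha_from to alpha_to."""
    # Predict the clean latent implied by the current latent and noise estimate.
    z0_pred = (z - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    # Re-noise that prediction to the target noise level with the same estimate.
    return math.sqrt(alpha_to) * z0_pred + math.sqrt(1.0 - alpha_to) * eps


def ddim_invert(z0, eps_model, alphas):
    """Map a clean latent to an inverted latent; alphas[0] is the least noisy level."""
    z = z0
    for t in range(len(alphas) - 1):
        # Inversion approximation: use the noise predicted at the current level.
        eps = eps_model(z, t)
        z = ddim_step(z, eps, alphas[t], alphas[t + 1])
    return z


def ddim_sample(z_T, eps_model, alphas):
    """Denoise from the noisiest level alphas[-1] back to alphas[0]."""
    z = z_T
    for t in range(len(alphas) - 1, 0, -1):
        eps = eps_model(z, t)
        z = ddim_step(z, eps, alphas[t], alphas[t - 1])
    return z


# Toy check: with a noise predictor that ignores its inputs, inversion followed
# by sampling returns the starting latent up to floating-point error.
alphas = [0.999 - 0.989 * i / 49 for i in range(50)]  # illustrative schedule
toy_eps_model = lambda z, t: -0.3  # hypothetical stand-in for the U-Net
z0 = 0.7
z_T_star = ddim_invert(z0, toy_eps_model, alphas)
z0_rec = ddim_sample(z_T_star, toy_eps_model, alphas)
```

With a real network the noise prediction depends on the latent, the timestep, and the prompt, so the round trip is only approximate. Sampling from the inverted latent with a different caption is the setting Figure 3 illustrates.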