Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang

TL;DR

STORM introduces training-free Spatial Transport Optimization (STO), which achieves spatially coherent text-to-image synthesis by repositioning attention maps via Optimal Transport. It defines a Spatial Transport Cost (ST Cost) that combines a directional Positional Cost with a Non-overlap penalty, and computes a Sinkhorn transport plan that moves attention mass while the latent representation is updated during the early denoising steps. Across evaluations on VISOR, T2I-CompBench, TIFA, and SR2D, STORM shows better spatial alignment, object accuracy, and attribute binding than prior training-free methods, highlighting the importance of early-stage spatial guidance. The method leaves model weights unchanged, is compatible with existing diffusion backbones, and could be coupled with stronger text encoders for even better spatial fidelity.
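To make the Sinkhorn step concrete, here is a minimal sketch of an entropic-regularized transport plan between two small attention histograms. It is an illustration only, not the paper's implementation: the squared-distance cost stands in for the ST Cost (whose exact form is not given in this summary), and the names `sinkhorn_plan`, `source`, `target`, `eps`, and `n_iters` are assumptions chosen for the example.

```python
import math


def sinkhorn_plan(a, b, cost, eps=0.5, n_iters=200):
    """Entropic-regularized optimal transport plan between histograms a and b.

    a, b  : lists of non-negative weights, each summing to 1
    cost  : cost[i][j] is the cost of moving mass from bin i of a to bin j of b
    eps   : entropic regularization strength (assumed value, not from the paper)
    """
    n, m = len(a), len(b)
    # Gibbs kernel: cheap moves get large kernel values.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # plan[i][j] is the amount of mass moved from bin i to bin j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 1D "attention strip" with four positions (0 = leftmost).
source = [0.1, 0.1, 0.4, 0.4]  # attention mass currently on the right
target = [0.4, 0.4, 0.1, 0.1]  # desired mass on the left
# Placeholder cost: squared normalized distance (NOT the paper's ST Cost).
cost = [[((i - j) / 3.0) ** 2 for j in range(4)] for i in range(4)]

plan = sinkhorn_plan(source, target, cost, eps=0.5)
```

Each row of `plan` sums (approximately) to the corresponding entry of `source` and each column to the corresponding entry of `target`, so the plan describes how attention mass would have to move to reach the desired layout. A directional cost, as the TL;DR describes, would replace the symmetric placeholder used here.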

Abstract

Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while existing methods have focused on challenges such as "missing objects" and "mismatched attributes," another critical issue remains: "mislocated objects," where generated spatial positions fail to align with the text prompt. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while also reducing missing objects and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis.

Paper Structure

This paper contains 31 sections, 14 equations, 21 figures, 8 tables, 1 algorithm.

Figures (21)

  • Figure 1: Three main challenges in training-free text-to-image (T2I) generation: (1) missing objects, (2) mismatched attributes, and (3) mislocated objects. While existing approaches address missing objects and mismatched attributes, effectively controlling object positioning remains an open problem. Our proposed model, STORM, introduces a dynamic approach to aligning relative object positions throughout the denoising process, enabling precise spatial control without additional spatial templates. Red underline highlights errors made by SD.
  • Figure 2: Comparison of Spatial Awareness. $\{\texttt{position}^{*}\}$ in each prompt denotes the spatial relationship in each column (e.g., "to the left of", "to the right of", "above", and "below"). While Stable Diffusion (SD) shows limited spatial awareness by generating similar images regardless of spatial prompts, our model accurately reflects specified positions. (The same seed is used for all syntheses.)
  • Figure 3: Comparison of Attention Map Progression During Denoising. Visualization of the attention maps for "a red bird to the right of a green plant" throughout the denoising process for both Stable Diffusion (a) and our model (b). While Stable Diffusion struggles to distinctly capture the spatial relationship between the bird and the plant, our model effectively aligns the objects according to the specified spatial cue ("to the right of"). The resulting image from our model demonstrates improved spatial accuracy compared to Stable Diffusion.
  • Figure 4: Overview pipeline of STORM. Our method leverages Optimal Transport in a training-free manner, allowing the model to accurately reflect relative object positions at each step without additional inputs. Given the prompt "A car to the left of an elephant", our method dynamically adjusts the attention maps to induce the specified spatial relationship. The process starts with initial attention maps for the "car" and "elephant" for the latent $z_t$ at time step $t$. Using the centroids of these attention maps, the Spatial Transport Optimization (STO) computes the losses to correct positional relationships (e.g., ensuring the car is to the left of the elephant). The updated attention map is then used to refine the latent representation $z_t$, leading to a final image that adheres to the desired spatial arrangement. The comparison of attention maps (before and after STO) shows improved alignment, effectively placing the car to the left of the elephant as instructed in the prompt.
  • Figure 5: Qualitative comparison across the custom prompt, which involves attribute and positional information in text, evaluating previous state-of-the-art training-free T2I methods, Attend&Excite, Divide&Bind, INITNO, CONFORM, and ours.
  • ...and 16 more figures
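
The Figure 4 caption mentions that STO works from the centroids of the object attention maps. As a rough illustration of that one ingredient, the sketch below computes a mass-weighted centroid for two toy attention maps and checks a "left of" relation. It is not the STO loss itself; the helper names `centroid` and `is_left_of`, the `margin` parameter, and the toy maps are all hypothetical.

```python
def centroid(attn):
    """Mass-weighted (x, y) centroid of a 2D attention map (list of rows)."""
    total = sum(sum(row) for row in attn)
    cx = sum(x * w for row in attn for x, w in enumerate(row)) / total
    cy = sum(y * w for y, row in enumerate(attn) for w in row) / total
    return cx, cy


def is_left_of(attn_a, attn_b, margin=0.0):
    """True if the centroid of attn_a lies left of attn_b's by more than margin."""
    return centroid(attn_a)[0] + margin < centroid(attn_b)[0]


# Toy 3x4 attention maps (hypothetical values): x grows to the right.
car = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
elephant = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.75],
    [0.0, 0.0, 0.0, 0.0],
]

relation_holds = is_left_of(car, elephant)
```

A check like this only tells whether the current layout already satisfies the prompt; in the method described by the caption, the centroids instead feed losses whose gradients are used to refine the latent $z_t$.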