InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad

TL;DR

The paper tackles the persistent problem of spatial misalignment in text-to-image diffusion models. It introduces InfSplign, a training-free, inference-time method that uses a compound loss derived from cross-attention maps to enforce spatial grounding and object preservation during sampling. By estimating object centroids and variances from coarse and mid-level attention, and applying three complementary losses during denoising, InfSplign guides latent trajectories toward spatially coherent generations. Extensive experiments on VISOR and T2I-CompBench demonstrate state-of-the-art improvements over both inference-time baselines and fine-tuning approaches, with ablations validating the contributions of each loss term. The approach offers a lightweight, plug-and-play improvement to diffusion backbones without extra inputs or retraining, potentially broadening practical adoption for spatially-aware image synthesis.
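As a rough illustration of the centroid and variance estimation mentioned above, the sketch below treats a non-negative 2D attention map as an unnormalized distribution over pixel coordinates and computes its mean and per-axis variance. The function name `attention_centroid_and_variance` and the plain nested-list representation are assumptions made for this example; the paper's own layer selection, normalization, and aggregation may differ.

```python
def attention_centroid_and_variance(attn):
    """Return ((cx, cy), (var_x, var_y)) for a non-negative 2D attention map.

    `attn` is a list of rows; x indexes columns and y indexes rows.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")

    # Attention-weighted mean of the pixel coordinates (the centroid).
    cx = cy = 0.0
    for y, row in enumerate(attn):
        for x, w in enumerate(row):
            cx += x * w
            cy += y * w
    cx /= total
    cy /= total

    # Attention-weighted spread around the centroid, per axis.
    var_x = var_y = 0.0
    for y, row in enumerate(attn):
        for x, w in enumerate(row):
            var_x += w * (x - cx) ** 2
            var_y += w * (y - cy) ** 2
    return (cx, cy), (var_x / total, var_y / total)


# Usage: a map with all of its mass on the centre pixel.
peaked = [[0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]
centroid, variance = attention_centroid_and_variance(peaked)
```

A small variance means the attention for an object token is concentrated in one place, and the centroid gives a single point that can be compared against another object's centroid to check a relation such as "left of".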

Abstract

Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in the training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
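The abstract describes the update only as adjusting the noise through a compound loss. The toy sketch below assumes a classifier-guidance-style rule in which the gradient of the loss with respect to the latent is added to the predicted noise, so that the subsequent denoising step moves the latent toward lower loss. The update rule, the names `numerical_grad`, `guided_noise`, and `toy_denoise_step`, the finite-difference gradient, and the quadratic stand-in loss are all assumptions for illustration, not the paper's formulation.

```python
def numerical_grad(loss_fn, z, h=1e-5):
    """Central-difference gradient of loss_fn at latent z (a list of floats)."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += h
        down[i] -= h
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grad


def guided_noise(eps, z, loss_fn, scale):
    """Assumed guidance rule: shift the predicted noise along the loss gradient."""
    grad = numerical_grad(loss_fn, z)
    return [e + scale * g for e, g in zip(eps, grad)]


def toy_denoise_step(z, eps, step):
    """Toy stand-in for a sampler step: remove a fraction of the predicted noise."""
    return [zi - step * ei for zi, ei in zip(z, eps)]


# Stand-in for the compound loss: squared distance of the latent to a target.
target = [1.0, -1.0]

def toy_loss(z):
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))

z = [0.0, 0.0]
eps = [0.0, 0.0]  # pretend the backbone predicted zero noise
eps_guided = guided_noise(eps, z, toy_loss, scale=0.5)
z_next = toy_denoise_step(z, eps_guided, step=0.1)
```

In a real pipeline the gradient would come from automatic differentiation through the attention maps of the backbone rather than from finite differences, and the step would be the sampler's own update.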

Paper Structure

This paper contains 12 sections, 8 equations, 13 figures, 6 tables, and 1 algorithm.

Figures (13)

  • Figure 1: InfSplign is a training-free inference-time method that improves spatial understanding of text-to-image (T2I) Stable Diffusion (SD) models, namely SD v1.4, SD v2.1, and SDXL.
  • Figure 2: Overview of the proposed approach.
  • Figure 3: Attention energy across decoder cross-attention layers. Coarse layers encode global structure, with $\mathcal{L}_{\text{presence}}$ enforcing focused attention; mid-level layers encode local detail, with $\mathcal{L}_{\text{balance}}$ equalizing object energy to prevent dominance.
  • Figure 4: Qualitative comparison with SD across different VISOR prompts.
  • Figure 5: Comparison of spatial understanding across T2I diffusion models. InfSplign aligns objects with the target relation more consistently than SD v1.4 [rombach2022high], INITNO [guo2024initno], CONFORM [meral2024conform], and STORM [han2025spatial].
  • ...and 8 more figures
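The caption of Figure 3 describes a balance term that equalizes object energy so that one object does not dominate. The exact definition is not given in this listing, so the sketch below is only one plausible reading: it assumes "energy" is the summed attention of an object's map and penalizes the squared relative difference between two objects. Both `attention_energy` and `balance_penalty` are hypothetical names and definitions.

```python
def attention_energy(attn):
    """Assumed energy: total attention mass of a 2D map given as a list of rows."""
    return sum(w for row in attn for w in row)


def balance_penalty(attn_a, attn_b, eps=1e-8):
    """Squared relative energy gap between two objects.

    The value is 0 when both objects receive equal attention and approaches 1
    when one object takes nearly all of it.
    """
    ea = attention_energy(attn_a)
    eb = attention_energy(attn_b)
    return ((ea - eb) / (ea + eb + eps)) ** 2


# Usage: object A dominates object B in this pair of maps.
obj_a = [[0.9, 0.8], [0.7, 0.6]]
obj_b = [[0.1, 0.0], [0.0, 0.1]]
penalty = balance_penalty(obj_a, obj_b)
```

Using the relative rather than the absolute gap keeps the penalty comparable across layers whose attention maps have different resolutions; that choice is also an assumption of this example.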