Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Agneet Chatterjee; Gabriela Ben Melech Stan; Estelle Aflalo; Sayak Paul; Dhruba Ghosh; Tejas Gokhale; Ludwig Schmidt; Hannaneh Hajishirzi; Vasudev Lal; Chitta Baral; Yezhou Yang

TL;DR

This work tackles the persistent problem of spatial inconsistency in text-to-image diffusion models and traces it to the under-representation of spatial relations in vision-language datasets. It introduces SPRIGHT, a large-scale, spatially focused re-captioning of roughly six million images from four existing datasets, paired with an efficient training pipeline that fine-tunes a diffusion model on a small, object-rich subset to achieve state-of-the-art spatial accuracy. The authors provide extensive ablations, analyses of the CLIP text encoder, and evaluations on benchmarks such as T2I-CompBench and VISOR, demonstrating substantial gains in spatial fidelity while preserving non-spatial image quality. Reported improvements include a 41 percent boost over baselines with fewer than 500 training images, along with better FID and CMMD scores. The work also explores caption length effects, negation handling, and semantic representations, and it positions SPRIGHT as a guide for dataset design and training strategies for spatially aware T2I models.

Abstract

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we show that SPRIGHT improves the proportion of spatial relationships in existing datasets. We demonstrate the efficacy of SPRIGHT by showing that using only $\sim$0.25% of the data results in a 22% improvement in generating spatially accurate images while also improving FID and CMMD scores. We also find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Through a set of controlled experiments and ablations, we document additional findings that could support future work that seeks to understand factors that affect spatial consistency in text-to-image models.
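The claim that spatial relationships are under-represented in captions can be illustrated with a simple count. The following is a minimal sketch, not the authors' evaluation pipeline: the phrase list `SPATIAL_PHRASES`, the helper names, and the example captions are all assumptions made up for illustration. It measures the fraction of captions that mention at least one spatial phrase.

```python
import re

# Hypothetical, illustrative phrase list; NOT the vocabulary used in the paper.
SPATIAL_PHRASES = (
    "left", "right", "above", "below",
    "behind", "in front of", "next to", "under",
)

# One case-insensitive pattern that matches any phrase on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SPATIAL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def has_spatial_phrase(caption: str) -> bool:
    """Return True if the caption contains at least one listed spatial phrase."""
    return _PATTERN.search(caption) is not None


def spatial_fraction(captions) -> float:
    """Fraction of captions mentioning a spatial phrase (0.0 for empty input)."""
    captions = list(captions)
    if not captions:
        return 0.0
    return sum(has_spatial_phrase(c) for c in captions) / len(captions)


# Made-up captions standing in for original and spatially focused re-captions.
original = ["A dog and a cat.", "A plate of food on a table."]
recaptioned = [
    "A dog sits to the left of a cat.",
    "A plate of food on a table, next to a glass.",
]

print(spatial_fraction(original), spatial_fraction(recaptioned))
```

Keyword matching of this kind is only a rough proxy: a word such as "right" also appears in non-spatial senses, so a careful analysis would need a more robust way of identifying spatial relations.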

Paper Structure (31 sections, 12 figures, 11 tables)

This paper contains 31 sections, 12 figures, 11 tables.

Introduction
Related Work
Text-to-image generative models.
Spatial relationships in T2I models.
Synthetic captions for T2I models.
The SPRIGHT Dataset
Creating the SPRIGHT Dataset
Impact of SPRIGHT
Dataset Validation
Improving Spatial Consistency
Improving upon Baseline Methods
Efficient Training Methodology
Ablation Studies and Analyses
Optimal Ratio of Spatial Captions
Impact of Long and Short Spatial Captions
...and 16 more sections

Figures (12)

Figure 1: We find that existing vision-language datasets do not capture spatial relationships well. To alleviate this shortcoming, we synthetically re-caption $\sim$6M images with a spatial focus, and create the SPRIGHT (SPatially RIGHT) dataset. Shown above are samples from the COCO Validation Set, where text in red denotes ground-truth captions and text in green denotes the corresponding captions from SPRIGHT.
Figure 1: Illustrative examples comparing ground-truth images from COCO and generated images from Baseline SD 2.1 and our model. The images generated by our model exhibit greater fidelity to the input prompts, while also achieving a higher level of photorealism.
Figure 2: Compared to ground-truth COCO captions, (Left) word cloud representations show that SPRIGHT captions significantly amplify the presence of spatial relationships. (Right) SPRIGHT captions also capture a higher number of object occurrences.
Figure 2: Illustrative examples from the SPRIGHT dataset, where the captions are correct in their entirety, both in capturing the spatial relationships and in the overall description of the image. The images are taken from CC-12M and Segment Anything.
Figure 3: Generated images from our model, as described in the section "Improving upon Baseline Methods", on prompts which contain multiple objects and complex spatial relationships. We curate these prompts from ChatGPT.
...and 7 more figures
