Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

TL;DR

This work tackles the persistent problem of spatial inconsistency in text-to-image diffusion models by identifying under-representation of spatial relations in vision-language datasets. It introduces SPRIGHT, a large-scale, spatially focused re-captioning of roughly six million images from four existing datasets, paired with an efficient training pipeline that fine-tunes a diffusion model on a small, object-rich subset to achieve state-of-the-art spatial accuracy. The authors provide extensive ablations, analyses of the CLIP text encoder, and evaluation across benchmarks such as T2I-CompBench and VISOR, demonstrating substantial gains in spatial fidelity while preserving non-spatial image quality; improvements include a 41 percent boost over baselines with fewer than 500 training images and strong gains in FID and CMMD. The work also explores caption length effects, negation handling, and semantic representations, offering a practical pathway to robust spatial reasoning in future vision-language systems and highlighting SPRIGHT’s potential to guide dataset design and training strategies for spatially aware T2I models.

Abstract

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets and, through a 3-fold evaluation and analysis pipeline, show that SPRIGHT improves the proportion of spatial relationships in existing datasets. We demonstrate the efficacy of SPRIGHT by showing that using only $\sim$0.25% of it results in a 22% improvement in generating spatially accurate images while also improving FID and CMMD scores. We also find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, obtained by fine-tuning on <500 images. Through a set of controlled experiments and ablations, we document additional findings that could support future work on understanding the factors that affect spatial consistency in text-to-image models.
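To make the notion of "proportion of spatial relationships" concrete, the following is a minimal sketch of how one might estimate the share of captions that mention a spatial phrase. It is not the paper's evaluation pipeline: the phrase list, the function names, and the toy captions are all assumptions made for illustration, and plain keyword matching is a crude heuristic (for example, "right" can also mean "correct").

```python
import re

# Hypothetical phrase list chosen for illustration; the vocabulary used in
# the paper's own analysis may differ.
SPATIAL_PHRASES = (
    "left", "right", "above", "below", "behind",
    "in front of", "next to", "on top of", "under",
)

# Match any phrase as a whole word or word sequence, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SPATIAL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def has_spatial_phrase(caption: str) -> bool:
    """Return True if the caption contains at least one listed phrase."""
    return _PATTERN.search(caption) is not None


def spatial_fraction(captions) -> float:
    """Fraction of captions that mention at least one spatial phrase."""
    captions = list(captions)
    if not captions:
        return 0.0
    return sum(has_spatial_phrase(c) for c in captions) / len(captions)


# Toy captions written for this sketch (not taken from any dataset).
short_captions = ["A dog and a frisbee.", "Two people with a pizza."]
spatial_captions = [
    "A dog sits to the left of a frisbee.",
    "A pizza is on top of the table next to two people.",
]

print(spatial_fraction(short_captions))
print(spatial_fraction(spatial_captions))
```

Comparing this fraction between original and re-captioned text for the same images gives a rough, assumption-laden proxy for the kind of shift the abstract describes.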

Paper Structure (31 sections, 12 figures, 11 tables)

This paper contains 31 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: We find that existing vision-language datasets do not capture spatial relationships well. To alleviate this shortcoming, we synthetically re-caption $\sim$6M images with a spatial focus, and create the SPRIGHT (SPatially RIGHT) dataset. Shown above are samples from the COCO Validation Set, where text in red denotes ground-truth captions and text in green denotes the corresponding captions from SPRIGHT.
  • Figure 1: Illustrative examples comparing ground-truth images from COCO and generated images from Baseline SD 2.1 and our model. The images generated by our model exhibit greater fidelity to the input prompts, while also achieving a higher level of photorealism.
  • Figure 2: Comparison with ground-truth COCO captions. (Left) Word cloud representations showing that SPRIGHT captions significantly amplify the presence of spatial relationships. (Right) SPRIGHT captions also capture a higher number of object occurrences.
  • Figure 2: Illustrative examples from the SPRIGHT dataset, where the captions are correct in their entirety, both in capturing the spatial relationships and in the overall description of the image. The images are taken from CC-12M and Segment Anything.
  • Figure 3: Generated images from our model, as described in Section \ref{baseline_improve}, on prompts that contain multiple objects and complex spatial relationships. We curate these prompts from ChatGPT.
  • ...and 7 more figures