Evaluating the Generation of Spatial Relations in Text and Image Generative Models

Shang Hong Sim, Clarence Lee, Alvin Tan, Cheston Tan

TL;DR

Surprisingly, T2I models achieve only subpar performance at generating spatial relations despite their impressive general image-generation abilities, and LLMs are significantly more accurate than T2I models at this task despite being trained primarily on textual data.

Abstract

Understanding spatial relations is a crucial cognitive ability for both humans and AI. While current research has predominantly focused on benchmarking text-to-image (T2I) models, we propose a more comprehensive evaluation that includes both T2I models and Large Language Models (LLMs). As spatial relations are naturally understood in a visuo-spatial manner, we develop an approach to convert LLM outputs into an image, thereby allowing us to evaluate both T2I models and LLMs visually. We examine the spatial relation understanding of 8 prominent generative models (3 T2I models and 5 LLMs) on a set of 10 common prepositions, and assess the feasibility of automatic evaluation methods. Surprisingly, we find that T2I models achieve only subpar performance despite their impressive general image-generation abilities. Even more surprisingly, our results show that LLMs are significantly more accurate than T2I models in generating spatial relations, despite being trained primarily on textual data. We examine reasons for model failures and highlight gaps that can be filled to enable more spatially faithful generations.

Paper Structure

This paper contains 19 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Images generated using both T2I and LLMs for some example prompts.
  • Figure 2: Overall image-generation pipelines for LLMs and T2I models.
  • Figure 3: Examples of images generated using GPT-4 and Mixtral 8x7b for complex prompts.
  • Figure 4: Distribution of human ratings of images generated for simple prompts. Detailed descriptions of the four ratings are as follows. A: Spatial relationship is correct, and the number and type of objects are correct. B: Spatial relationship is correct, but one type of object is wrong, or the number of objects is wrong. C: Spatial relationship is wrong, but the type and number of objects are correct. D: Spatial relationship is wrong, and the type of objects is wrong.
  • Figure 5: Distribution of human ratings for complex prompts. A: Both spatial relations are correct, and the number and type of objects are all correct. B: Only one spatial relation is correct, but the number and type of objects are all correct. C: Neither of the spatial relations is correct, but the number and type of objects are all correct. D: The number or type of objects is wrong.
  • ...and 1 more figures
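
The rating rubrics in the captions of Figures 4 and 5 can be read as small decision rules. The sketch below is only an illustration of those rules, not the authors' tooling: in the paper the ratings are assigned by human raters. The function names `rate_simple` and `rate_complex` are hypothetical, and collapsing "number and type of objects" into a single boolean `objects_correct` is an assumption made for brevity.

```python
def rate_simple(relation_correct: bool, objects_correct: bool) -> str:
    """Map a judgment of a simple-prompt image to the Figure 4 ratings.

    objects_correct is an assumed shorthand for "the number and type of
    objects are correct"; the caption words ratings B and D slightly
    differently, and this sketch does not capture that nuance.
    """
    if relation_correct:
        return "A" if objects_correct else "B"
    return "C" if objects_correct else "D"


def rate_complex(relations_correct: int, objects_correct: bool) -> str:
    """Map a judgment of a complex-prompt image to the Figure 5 ratings.

    relations_correct is how many of the two spatial relations in the
    prompt are depicted correctly (0, 1, or 2).
    """
    if relations_correct not in (0, 1, 2):
        raise ValueError("a complex prompt has exactly two spatial relations")
    if not objects_correct:
        # Rating D applies whenever the objects are wrong,
        # regardless of the spatial relations.
        return "D"
    return {2: "A", 1: "B", 0: "C"}[relations_correct]
```

For example, `rate_simple(True, False)` gives `"B"`, and `rate_complex(1, True)` gives `"B"`. Note that in the complex rubric the object check takes priority, so an image with both relations correct but a wrong object still falls into D.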