Exploring Spatial Schema Intuitions in Large Language and Vision Models

Philipp Wicke, Lennart Wachowiak

TL;DR

The paper investigates whether non-embodied large language and vision-language models encode human spatial intuitions anchored in image schemas. By reproducing three classic psycholinguistic experiments with a range of LLMs and VLMs and applying careful prompt design and correlation analyses, it shows that large models can exhibit moderate alignment with human spatial judgments, though results vary by modality and model type. Key findings include strong alignment for textual/pseudo-visual prompts (notably GPT-4) and weaker, often negligible correlations for open-source VLMs, suggesting that embodiment is not strictly necessary for capturing spatial intuitions but that grounding and modality strongly shape performance. The work provides a foundation for understanding the relationship between language, spatial experience, and model computations, with implications for grounding, multilingual extension, and responsible deployment in real-world tasks.

Abstract

Despite the ubiquity of large language models (LLMs) in AI research, the question of embodiment in LLMs remains underexplored, distinguishing them from embodied systems in robotics where sensory perception directly informs physical action. Our investigation navigates the intriguing terrain of whether LLMs, despite their non-embodied nature, effectively capture implicit human intuitions about fundamental, spatial building blocks of language. We employ insights from spatial cognitive foundations developed through early sensorimotor experiences, guiding our exploration through the reproduction of three psycholinguistic experiments. Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences. Notable distinctions include polarized language model responses and reduced correlations in vision language models. This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and the computations made by large language models. More at https://cisnlp.github.io/Spatial_Schemas/

Paper Structure

This paper contains 31 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of the three experiments
  • Figure 2: Target images from the original study by Richardson et al. (2001). Each participant was asked to match 30 verbs to one of the images (A-D).
  • Figure 3: Distribution of image schema choice for items "bombed" and "lifted" by humans (bold) and GPT-4 (light).