Vision language models are unreliable at trivial spatial cognition

Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

TL;DR

This work probes the reliability of vision-language architectures in trivial spatial cognition tasks by introducing TableTest, a synthetic benchmark of spatial relations among objects on a table. It evaluates three recent architectures under multimodal and text-only prompts across eight prompt variants, revealing substantial prompt-dependent variability and, in many cases, performance far below reliable human-like baselines. The findings indicate systematic limitations in how these models reason about spatial relations, especially under negative or disjunctive prompts, and suggest that training data biases (e.g., sparse spatial relations in image captions) may underlie these failures. The paper advocates for broader, cross-prompt benchmarking and synthetic-data-driven approaches to bolstering training corpora, while cautioning against over-interpreting single-prompt success as general spatial competence.

Abstract

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at engaging in trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance could be degraded by minor variations of prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.

Paper Structure

This paper contains 27 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Examples from the TableTest dataset, which includes 64 individual objects (a) in various configurations (b, c). The dataset includes all 2-object configurations, such as one in which a blender is depicted to the right of a cupcake (b). It also includes all 3-object configurations, such as one in which a cupcake is to the left of a toaster, which is to the left of a blender (c).
  • Figure 2: Proportion accuracy from multimodal evaluations of VLM performance on spatial recognition prompts (see the table of prompts) in which each prompt was paired with an image from TableTest. Dashed lines in each panel depict the overall accuracy of each VLM, and density plots depict performance distributions across TableTest's 64 objects, as organized by whether the object served as "object A" in prompt templates and variations. Humanlike performance is anticipated to be at ceiling (accuracy = 1.0); VLMs with humanlike performance will exhibit density patterns akin to that shown in Figure 3g below.
  • Figure 3: Proportion accuracy from text-only evaluations of VLM performance on spatial recognition prompts (see the table of prompts) in which each prompt was paired with text that described a particular image from TableTest. Dashed lines in each panel depict the overall accuracy of each VLM, and density plots depict performance distributions across TableTest's 64 objects, as organized by whether the object served as "object A" in prompt templates and variations. Humanlike performance is anticipated to be at ceiling (accuracy = 1.0).
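
The per-object breakdown described in the Figure 2 and Figure 3 captions (accuracy grouped by which object served as "object A", alongside an overall accuracy) could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function names, the `(object_a, correct)` trial format, and the toy trials are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_object_a(trials):
    """Return proportion accuracy per object that served as "object A".

    `trials` is an iterable of (object_a, correct) pairs; this record
    format is an assumption made for the sketch, not the paper's schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for object_a, correct in trials:
        totals[object_a] += 1
        hits[object_a] += int(correct)
    return {obj: hits[obj] / totals[obj] for obj in totals}


def overall_accuracy(trials):
    """Return the proportion of correct trials (the dashed line in each panel)."""
    trials = list(trials)
    return sum(int(correct) for _, correct in trials) / len(trials)


# Hypothetical toy trials, not data from the paper.
toy_trials = [
    ("cupcake", True),
    ("cupcake", False),
    ("blender", True),
    ("toaster", True),
]

per_object = accuracy_by_object_a(toy_trials)
overall = overall_accuracy(toy_trials)
```

The values in `per_object` are what a density plot would summarize across the 64 objects; a model performing at ceiling would have every value equal to 1.0.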