Vision language models are unreliable at trivial spatial cognition
Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton
TL;DR
This work probes how reliably vision language models (VLMs) handle trivial spatial cognition tasks by introducing TableTest, a synthetic benchmark of spatial relations among objects on a table. It evaluates three recent architectures under multimodal and text-only prompts across eight prompt variants, revealing substantial prompt-dependent variability and, in many cases, performance far below a reliable, humanlike baseline. The findings indicate systematic limitations in how these models reason about spatial relations, especially under negative or disjunctive prompts, and suggest that training data biases (e.g., sparse spatial relations in image captions) may underlie these failures. The paper advocates broader, cross-prompt benchmarking and synthetic-data-driven approaches to bolstering training corpora, and cautions against over-interpreting single-prompt success as general spatial competence.
Abstract
Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at engaging in trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance could be degraded by minor variations of prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.
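To make the idea of "logically equivalent descriptions" concrete, the following is a minimal sketch of how a single spatial fact can be queried through several prompt wordings that all share one ground truth. It is an illustration only: the class `TableObject`, the `x` coordinate convention, the four templates, and the scoring function are assumptions made here, not the prompt variants or the evaluation code used with TableTest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableObject:
    """Hypothetical scene object; x is an assumed horizontal position
    from the viewer's perspective (smaller means further left)."""
    name: str
    x: float


def is_left_of(a: TableObject, b: TableObject) -> bool:
    """Ground truth for 'a is to the left of b' under the assumed convention."""
    return a.x < b.x


def equivalent_prompts(a: TableObject, b: TableObject) -> list[tuple[str, bool]]:
    """Return (prompt, expected yes/no answer) pairs that all probe the same fact.

    The templates are illustrative: a direct question, its converse,
    and a negated form of each.
    """
    truth = is_left_of(a, b)
    return [
        (f"Is the {a.name} to the left of the {b.name}?", truth),
        (f"Is the {b.name} to the right of the {a.name}?", truth),
        (f"Is the {a.name} not to the left of the {b.name}?", not truth),
        (f"Is it false that the {b.name} is to the right of the {a.name}?", not truth),
    ]


def fraction_correct(answers: list[bool], prompts: list[tuple[str, bool]]) -> float:
    """Share of prompt variants answered correctly; 1.0 means fully consistent."""
    correct = sum(ans == expected for ans, (_, expected) in zip(answers, prompts))
    return correct / len(prompts)


mug = TableObject("mug", x=0.2)
plate = TableObject("plate", x=0.7)
prompts = equivalent_prompts(mug, plate)

# A responder that always says "yes" looks perfect on the affirmative
# wordings alone but is correct on only half of the variants overall.
always_yes = [True] * len(prompts)
score = fraction_correct(always_yes, prompts)
```

Scoring across all wordings, rather than on one, is what separates a responder that tracks the spatial relation from one that succeeds on a single phrasing.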
