Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
TL;DR
This paper defines five Basic Spatial Abilities (BSAs) for Visual Language Models (VLMs) within a psychometric framework and benchmarks 13 models on nine standardized tests, comparing the results to human baselines. It reveals a substantial gap in average score (VLMs $24.95$ vs. humans $68.38$), yet shows that VLMs mirror the human hierarchy across BSAs (best in Spatial Orientation, worst in Mental Rotation), with the BSAs appearing largely independent (Pearson's $r<0.4$). The study also finds that neither model manufacturer nor model size strictly tracks performance, with compact models sometimes outperforming larger ones, and that interventions such as Chain-of-Thought ($+0.100$ accuracy) and 5-shot training ($+0.259$) yield only limited improvements, which the authors attribute to architectural constraints. Finally, it identifies barriers such as weak geometry encoding and the lack of dynamic spatial simulation, arguing for neurosymbolic and geometry-prior architectures to advance embodied spatial intelligence in AI systems.
Abstract
The Theory of Multiple Intelligences underscores the hierarchical nature of cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals significant gaps versus humans (average score 24.95 vs. 68.38), with three key findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation, weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading (30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought (0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from architectural constraints. Identified barriers include weak geometry encoding and missing dynamic simulation. By linking psychometric BSAs to VLM capabilities, we provide a diagnostic toolkit for spatial intelligence evaluation, methodological foundations for embodied AI development, and a cognitive science-informed roadmap for achieving human-like spatial intelligence.
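The independence claim (Pearson's r<0.4 between BSAs) rests on pairwise correlations of per-model scores across the five abilities. The following is a minimal sketch of that kind of check, not the authors' evaluation code: the function names `pearson_r` and `pairwise_correlations` are hypothetical, and the scores in the usage example are toy placeholder values for four imaginary models, not results from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """Map each unordered pair of abilities to the correlation of their scores.

    `scores` maps an ability name to a list of per-model scores, with the
    models in the same order for every ability.
    """
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


if __name__ == "__main__":
    # Toy placeholder scores (assumption): one entry per imaginary model.
    toy_scores = {
        "Spatial Perception": [30.0, 22.0, 41.0, 18.0],
        "Spatial Relation": [25.0, 33.0, 20.0, 28.0],
        "Spatial Orientation": [45.0, 38.0, 50.0, 47.0],
        "Mental Rotation": [10.0, 14.0, 9.0, 16.0],
        "Spatial Visualization": [21.0, 19.0, 27.0, 24.0],
    }
    for (a, b), r in pairwise_correlations(toy_scores).items():
        # A pair with |r| below 0.4 would count as weakly related
        # under the threshold quoted in the abstract.
        print(f"{a} vs. {b}: r = {r:.3f}")
```

With five abilities this produces ten pairwise coefficients. Whether the paper compares r or |r| against the 0.4 threshold is not stated in the abstract, so the comment in the loop is an assumption.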
