Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models

Xiaoxing Lian, Aidong Yang, Jun Zhu, Peng Wang, Yue Zhang

TL;DR

This work introduces SpatiaLite, a fully synthetic benchmark designed to quantify both accuracy and efficiency of spatial reasoning in vision-language systems, and hypothesizes that imagination drives a spatial world model. It reveals a substantial reliance on linguistic representations for spatial tasks, with pronounced weaknesses on true visual-centric transformations like mental rotation and 3D predictions, plus severe inefficiency as task complexity grows. To address these gaps, the authors propose the Imagery-Driven Framework (IDF), a two-stage imagery-distillation pipeline that generates large-scale visual-imagery data to implicitly construct an internal spatial world model and improve reasoning. The findings highlight the limits of current VLMs in visual-spatial cognition and offer a concrete data- and training-driven path toward more robust spatial intelligence, with potential impact on robotics and autonomous systems.

Abstract

Large language models (LLMs) and vision language models (VLMs), such as DeepSeek R1, OpenAI o3, and Gemini 2.5 Pro, have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. However, spatial reasoning, a fundamental component of human cognition that includes mental rotation, navigation, and spatial relationship comprehension, remains a significant challenge for current advanced VLMs. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model. To test this hypothesis and systematically probe current VLM spatial reasoning mechanisms, we introduce SpatiaLite, a fully synthetic benchmark that jointly measures spatial reasoning accuracy and reasoning efficiency. Comprehensive experiments reveal three key findings. First, advanced VLMs predominantly rely on linguistic representations for reasoning and imagination, resulting in significant deficiencies on visual-centric tasks that demand perceptual spatial relations and 3D geometry transformations, such as mental rotation or projection prediction. Second, advanced VLMs exhibit severe inefficiency in their current spatial reasoning mechanisms, with token usage growing rapidly as transformation complexity increases. Third, we propose an Imagery-Driven Framework (IDF) for data synthesis and training, which can implicitly construct an internal world model that is critical for spatial reasoning in VLMs. Building on SpatiaLite, this work delineates the spatial reasoning limits and patterns of advanced VLMs, identifies key shortcomings, and informs future advances.

Paper Structure

This paper contains 33 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Spatial World Model: Imagination as the core mechanism, with two forms: linguistic and visual imagery
  • Figure 2: SpatiaLite Benchmark Task Category
  • Figure 3: The mental rotation task presents VLMs with eight isometric viewpoints of an irregular 3D structure made of colored cubes and asks them to infer its appearance from one orthogonal viewpoint.
  • Figure 4: Complexity setting for each task and difficulty level.
  • Figure 5: Accuracy comparison across different tasks and difficulty levels
  • ...and 2 more figures
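
To make the projection-style task described for Figure 3 concrete, here is a minimal Python sketch of computing an orthogonal view of a structure built from colored cubes. It is an illustration only, not the SpatiaLite generator: the `STRUCTURE` layout, the `project` function, and the 3x3 grid size are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Cube = Tuple[int, int, int]

# Hypothetical toy structure (not from the paper): colored unit cubes
# placed at integer (x, y, z) positions.
STRUCTURE: Dict[Cube, str] = {
    (0, 0, 0): "red",
    (1, 0, 0): "green",
    (1, 1, 0): "blue",
    (1, 1, 1): "yellow",
}


def project(
    cubes: Dict[Cube, str],
    axis: int,
    from_positive: bool = True,
    size: int = 3,
) -> List[List[str]]:
    """Orthographic view along `axis` (0 = x, 1 = y, 2 = z).

    For every line of sight, keep the color of the cube nearest to the
    viewer; cells with no cube are marked ".". The result is indexed by
    the two remaining axes in ascending order.
    """
    u_axis, v_axis = [a for a in range(3) if a != axis]
    view = [["." for _ in range(size)] for _ in range(size)]
    nearest: Dict[Tuple[int, int], int] = {}  # (u, v) -> depth of visible cube
    for pos, color in cubes.items():
        u, v, depth = pos[u_axis], pos[v_axis], pos[axis]
        key = (u, v)
        if key not in nearest:
            is_nearer = True
        elif from_positive:
            is_nearer = depth > nearest[key]  # viewer sits on the +axis side
        else:
            is_nearer = depth < nearest[key]  # viewer sits on the -axis side
        if is_nearer:
            nearest[key] = depth
            view[u][v] = color
    return view


if __name__ == "__main__":
    # Top view (looking down the z axis): yellow hides the blue cube below it.
    for row in project(STRUCTURE, axis=2):
        print(row)
```

In this toy case the top view shows yellow where the view from below shows blue, which is the kind of occlusion a model has to track when inferring one viewpoint from others.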