Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao

TL;DR

This survey addresses the longstanding challenge of spatial reasoning in multimodal large language models (MLLMs) by proposing a cognitive-function-based taxonomy that transcends input modality. It systematically maps existing datasets and benchmarks to five cognitive categories and four levels of reasoning, analyzes evaluation metrics (including geometry-aware measures and human judgments), and reviews training- and inference-based methods to enhance spatial understanding. The authors identify three key gaps: the dominance of relational static tasks, limited metric reasoning, and weaknesses in dynamic and cross-view reasoning. They propose future directions: richer 3D representations, cognitively grounded benchmarks, and joint multimodal training to foster grounded, persistent spatial world models. Overall, the paper provides a principled framework and actionable guidance for advancing spatial intelligence in embodied AI systems.

Abstract

Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet it remains a persistent challenge for multimodal large language models (MLLMs). While existing surveys often categorize recent progress by input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from a cognitive perspective and divides tasks by reasoning complexity, linking them to several cognitive functions. We map existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual-perspective analysis clarifies their respective strengths and uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.

Paper Structure

This paper contains 36 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The gap between language intelligence and spatial intelligence for MLLMs: (a) Language tasks rely on discrete and sequential token processing, while spatial tasks require grounded reasoning in continuous 3D space. (b) This mismatch reflects the representation-level grounding problem—MLLMs model statistical co-occurrence rather than true geometric relations.
  • Figure 2: Spatial tasks for different application domains
  • Figure 3: Taxonomy of our survey. We introduce a cognitive taxonomy of spatial reasoning tasks, organizing them by function and reasoning complexity. We also map existing benchmarks, review evaluation metrics, and analyze training- and reasoning-based methods to improve spatial ability. The study highlights key gaps and future directions toward developing models with more human-like spatial intelligence.
  • Figure 4: Illustration of cognitive dimensions: Spatial reasoning can be decomposed along three cognitive dimensions: frame of reference (intrinsic vs. extrinsic), type of information (qualitative vs. quantitative), and nature of the task (static vs. dynamic). Each dimension reflects a distinct way humans and models encode, compare, or transform spatial relations.
  • Figure 5: Illustrative Examples for the Cognitive and Complexity-Based Taxonomy: This figure maps representative spatial reasoning tasks across five cognitive categories (x-axis) and four levels of reasoning complexity (y-axis). The taxonomy progresses from direct perception to advanced synthetic reasoning, distinguishing intrinsic vs. extrinsic, static vs. dynamic, and qualitative vs. quantitative cognition. Together, it illustrates how task complexity and cognitive function jointly define the difficulty and nature of spatial reasoning challenges for MLLMs.
  • ...and 2 more figures