Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

TL;DR

This survey addresses multimodal spatial reasoning in the large-model era. It formalizes inputs $\mathcal{X}=\{x^{\mathrm{img}},x^{\mathrm{vid}},x^{\mathrm{pc}},x^{\mathrm{aud}},x^{\mathrm{text}},\ldots\}$ under a reference frame (2D/3D/ego/allo) and outputs $\mathcal{Y}$ that include textual answers, geometric quantities, or executable actions. It systematizes tasks across 2D/3D grounding, VQA, navigation, scene generation, and embodied reasoning, and organizes progress along four axes: test-time scaling, post-training, architectural design, and explainability. The contributions are a taxonomy of spatial tasks, evaluation protocols, coverage of 3D and embodied AI, a discussion of emerging modalities such as audio and egocentric video, and open benchmarks with implementation details hosted at the project page. By cataloging methodologies (prompting, tool use, SFT, RL, input representations, and dedicated spatial modules) and benchmark ecosystems, the paper provides a foundation for standardized, reproducible evaluation and for progress toward grounded, multimodal spatial intelligence. The framework aligns cross-modal perception with spatial reasoning so that traditional VQA, 3D grounding, navigation, and layout synthesis can be treated under one formulation, supporting evaluation and cross-domain transfer across static, dynamic, and embodied contexts.
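The input/output formalization above can be made concrete with a small data-structure sketch. This is only an illustration under stated assumptions: the class names (`Frame`, `SpatialQuery`, `TextAnswer`, `GeometricQuantity`, `Action`) and the helper `output_kind` are hypothetical and do not come from the survey or its benchmark code; only the modality keys and the three output kinds follow the TL;DR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Tuple, Union


class Frame(Enum):
    """Reference frames named in the TL;DR (2D/3D/ego/allo)."""
    IMAGE_2D = "2d"
    WORLD_3D = "3d"
    EGOCENTRIC = "ego"
    ALLOCENTRIC = "allo"


@dataclass
class SpatialQuery:
    """Hypothetical container for the input set X under a reference frame."""
    frame: Frame
    # Keys mirror the superscripts of X: "img", "vid", "pc", "aud", "text".
    inputs: Dict[str, Any] = field(default_factory=dict)

    def modalities(self) -> Tuple[str, ...]:
        """Return the modality keys that are present, in sorted order."""
        return tuple(sorted(self.inputs))


# The output set Y: a textual answer, a geometric quantity, or an action.
@dataclass
class TextAnswer:
    text: str


@dataclass
class GeometricQuantity:
    values: Tuple[float, ...]  # e.g. a distance or box parameters
    unit: str


@dataclass
class Action:
    name: str
    params: Dict[str, float]


SpatialOutput = Union[TextAnswer, GeometricQuantity, Action]

# Illustrative labels for the three output kinds.
_KINDS = {TextAnswer: "text", GeometricQuantity: "geometry", Action: "action"}


def output_kind(y: SpatialOutput) -> str:
    """Map an output object to one of the three kinds listed in the TL;DR."""
    return _KINDS[type(y)]
```

For example, an egocentric video question with a navigation-style answer would be `SpatialQuery(Frame.EGOCENTRIC, {"vid": frames, "text": question})` paired with an `Action`, while a 3D grounding query would pair point-cloud and text inputs with a `GeometricQuantity`.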

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, along with code and implementations of the open benchmarks, can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.

Paper Structure

This paper contains 61 sections, 11 figures, 18 tables.

Figures (11)

  • Figure 1: (a) Various multimodal inputs for advanced spatial reasoning with MLLMs, such as 2D images [su2025pixel], 3D scenes [liu20253daxisprompt], and videos [ouyang2025spacer]. (b) Downstream tasks that build or rely on spatial reasoning, such as VLA [du2025vl], 3D layout generation [feng2023layoutgpt], and vision-language action [gemini-robotics].
  • Figure 2: Taxonomy for multimodal spatial reasoning with large models.
  • Figure 3: Typical MLLM architecture and strategies.
  • Figure 4: An overview of core spatial reasoning tasks in 3D space, including 3D visual grounding [wu2025data, cheng2024spatialrgpt], 3D scene reasoning [ma2024spatialpin, chen2024ll3da], and 3D generation [ocal2024sceneteller, wu2024diorama].
  • Figure 5: 3D visual grounding with MLLM [yang2024llm].
  • ...and 6 more figures