Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

TL;DR

The paper addresses the persistent limitations of multimodal LLMs in spatial intelligence (SI) by introducing EASI, a holistic evaluation framework that unifies eight recent SI benchmarks under a six-capability taxonomy and standardized protocols. Across these benchmarks, GPT-5 sets a new state of the art in SI but remains well below human-level performance on many tasks, with larger gaps in Perspective-taking (PT), Deformation and Assembly (DA), and Comprehensive Reasoning (CR). The study also shows that SI tasks expose greater capability deficiencies than non-SI tasks, and that proprietary models do not hold a decisive advantage over open-source ones on the hardest tasks. By releasing the EASI codebase and leaderboard, the work provides a reproducible foundation for cross-benchmark comparison and aims to accelerate community progress toward robust spatial intelligence.

Abstract

Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.

Paper Structure

This paper contains 26 sections, 9 equations, 5 figures, and 24 tables.

Figures (5)

  • Figure 1: While GPT-5 [openai_gpt5_systemcard] excels at solving complex non-spatial problems (left) that are considered challenging for humans, it surprisingly struggles with some of the most basic spatial intelligence tasks (right), which even a human child can comprehend effortlessly. GPT-5's detailed reasoning process for this case can be found in the case study appendix.
  • Figure 2: Six Fundamental Capabilities of Spatial Intelligence.
  • Figure 3: Case Study. We compare the performance of GPT-5 with thinking capability (GPT-5-thinking), the standard GPT-5 model, the previous strong thinking model GPT-o3 [gpt_o3], and another leading reasoning model, Doubao-Seed-1.6-thinking [seed2025seed1_5vl]. While GPT-5-thinking exhibits notable improvements over its predecessors, it remains far from conquering the full spectrum of spatial intelligence. For MR2 and MR3, Doubao-Seed-1.6-thinking is excluded from the visual comparisons because it cannot generate images. Note that this comparison uses the web-based services. The reasoning output and more examples can be found in the case study appendix.
  • Figure 4: EASI Prompts for cross-benchmark comparison. Note that only results reported with the EASI Protocol use EASI Prompts.
  • Figure 5: Cumulative distribution of token consumption. The horizontal axis represents the token usage, whereas the vertical axis represents the cumulative percentage of questions. Left: internal reasoning tokens (not applicable to open-source models); Right: externalized reasoning tokens.
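
As a reading aid for Figure 5, the following minimal sketch shows how a cumulative distribution of this kind could be computed from per-question token counts. It is an illustration only, not the authors' plotting code: the function name token_cdf and the example counts are hypothetical and do not come from the paper or the EASI codebase.

```python
from typing import List, Tuple


def token_cdf(token_counts: List[int]) -> List[Tuple[int, float]]:
    """Return (token_usage, cumulative_percent_of_questions) points.

    Each point says: this percentage of questions used at most this many
    tokens. Repeated token counts are merged into a single point.
    """
    n = len(token_counts)
    points: List[Tuple[int, float]] = []
    for rank, tokens in enumerate(sorted(token_counts), start=1):
        percent = 100.0 * rank / n
        if points and points[-1][0] == tokens:
            # Same token count as the previous question: keep one point
            # carrying the higher cumulative percentage.
            points[-1] = (tokens, percent)
        else:
            points.append((tokens, percent))
    return points


# Hypothetical per-question reasoning-token counts (not data from the paper).
example_counts = [120, 450, 300, 2000, 800]
for tokens, percent in token_cdf(example_counts):
    print(f"{tokens:>5} tokens -> {percent:5.1f}% of questions")
```

Plotting the first element of each point on the horizontal axis and the second on the vertical axis gives a curve with the same axes as those described in the caption.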