ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Yanpeng Zhao; Wentao Ding; Hongtao Li; Baoxiong Jia; Zilong Zheng

Abstract

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

Paper Structure (53 sections, 1 equation, 10 figures, 20 tables)

Introduction
Related Work
Spatial reasoning with vision-language models.
Simulation-based evaluation through robotic tasks.
Foundation models for robotics manipulation.
6-DoF object rearrangement.
Spatial-centric Evaluation of Embodied VLMs
The Espire Benchmark
Spatial Reasoning Tasks
Task specification.
Instruction representation.
Instruction families.
Simulation Environment
Environment representation and generation.
Reducing the real-to-sim visual gaps.
...and 38 more sections

Figures (10)

Figure 1: Espire: a simulated physical world. Top: the spatial world of Espire covers key factors of spatial reasoning, such as spatial aspects (e.g., relationship and distance), reference frames, and reference objects (§\ref{sec:espire-task}). It features a tabletop scene for pick tasks and a shelf scene for place tasks (§\ref{sec:espire-env}) and supports reasoning at varying granularities (see Table \ref{tab:spatial-aspect} in Appendix \ref{sup:env}). Bottom: example Espire tasks that all inherently rely on spatial reasoning.
Figure 2: Localization performance across spatial aspects and granularities on pick tasks.
Figure 3: Layouts of the tabletop and shelf scenes within Espire (best viewed in color). The light red region denotes the camera viewpoint sampling area, the light green region indicates where the robot end effector may appear, and the light blue region denotes where distant reference objects are placed. All labeled dimensions are in meters.
Figure 4: Example prompts with Qwen3-VL (continued).
Figure 5: Example prompts with Qwen3-VL.
...and 5 more figures
