pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that lets MLLMs interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools such as 3D reconstruction, camera-pose recovery, and novel-view rendering. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments in which a robot successfully traverses complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
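
To make the idea of reasoning explicitly over a 3D representation concrete, here is a minimal sketch of one geometric query such a program might issue once a scene and camera poses are available: where does an object lie relative to a given viewpoint? This is not pySpatial's API. The function names, the yaw-only camera model, and the axis convention (y up, yaw 0 looks along +z, +x is to the right, positive yaw turns right) are all assumptions chosen for illustration.

```python
import math


def world_to_camera(point, cam_pos, yaw_deg):
    """Express a world-frame point in a camera frame (hypothetical helper).

    Assumed convention, not taken from the paper: y is up, a camera at
    yaw 0 looks along +z with +x to its right, and positive yaw turns
    the camera to its right.
    """
    yaw = math.radians(yaw_deg)
    dx = point[0] - cam_pos[0]
    dy = point[1] - cam_pos[1]
    dz = point[2] - cam_pos[2]
    # Right axis is (cos yaw, 0, -sin yaw); forward axis is (sin yaw, 0, cos yaw).
    x_cam = dx * math.cos(yaw) - dz * math.sin(yaw)
    z_cam = dx * math.sin(yaw) + dz * math.cos(yaw)
    return (x_cam, dy, z_cam)


def relative_direction(point, cam_pos, yaw_deg, tol=1e-6):
    """Describe where a point lies relative to the camera, e.g. 'front-right'."""
    x_cam, _, z_cam = world_to_camera(point, cam_pos, yaw_deg)
    depth = "front" if z_cam > tol else "behind" if z_cam < -tol else "beside"
    side = "right" if x_cam > tol else "left" if x_cam < -tol else "center"
    return f"{depth}-{side}"


# An object ahead and to the right of the initial view ...
print(relative_direction((2.0, 0.0, 2.0), (0.0, 0.0, 0.0), 0.0))   # front-right
# ... appears ahead and to the left after the camera turns 90 degrees right.
print(relative_direction((2.0, 0.0, 2.0), (0.0, 0.0, 0.0), 90.0))  # front-left
```

The answer follows from explicit coordinates rather than from an implicit mental picture, which is the property the abstract attributes to reasoning over structured spatial representations.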

Paper Structure (21 sections, 1 equation, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Comparing our pySpatial with spatial mental models for multi-view spatial reasoning tasks. Unlike spatial mental models (Yin et al., 2025), which rely on the implicit imagination of MLLMs to construct a 2D cognitive map, we introduce pySpatial, a visual programming framework that flexibly composes spatial tools (e.g., 3D reconstruction, camera movements, and novel view synthesis) to enable MLLMs to explicitly reason in 3D space for diverse spatial reasoning tasks.
  • Figure 2: Qualitative results on four representative examples from MindCube. We show that pySpatial enables MLLMs to explicitly reason within the reconstructed explorable 3D scene, effectively addressing diverse spatial reasoning tasks through interpretable and executable 3D visual programs. Figure A1 further illustrates that pySpatial is capable of composing executable 3D visual programs with control flow constructs (e.g., for-loops), allowing it to robustly address a wide range of spatial reasoning tasks. Best viewed when zoomed in.
  • Figure 3: Qualitative results on real-world robot navigation. We deploy pySpatial on a Unitree-Go1 robot to navigate toward a target object (mushroom toy) using limited views as input. Compared to the GPT-4.1 baseline, which fails due to an incorrect initial turn and produces a collision-prone trajectory, pySpatial generates a geometrically consistent plan that successfully reaches the goal.
  • Figure 4: Failure case study. We manually examine the error sources in about 100 samples from MindCube.
  • Figure A1: More qualitative examples from MindCube. We show that pySpatial enables MLLMs to explicitly reason within a reconstructed, explorable 3D scene, allowing the model not only to interpret spatial structure but also to compose executable 3D visual programs with control flow, such as for-loops, to robustly solve diverse spatial reasoning tasks.
  • ...and 1 more figure
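
The captions above mention visual programs that use control flow such as for-loops. As a rough illustration of what a loop can contribute, the sketch below sweeps a camera through candidate yaw angles and keeps those from which a target falls inside the horizontal field of view. Everything here is hypothetical: the function names, the 60-degree field of view, and the axis convention (yaw 0 looks along +z, positive yaw turns toward +x) are assumptions for the example, not pySpatial's actual interface.

```python
import math


def bearing_deg(target, cam_pos, yaw_deg):
    """Signed horizontal angle from the camera's forward axis to the target.

    Positive means the target is to the right. Assumed convention: yaw 0
    looks along +z and positive yaw turns toward +x.
    """
    dx = target[0] - cam_pos[0]
    dz = target[2] - cam_pos[2]
    world_bearing = math.degrees(math.atan2(dx, dz))
    # Wrap the difference into [-180, 180).
    return (world_bearing - yaw_deg + 180.0) % 360.0 - 180.0


def visible_yaws(target, cam_pos, yaws, hfov_deg=60.0):
    """Return the candidate yaws from which the target is inside the horizontal FOV."""
    hits = []
    for yaw in yaws:  # the loop tries each candidate view in turn
        if abs(bearing_deg(target, cam_pos, yaw)) <= hfov_deg / 2.0:
            hits.append(yaw)
    return hits


# Target directly along +x from the camera; sweep views in 45-degree steps.
print(visible_yaws((3.0, 0.0, 0.0), (0.0, 0.0, 0.0), range(0, 360, 45)))  # [90]
```

A real program would presumably render each candidate view and inspect the image. The geometric test here only stands in for that step so the loop structure is visible.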