Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
TL;DR
This work asks whether Multimodal LLMs can perform 3D spatial reasoning without explicit 3D inputs by relying on structured 2D perception outputs. It introduces Struct2D, a prompting framework that converts RGB-D perception into BEV images, object marks, and object metadata (with optional keyframes) to guide reasoning. A zero-shot analysis of GPT-o3 shows strong spatial reasoning when the model is given these structured inputs, and large-scale instruction tuning on Struct2D-Set yields competitive results on VSI-Bench, 3D QA, dense captioning, and grounding. The findings suggest that structured 2D representations can bridge perception and language reasoning in MLLMs, offering a scalable path toward robust embodied understanding. The authors plan to release code and data to support further research.
Abstract
Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction-tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
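
To make the idea of structured 2D inputs concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each detected object carries a numeric mark, a label, and a 3D center in metres, with x pointing right and y pointing forward when the scene is viewed from above. All names here (`ObjectMeta`, `to_bev_pixel`, `format_metadata`, `relative_direction`) and the text format of the metadata are hypothetical. The sketch projects object centers into BEV pixel coordinates, formats object-centric metadata as prompt text, and computes the ground-truth answer for a relative direction question.

```python
from dataclasses import dataclass


@dataclass
class ObjectMeta:
    mark: int       # numeric mark that would be drawn on the BEV image
    label: str      # object category
    center: tuple   # (x, y, z) in scene coordinates, metres (assumed convention)


def to_bev_pixel(center, origin, metres_per_pixel):
    """Drop the height axis and map (x, y) to integer BEV (row, col) indices."""
    x, y, _ = center
    col = int(round((x - origin[0]) / metres_per_pixel))
    row = int(round((y - origin[1]) / metres_per_pixel))
    return row, col


def format_metadata(objects):
    """Render object-centric metadata as plain text lines for a prompt."""
    lines = []
    for o in objects:
        x, y, z = o.center
        lines.append(f"[{o.mark}] {o.label}: center=({x:.2f}, {y:.2f}, {z:.2f})")
    return "\n".join(lines)


def relative_direction(standing, facing, target):
    """Side of `target` for an observer at `standing` who looks toward `facing`.

    Uses the sign of the 2D cross product between the facing vector and the
    target vector in the ground plane (x right, y forward, seen from above).
    """
    fx, fy = facing[0] - standing[0], facing[1] - standing[1]
    tx, ty = target[0] - standing[0], target[1] - standing[1]
    cross = fx * ty - fy * tx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "aligned"


# Toy scene with made-up coordinates, for illustration only.
sofa = ObjectMeta(1, "sofa", (0.0, 0.0, 0.4))
tv = ObjectMeta(2, "tv", (0.0, 2.0, 1.0))
lamp = ObjectMeta(3, "lamp", (-1.0, 1.0, 0.5))

prompt_text = format_metadata([sofa, tv, lamp])
answer = relative_direction(sofa.center, tv.center, lamp.center)
```

In this toy scene, an observer standing at the sofa and facing the TV has the lamp on the left. A real pipeline would additionally render the BEV image, draw the marks at the projected pixels, and pass the image together with the metadata text to the MLLM.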
