Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

TL;DR

This work asks whether multimodal LLMs can perform 3D spatial reasoning without explicit 3D inputs by relying on structured 2D perception outputs. It introduces Struct2D, a prompting framework that converts RGB-D perception into BEV images, object marks, and object metadata (with optional keyframes) to guide reasoning. Through zero-shot analysis of GPT-o3 and large-scale instruction tuning on Struct2D-Set, the approach demonstrates strong spatial reasoning capabilities and yields competitive results on VSI-Bench, 3D QA, dense captioning, and grounding. The findings suggest that structured 2D representations can bridge perception and language reasoning in MLLMs and offer a scalable path toward embodied understanding. The authors plan to release code and data to support further research.

Abstract

Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
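
As a rough illustration of what "object marks and object-centric metadata" could look like in practice, the sketch below projects 3D object centers onto a BEV image grid and formats one metadata line per object. This is not the authors' implementation: the `Object3D` record, the `to_bev_pixel` and `build_metadata` helpers, the metadata line format, and the scene bounds and resolution parameters are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical record for one detected object (not from the paper)."""
    mark_id: int                          # numeric mark drawn on the BEV image
    label: str                            # object category
    center: tuple[float, float, float]    # (x, y, z) in metres, world frame


def to_bev_pixel(center, x_min, y_min, metres_per_pixel, image_height):
    """Drop the height axis and map world (x, y) to a BEV pixel (col, row).

    Assumes the BEV image has row 0 at the top, so the y axis is flipped.
    """
    x, y, _ = center
    col = int(round((x - x_min) / metres_per_pixel))
    row = image_height - 1 - int(round((y - y_min) / metres_per_pixel))
    return col, row


def build_metadata(objects, x_min, y_min, metres_per_pixel, image_height):
    """Format one text line per object: mark id, label, 3D center, BEV pixel."""
    lines = []
    for obj in objects:
        col, row = to_bev_pixel(obj.center, x_min, y_min,
                                metres_per_pixel, image_height)
        x, y, z = obj.center
        lines.append(
            f"[{obj.mark_id}] {obj.label}: "
            f"center=({x:.2f}, {y:.2f}, {z:.2f}) m, bev_pixel=({col}, {row})"
        )
    return "\n".join(lines)


scene = [Object3D(1, "chair", (1.0, 2.0, 0.5)),
         Object3D(2, "table", (3.0, 0.5, 0.4))]
metadata_text = build_metadata(scene, x_min=0.0, y_min=0.0,
                               metres_per_pixel=0.5, image_height=10)
```

The resulting text block would accompany the marked BEV image in the prompt, so the model can tie each numbered mark to a label and a metric position.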

Paper Structure

This paper contains 17 sections, 3 equations, 13 figures, 7 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Overview of our Struct2D framework for enabling spatial reasoning in Multimodal Large Language Models (MLLMs). From an RGB-D video, we generate structured 2D inputs—BEV images with filtered object marks, object-centric metadata, and optional keyframes—via a 3D perception module. These inputs prompt an MLLM with spatial priors and visual context, enabling diverse spatial reasoning tasks without explicit 3D input at inference.
  • Figure 2: Illustration of Struct2D prompting. Given an egocentric video, we first reconstruct a point cloud and detect 3D objects. A bird’s-eye-view (BEV) image is then rendered and annotated with the object marks relevant to the question. To facilitate reasoning about relative directions, the BEV is rotated to align with the agent’s facing direction. We further construct object-centric metadata and a structured guide prompt to help the model understand spatial relationships between objects.
  • Figure 3: Distribution of question types in the selected VSI-Bench subset. This follows the distribution of the full set.
  • Figure 4: Distribution of QA types in Struct2D-Set. The dataset covers a diverse range of spatial reasoning skills, with a focus on spatial relationships and localization tasks that require strong geometric understanding.
  • Figure 5: QA examples of Struct2D-Set. Examples cover diverse spatial reasoning tasks, including object attributes, counting, relative positioning, navigation, and comparative reasoning. Each QA pair includes a short answer from 3D geometry and an augmented answer with detailed reasoning generated by ChatGPT.
  • ...and 8 more figures
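
The BEV rotation mentioned in the Figure 2 caption can be illustrated with a small geometric sketch: expressing an object's position in a frame where the agent sits at the origin and faces "up" makes left/right and front/back readable directly from the sign of each coordinate. This is a minimal sketch under stated assumptions, not the paper's code: the function names, the four quadrant labels, and the world-frame convention (x to the east, y to the north, viewed from above) are all chosen for this example.

```python
import math


def rotate_to_agent_frame(point, agent_pos, facing):
    """Return (right, forward) offsets of `point` relative to the agent.

    `facing` is a 2D direction vector in the world frame; it is normalised
    here, so only its direction matters. After the transform the agent's
    facing direction is the +forward axis and its right-hand side is +right.
    """
    dx = point[0] - agent_pos[0]
    dy = point[1] - agent_pos[1]
    norm = math.hypot(facing[0], facing[1])
    fx, fy = facing[0] / norm, facing[1] / norm
    forward = dx * fx + dy * fy   # projection onto the facing direction
    right = dx * fy - dy * fx     # projection onto the facing vector rotated by -90 degrees
    return right, forward


def relative_direction(point, agent_pos, facing):
    """Classify `point` into one of four quadrants around the agent."""
    right, forward = rotate_to_agent_frame(point, agent_pos, facing)
    front_back = "front" if forward >= 0 else "back"
    left_right = "right" if right >= 0 else "left"
    return f"{front_back}-{left_right}"


# Agent at the origin facing east (+x); an object to the north-east
# is ahead of the agent and on its left-hand side.
label = relative_direction(point=(1.0, 1.0), agent_pos=(0.0, 0.0), facing=(1.0, 0.0))
```

Rotating the rendered image itself would apply the same transform to every pixel. Working on object coordinates, as here, is only a convenient way to check the geometry.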