SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

Chenyang Ma; Kai Lu; Ta-Ying Cheng; Niki Trigoni; Andrew Markham

TL;DR

This work presents SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner.

Abstract

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and extends to downstream robotics tasks such as pick and stack and trajectory planning.

Paper Structure (20 sections, 5 equations, 9 figures, 10 tables)

Introduction
Related Work
Method
2D Image Scene Understanding
Coarse 3D Scene Understanding
Fine-Grained 3D Scene Understanding
Combining External Tools for Downstream Tasks
Experiments
Spatial Visual Question Answering
Robotics Pick and Stack
Discovering and Planning for Robotics Tasks from a Single Image
Ablation Study
Discussion and Conclusion
Overview
Partial 3D Scene Reconstruction Details
...and 5 more sections

Figures (9)

Figure 1: We present SpatialPIN, a framework to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with 3D priors in a zero-shot, training-free manner.
Figure 2: SpatialPIN. Our plug-and-play framework is fully modularized and designed for zero-shot deployment. Each module can be easily replaced with the latest updates. Exact prompts for VLMs are in the Appendix.
Figure 3: Our method of partial 3D scene reconstruction (a). The reconstructed scene (b) and the input image (c) show high alignment.
Figure 4: Qualitative examples of spatial VQA. SpatialPIN outputs answers with fine-grained 3D reasoning. Zoom in for a better view.
Figure 5: Qualitative examples of pick and stack (top) and task trajectory planning (bottom). SpatialPIN successfully outputs picking and stacking policies using spatial reasoning and plans 3D trajectories with geometric awareness to align with task descriptions.
...and 4 more figures
