Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu; Fangneng Zhan; Kaichen Zhou; Yilun Du; Paul Pu Liang; Hanspeter Pfister

TL;DR

This work tackles the 3D reasoning gap in vision-language models by introducing SandboxVLM, a training-free framework that injects abstract 3D structure into existing VLMs through proxy elevation and multi-view abstraction. By generating informative multi-view priors, lifting objects into coarse 3D proxies, and clustering them into oriented 3D bounding boxes, SandboxVLM provides a compact, interpretable 3D context that enables zero-shot spatial and physical reasoning. Across diverse benchmarks and backbones, it improves over baselines, including an 8.3% gain on SAT Real and improvements on PhysBench, demonstrating that coarse 3D abstractions can substantially improve embodied understanding without dense 3D supervision or retraining. The results suggest a scalable path toward generalized embodied intelligence that leverages abstract geometric priors in large multimodal models.

Abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
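To make the idea of an abstract bounding box concrete, the sketch below fits an oriented 3D box to a cluster of points and serializes it as a short text line that could be placed in a VLM prompt. It is an illustration under stated assumptions, not the SandboxVLM implementation: the PCA-based fit is a common stand-in for whatever box-fitting the pipeline actually uses, and the names `fit_oriented_box` and `describe_box`, as well as the text format, are hypothetical.

```python
import numpy as np


def fit_oriented_box(points):
    """Fit an oriented 3D box to an (N, 3) point array using PCA (assumed technique)."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    # Principal axes of the cluster: eigenvectors of the 3x3 covariance matrix.
    cov = np.cov(points - mean, rowvar=False)
    _, axes = np.linalg.eigh(cov)  # columns form an orthonormal basis
    # Express the points in the box frame and measure the extent along each axis.
    local = (points - mean) @ axes
    lo, hi = local.min(axis=0), local.max(axis=0)
    # Map the midpoint of the extents back to world coordinates.
    center = mean + axes @ ((lo + hi) / 2.0)
    size = hi - lo
    return center, size, axes


def describe_box(name, center, size):
    """Serialize a box as one compact text line (hypothetical prompt format)."""
    c = ", ".join(f"{v:.2f}" for v in center)
    s = ", ".join(f"{v:.2f}" for v in size)
    return f"{name}: center=({c}), size=({s})"
```

A text line of this kind is far more compact than a dense point cloud or mesh, which is the sense in which a coarse box can act as an interpretable 3D context for a model trained on 2D inputs.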
