Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

TL;DR

This work tackles the 3D reasoning gap in vision-language models by introducing SandboxVLM, a training-free framework that injects abstract 3D structure into existing VLMs through proxy elevation and multi-view abstraction. By generating informative multi-view priors, lifting objects into coarse 3D proxies, and clustering them into oriented 3D bounding boxes, SandboxVLM provides a compact, interpretable 3D context that enables zero-shot spatial and physical reasoning. Across diverse benchmarks and backbones, it achieves substantial gains over baselines, including notable improvements on SAT Real and PhysBench, demonstrating that coarse 3D abstractions can dramatically enhance embodied understanding without dense 3D supervision or retraining. The results suggest a scalable path toward improving generalized embodied intelligence by leveraging abstract geometric priors in large multimodal models.

Abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Motivation of SandboxVLM. (a) Existing VLMs are trained without 3D awareness. Training or supervised fine-tuning (SFT) of VLMs on 3D data suffers from a lack of 3D data and from forgetting. (b) Humans, however, reason effectively in 3D through coarse, relational understanding. (c) SandboxVLM follows this principle of abstract perception, providing a coarse but informative 3D context for zero-shot VLM reasoning.
  • Figure 2: Overview of the SandboxVLM pipeline. Given an input image and a textual query, the system builds a compact, 3D-aware, query-conditioned context for a vision-language model (VLM). A video diffusion prior first expands the input into a short multi-view sequence along imagined trajectories guided by abstract control provided by the VLM. Inside the 3D Sandbox module, an off-the-shelf depth estimator predicts per-frame depth and camera parameters, while the VLM identifies task-relevant objects that guide a 2D segmenter to produce instance masks. The masked regions are lifted into coarse 3D proxies and merged across views through a Multi-View Voting and Clustering step to form abstract 3D bounding boxes. Finally, informative renderings of these 3D abstractions are composed with the query and fed back into the VLM for spatial and physical reasoning.
  • Figure 3: Core modules of the 3D Sandbox. (a) Proxy Elevation: The VLM identifies task-relevant objects and their approximate locations. A segmentation model produces object masks, followed by mask erosion and farthest point sampling to select interior proxy pixels. (b) Multi-View Voting: The proxies are unprojected into 3D space and aggregated across views through a cross-view consistency check (“Agree to”) to filter out unreliable points. The remaining proxies are then clustered into boxes.
  • Figure 4: Visualization of the representations compared in the ablation study. (a) Input image to the system; (b) Scene graph generated by an expert model; (c) Reconstructed point cloud rendering; (d) Text description of 3D bounding boxes; (e) Rendered proxy points; (f) 3D Sandbox. The 3D Sandbox strikes a balance between informativeness and interpretability, providing vivid spatial cues while filtering out irrelevant details.
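
The proxy elevation and multi-view voting steps described in the Figure 3 caption can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names, the 4-neighbour erosion kernel, the choice of the first interior pixel as the sampling seed, the pinhole intrinsics (fx, fy, cx, cy), and the radius-based agreement rule in `vote` are all assumptions made for illustration. The caption does not specify how the "Agree to" check is computed, and the sketch assumes the points passed to `vote` have already been transformed into a shared world frame.

```python
import math

def erode(mask):
    """Binary erosion (assumed 4-neighbour kernel): keep a pixel only if
    it and its up/down/left/right neighbours are all inside the mask."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (mask[y][x] and mask[y - 1][x] and mask[y + 1][x]
                         and mask[y][x - 1] and mask[y][x + 1])
    return out

def farthest_point_sampling(points, k):
    """Greedily pick up to k points, each maximising its distance
    to the points already chosen (seed: first point, an assumption)."""
    if not points:
        return []
    chosen = [points[0]]
    dist = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        chosen.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return chosen

def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth
    into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def vote(points_per_view, radius, min_views):
    """Hypothetical consistency rule: keep a 3D point if at least
    min_views other views contain a point within radius of it."""
    kept = []
    for i, pts in enumerate(points_per_view):
        for p in pts:
            support = sum(
                any(math.dist(p, q) <= radius for q in other)
                for j, other in enumerate(points_per_view) if j != i
            )
            if support >= min_views:
                kept.append(p)
    return kept

# Toy usage: a 5x5 square mask inside a 7x7 image.
mask = [[1 <= y <= 5 and 1 <= x <= 5 for x in range(7)] for y in range(7)]
interior = [(x, y) for y, row in enumerate(erode(mask))
            for x, keep in enumerate(row) if keep]
proxies = farthest_point_sampling(interior, 3)

# Toy usage: the third view's point has no support and is filtered out.
views = [[(0.0, 0.0, 1.0)], [(0.01, 0.0, 1.0)], [(5.0, 5.0, 5.0)]]
consistent = vote(views, radius=0.05, min_views=1)
```

Eroding the mask before sampling keeps proxy pixels away from object boundaries, where depth estimates tend to be least reliable. Farthest point sampling then spreads the few retained proxies across the object's interior, so the downstream box fit sees its full extent rather than a single dense patch.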