3D Aware Region Prompted Vision Language Model

An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

TL;DR

SR-3D introduces a unified 3D-aware vision–language model that grounds single-view 2D and multi-view 3D data in a shared visual token space by injecting canonical 3D positional embeddings into a 2D VLM and employing a dynamic tiling-based region extractor. It enables flexible region prompting across frames, supports depth-based back-projection, and aligns multi-view data into a common canonical space for robust spatial reasoning without dense 3D annotations. Extensive experiments demonstrate state-of-the-art performance on 2D VLM benchmarks, 3D QA and video spatial benchmarks, and strong zero-shot generalization to in-the-wild videos. The work highlights practical benefits for 3D scene understanding and region-based spatial queries with scalable training and inference pipelines.

Abstract

We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
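
The summary above mentions depth-based back-projection and alignment of multi-view data into a common canonical space. The sketch below illustrates one plausible reading of those two steps using a standard pinhole camera model. It is not the authors' implementation: the function names (`backproject_pixel`, `camera_to_world`, `normalize_to_canonical`) are hypothetical, and the bounding-box normalization is an assumption, since this summary does not specify how the canonical space is defined.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth into camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Point,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Point:
    """Apply a camera-to-world rigid transform: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def normalize_to_canonical(points: List[Point]) -> List[Point]:
    """Assumed canonicalization: center on the bounding box and scale so that
    every coordinate lies in [-1, 1]. The paper's actual scheme may differ."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    centers = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    # Largest half-extent; fall back to 1.0 if all points coincide.
    half = max((hi - lo) / 2.0 for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[i] - centers[i]) / half for i in range(3)) for p in points]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two views of the same scene; the second camera is shifted 1 m along x.
    p_a = camera_to_world(
        backproject_pixel(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0),
        identity, [0.0, 0.0, 0.0])
    p_b = camera_to_world(
        backproject_pixel(420, 240, 2.0, 500.0, 500.0, 320.0, 240.0),
        identity, [1.0, 0.0, 0.0])
    print(normalize_to_canonical([p_a, p_b]))
```

Under these assumptions, points from different frames end up in one shared coordinate frame, so a positional embedding computed from the normalized coordinates would be comparable across views.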

Paper Structure

This paper contains 24 sections, 6 figures, 16 tables.

Figures (6)

  • Figure 1: From precise region-based distance estimation (left), to intricate multi-view region query (middle), and global cross-frame reasoning (right), SR-3D delivers flexible and accurate spatial understanding to foundational Vision-Language Models. Notably, this video is obtained in the wild, without sensory 3D inputs, showcasing the remarkable generalization capability of our model.
  • Figure 2: The SR-3D architecture. Given an image or multi-view input with optional region prompts (e.g., bounding boxes or masks), we encode them along with depth-derived positional embeddings using a tiling approach. Region tokens are extracted by stitching masked features, while 3D positional embeddings are mapped to a shared canonical space in the multi-view setting, as shown on the bottom right.
  • Figure 3: RealWorldQA results. SR-3D shows stronger spatial understanding of physical environments compared to the base model. We omit the answer choices for clarity in visualization.
  • Figure 4: SR-3D results on region-level multi-view spatial understanding. We show extreme cases where the same region prompts are used across samples but with different target objects. SR-3D answers all queries correctly, showing strong evidence that it truly understands 3D spatial relationships.
  • Figure 5: VSI-Bench results. We highlight SR-3D’s outputs and include ground-truth values for numerical answers. The results show that SR-3D answers spatial questions correctly even without region prompts.
  • ...and 1 more figure