N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu

TL;DR

This work addresses the gap in vision-language models lacking intrinsic 3D perception by introducing N3D-VLM, a unified framework that performs native 3D object localization, grounding, and 3D spatial reasoning. It introduces a depth-aware, RGB-D architecture trained in two stages, supported by a large-scale 3D data generation pipeline that lifts 2D annotations into 3D and constructs 3D QA datasets. The authors demonstrate state-of-the-art performance on 3D grounding and 3D spatial reasoning across multiple benchmarks, and show that explicit 3D grounding enhances downstream reasoning. A new open benchmark, N3D-Bench, extends coverage to 264 object categories and multi-object reasoning with CoT, promoting more robust 3D VLM evaluation and generalization.

Abstract

While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space. This significantly increases the diversity and coverage of 3D object grounding data, yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.
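
The abstract describes lifting 2D annotations into 3D with estimated depth but does not spell out the procedure. The following is a minimal sketch of one common way to do such a lift, not the authors' pipeline: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a metric depth map, and an axis-aligned output box. The function name `lift_box_to_3d` and all of its parameters are hypothetical.

```python
import numpy as np

def lift_box_to_3d(depth, box_2d, fx, fy, cx, cy):
    """Back-project the pixels inside a 2D box into camera space.

    Assumptions (not from the paper): pinhole camera model, depth in
    metric units, box given as (x0, y0, x1, y1) with exclusive upper
    bounds, and an axis-aligned 3D box returned as (center, size).
    """
    x0, y0, x1, y1 = box_2d
    # Pixel coordinates covering the 2D box; shapes match the depth crop.
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    z = depth[y0:y1, x0:x1]
    valid = z > 0  # ignore pixels with missing depth
    if not valid.any():
        raise ValueError("no valid depth inside the 2D box")
    z = z[valid]
    # Pinhole back-projection: pixel (u, v) at depth z -> camera-space (x, y, z).
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (lo + hi) / 2, hi - lo

# Toy usage: a flat surface 2 units from the camera.
depth = np.full((4, 4), 2.0)
center, size = lift_box_to_3d(depth, (1, 1, 3, 3), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

A practical pipeline would likely restrict the points to an object mask rather than the full box and reject depth outliers before fitting the box; those steps are omitted here.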

Paper Structure

This paper contains 21 sections, 5 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Our unified vision-language model N3D-VLM performs native 3D grounding and subsequent spatial reasoning and answering. Given an RGB image and the corresponding text question, the model is capable of predicting 3D bounding boxes for specified objects and explicitly reasoning about spatial relations in 3D space.
  • Figure 2: Illustration of our data construction pipeline. We first lift annotations from existing 2D detection datasets with diverse object categories into 3D space, resulting in a large-scale and category-rich 3D detection annotation repository. Based on this repository, we generate data for 3D detection, 3D grounding, and 3D spatial reasoning QA tasks.
  • Figure 3: Illustration of our model design and quantitative comparison. (a) Overview of our model architecture and the cascaded spatial reasoning process. (b) Quantitative comparison showing that our model outperforms existing methods. (c) Definition of structured language representation for 3D bounding boxes.
  • Figure 4: Qualitative comparison of 3D grounding capability with Qwen3-VL-8B [Qwen3-VL]. Compared to Qwen3-VL-8B, our N3D-VLM generates 3D bounding boxes that are closer to the ground truth, reflecting stronger 3D understanding and localization precision. In the visualization, green boxes represent ground-truth 3D bounding boxes, and red boxes indicate the model’s predictions.
  • Figure 5: Qualitative comparison of 3D grounding capability with SpatialLM [mao2025spatiallm] and Qwen3-VL-8B [Qwen3-VL] in indoor scenes. Our N3D-VLM accurately localizes objects such as pillows and washing machines, while the baselines either miss objects or produce inaccurate predictions. In the visualization, green boxes represent ground-truth 3D bounding boxes, and red boxes indicate the model’s predictions.
  • ...and 6 more figures
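
Figure 1 describes reasoning explicitly about spatial relations once objects have been grounded in 3D. As a rough illustration of why grounded boxes make such reasoning straightforward, the sketch below derives a few relations from two box centers. It is not the model's reasoning procedure: the function `describe_relation`, its output keys, and the camera convention (x to the right, z pointing away from the camera) are all assumptions made for this example.

```python
import numpy as np

def describe_relation(center_a, center_b):
    """Derive simple spatial relations from two 3D box centers.

    Assumed convention (hypothetical, not from the paper): camera
    coordinates with x pointing right and z pointing forward.
    """
    a = np.asarray(center_a, dtype=float)
    b = np.asarray(center_b, dtype=float)
    return {
        # Euclidean distance between the two centers.
        "distance": float(np.linalg.norm(a - b)),
        # Smaller z means nearer to the camera under the assumed convention.
        "a_closer_to_camera": bool(a[2] < b[2]),
        # Smaller x means further left in the image under the assumed convention.
        "a_left_of_b": bool(a[0] < b[0]),
    }

relation = describe_relation((0.0, 0.0, 2.0), (3.0, 0.0, 6.0))
```

With explicit 3D coordinates, each answer reduces to a short, checkable computation, which is the sense in which grounded reasoning can be more interpretable than predicting an answer directly from pixels.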