SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye

TL;DR

This paper identifies a gap in 3D quantitative spatial reasoning for vision-language models and introduces MSMU, a large-scale dataset with precise metric annotations, together with MSMU-Bench for evaluation. It further proposes Depth Positional Encoding, which fuses depth information into VLMs and supplies explicit 3D spatial priors without requiring complex 3D inputs. Trained on MSMU, SD-VLM achieves state-of-the-art performance on MSMU-Bench and generalizes well to SpatialRGPT-Bench and Q-Spatial++, showing improved 3D spatial reasoning while maintaining broad VQA capabilities. By combining this data with a lightweight depth integration method, the work offers a practical route to equipping models with the accurate spatial understanding that embodied AI applications need.

Abstract

While vision-language models (VLMs) excel at 2D semantic visual understanding, their ability to reason quantitatively about 3D spatial relationships remains under-explored, because 2D images carry limited spatial information. In this paper, we analyze the problems hindering VLMs' spatial understanding and propose SD-VLM, a novel framework that significantly enhances the fundamental spatial perception abilities of VLMs through two key contributions: (1) the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset covers a wide range of quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM that shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also generalizes to other spatial understanding benchmarks, including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
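
The abstract describes the depth positional encoding only at a high level. As a rough illustration of the general idea, and not the authors' implementation, the sketch below averages a depth map over image patches, turns each patch depth into a sinusoidal vector, and adds it to the matching visual token. The function names (`pool_depth`, `sinusoidal_encoding`, `add_depth_encoding`), the sinusoidal form, the base of 10000, and the additive fusion are all assumptions made for this example.

```python
import math

def pool_depth(depth, patch):
    """Average an H x W depth map (list of rows) over patch x patch cells.

    Returns one depth value per patch in row-major order.
    """
    h, w = len(depth), len(depth[0])
    pooled = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            cell = [depth[i][j]
                    for i in range(r, min(r + patch, h))
                    for j in range(c, min(c + patch, w))]
            pooled.append(sum(cell) / len(cell))
    return pooled

def sinusoidal_encoding(value, dim, base=10000.0):
    """Encode a scalar as interleaved sin/cos features (dim must be even)."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

def add_depth_encoding(tokens, depth, patch):
    """Add a depth-derived encoding to each visual token (assumed fusion)."""
    pooled = pool_depth(depth, patch)
    if len(pooled) != len(tokens):
        raise ValueError("need exactly one token per depth patch")
    dim = len(tokens[0])
    return [[t + e for t, e in zip(tok, sinusoidal_encoding(d, dim))]
            for tok, d in zip(tokens, pooled)]

# Toy usage: a 4 x 4 depth map, 2 x 2 patches, four 8-dimensional tokens.
depth_map = [[1, 1, 3, 3],
             [1, 1, 3, 3],
             [0, 0, 2, 2],
             [0, 0, 2, 2]]
visual_tokens = [[0.0] * 8 for _ in range(4)]
encoded = add_depth_encoding(visual_tokens, depth_map, patch=2)
```

Adding the encoding to existing tokens keeps the token count unchanged, which is one plausible reading of "lightweight"; the paper's actual design may differ.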

Paper Structure

This paper contains 36 sections, 11 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Demonstration of VQA pairs in MSMU. Our proposed dataset covers a range of quantitative spatial tasks involving multiple objects in the scene.
  • Figure 2: Overview of the data generation pipeline of MSMU. It consists of scene graph construction, 3D to 2D mapping, and QA generation.
  • Figure 3: Comparison of different spatial datasets and benchmarks.
  • Figure 4: Illustrations of different ways of integrating depth information.
  • Figure 5: The architecture of SD-VLM is designed to effectively integrate spatial information into vision-language models. We incorporate an additional depth estimation module, which is particularly useful when the ground-truth depth map is unavailable.
  • ...and 4 more figures