Visual Spatial Tuning

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

TL;DR

This work introduces Visual Spatial Tuning (VST), a unified framework to instill visuospatial perception and reasoning in Vision-Language Models without adding specialized encoders. It pairs a large perception dataset (VST-P) with a reasoning dataset (VST-R) in a progressive training pipeline, optionally extending to Vision-Language-Action (VLA) for grounded robotics. Empirical results show state-of-the-art performance on multiple spatial benchmarks and notable gains in VLA tasks, while preserving general multimodal capabilities. The approach demonstrates that scalable spatial priors can be learned and transferred to broader AI tasks, advancing physically grounded intelligence.

Abstract

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset of 135K samples that instructs models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSI-Bench. We also find that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

Paper Structure

This paper contains 22 sections, 3 equations, 14 figures, 33 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of our VST framework.
  • Figure 2: Overview of the VST dataset. (a) The distribution of VST-P, which is used for SFT. (b) The distribution of VST-R, which is used for CoT cold start and RL. 'SR' denotes spatial reasoning, and 'GR' denotes general reasoning.
  • Figure 3: Data engines of VST (left) and the capabilities they enable in VST-Model (right).
  • Figure 4: (a) The VST model, which incorporates spatial perception and reasoning capabilities. (b) The VST-based VLA model, capable of generating action sequences through an action de-tokenizer.
  • Figure 5: Comparison with state-of-the-art VLMs on VSI-Bench.
  • ...and 9 more figures