Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning

Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao

TL;DR

Vision-language models struggle with 2D spatial reasoning, which is critical for navigation and interaction. The authors decompose spatial understanding into direction comprehension, distance estimation, and localization, and introduce Sparkle, a synthetic-data framework, to train these basics and test their compositional generalization to tasks such as the shortest path problem (SPP) and the traveling salesman problem (TSP). Sparkle yields substantial gains on basic and composite tasks across multiple VLMs and improves out-of-distribution generalization to real-world spatial benchmarks, without sacrificing overall performance. The work suggests a systematic, data-efficient path to boost spatial understanding in VLMs via targeted synthetic supervision, with potential impact on embodied AI and real-world multimodal reasoning. Limitations include reliance on synthetic 2D visuals and a focus on basic spatial capabilities, pointing to future work in temporal and 3D spatial reasoning.

Abstract

Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. Many spatial reasoning tasks depend on fundamental two-dimensional (2D) skills, yet our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems, including simple pathfinding tasks that humans solve effortlessly. To address this, we enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities. We first disentangle 2D spatial reasoning into three core components: direction comprehension, distance estimation, and localization. We hypothesize that mastering these skills substantially improves performance on complex spatial tasks that require advanced reasoning and combinatorial problem solving, while also generalizing to real-world scenarios. To test this, we introduce Sparkle, a framework that generates synthetic data to provide targeted supervision across these three capabilities and yields an instruction dataset for each. Experiments show that VLMs fine-tuned with Sparkle improve not only on basic tasks but also on composite and out-of-distribution real-world spatial reasoning tasks. These results indicate that enhancing basic spatial skills through synthetic generalization effectively advances complex spatial reasoning and offers a systematic strategy for boosting the spatial understanding of VLMs. The source code of Sparkle is available at https://github.com/YihongT/Sparkle.
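
To make the three basic capabilities concrete, the following is a minimal sketch of how ground-truth labels for direction, distance, and localization could be derived from two points on a 2D grid. It is not the Sparkle implementation: the function names, the question wording, the record layout, and the convention that y increases upward are all assumptions made for illustration, and rendering the image itself is omitted.

```python
import math
import random


def sample_points(rng, grid=10):
    """Draw two distinct integer points on a grid x grid board (hypothetical setup)."""
    a = (rng.randrange(grid), rng.randrange(grid))
    b = a
    while b == a:
        b = (rng.randrange(grid), rng.randrange(grid))
    return a, b


def direction(a, b):
    """Coarse direction of b relative to a; assumes y increases upward."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    horiz = "right" if dx > 0 else "left" if dx < 0 else ""
    vert = "up" if dy > 0 else "down" if dy < 0 else ""
    return "-".join(part for part in (vert, horiz) if part)


def distance(a, b):
    """Euclidean distance between the two points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def build_instructions(a, b):
    """One question/answer record per basic capability (illustrative format only)."""
    return [
        {"task": "direction",
         "question": "In which direction is object B relative to object A?",
         "answer": direction(a, b)},
        {"task": "distance",
         "question": "How far apart are object A and object B?",
         "answer": round(distance(a, b), 2)},
        {"task": "localization",
         "question": "What are the coordinates of object B?",
         "answer": list(b)},
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    a, b = sample_points(rng)
    for record in build_instructions(a, b):
        print(record)
```

Because every answer is computed from the sampled coordinates, labels of this kind need no manual annotation, which is what makes targeted synthetic supervision inexpensive to scale.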

Paper Structure

This paper contains 45 sections, 1 equation, 16 figures, 3 tables.

Figures (16)

  • Figure 1: VLMs fail to solve the pathfinding problem
  • Figure 2: The proposed Sparkle framework.
  • Figure 3: An instruction data sample from Sparkle.
  • Figure 4: Evaluation samples used in our experiments.
  • Figure 5: Sparkle variants: Sparkle; Sparkle without numerical information; Sparkle (Localization); Sparkle (Distance); Sparkle (Direction).
  • ...and 11 more figures