SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

TL;DR

This work addresses 3D spatial understanding in multimodal LLMs by introducing SpatialThinker, a 3D-aware MLLM that combines scene-graph grounded reasoning with online reinforcement learning. Training uses a dense, multi-objective reward (format, count, accuracy, and spatial grounding, combined through lexicographic gating) on the compact STVQA-7K dataset (extensible to ~108K), so the model learns spatial reasoning from limited data. SpatialThinker-7B performs strongly across twelve spatial and real-world VQA benchmarks, often surpassing open-source baselines and approaching or exceeding GPT-4o on several tasks, and it generalizes out of distribution significantly better than sparse RL or supervised fine-tuning baselines. The results indicate that structured scene grounding combined with dense reward signals enables data-efficient, human-like 3D spatial reasoning in vision-language models, with practical implications for embodied AI and real-world perception tasks.
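To make "lexicographic gating" concrete, here is a minimal sketch of how the four reward terms might be combined. Only the term names (format, count, accuracy, spatial grounding) come from the summary above. The gating order, the weights, and the names `RewardParts`, `WEIGHTS`, and `gated_reward` are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    format_ok: bool   # response follows the required output structure
    count: float      # in [0, 1]: agreement of predicted object/relation counts
    accuracy: float   # 1.0 if the final answer is correct, else 0.0
    spatial: float    # in [0, 1]: quality of the spatial grounding


# Hypothetical weights chosen for illustration only (they sum to 1.0).
WEIGHTS = {"format": 0.1, "count": 0.2, "accuracy": 0.5, "spatial": 0.2}


def gated_reward(parts: RewardParts) -> float:
    """Combine reward terms so that earlier criteria gate later ones."""
    # Gate 1 (assumed): a malformed response earns nothing at all.
    if not parts.format_ok:
        return 0.0
    total = (
        WEIGHTS["format"]
        + WEIGHTS["count"] * parts.count
        + WEIGHTS["accuracy"] * parts.accuracy
    )
    # Gate 2 (assumed): grounding is rewarded only when the answer is correct,
    # so the model cannot trade answer quality for grounding reward.
    if parts.accuracy >= 1.0:
        total += WEIGHTS["spatial"] * parts.spatial
    return total
```

Under these assumptions, a well-formatted but wrong answer with perfect grounding receives only the format and count terms, which is the behavior gating is meant to enforce.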

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
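The scene graph described in the abstract can be pictured as a small data structure of objects and relations. The sketch below is an assumption-laden illustration: the class names, the `(x1, y1, x2, y2)` box convention, and the use of intersection-over-union (IoU) as a grounding score are not taken from the paper, which this summary does not detail.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class SceneObject:
    name: str
    box: Box


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # e.g. "behind", "beside", "in front of"
    obj: str


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a two-object graph with one spatial relation.
graph = SceneGraph(
    objects=[
        SceneObject("cup", (0.0, 0.0, 2.0, 2.0)),
        SceneObject("laptop", (1.0, 1.0, 3.0, 3.0)),
    ],
    relations=[Relation("cup", "in front of", "laptop")],
)
overlap = iou(graph.objects[0].box, graph.objects[1].box)
```

A score such as `iou` between predicted and reference boxes is one plausible way to turn grounding quality into a dense signal in [0, 1].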

Paper Structure

This paper contains 39 sections, 5 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Method overview of SpatialThinker. Our framework integrates structured scene-graph grounded reasoning with multi-objective dense RL to enhance 3D spatial understanding in multimodal large language models.
  • Figure 2: Distribution of QA types in STVQA-7K. The dataset spans a diverse range of spatial reasoning skills, covering spatial relations, localization, existence, reach, depth, distance, size, count and orientation.
  • Figure 3: Qualitative comparison between GPT-4o and SpatialThinker-7B. GPT-4o often fails to distinguish objects at a 3D relational level, for example by confusing spatial relations such as beside, behind, and in front of, or by missing fine-grained object details.
  • Figure 4: STVQA-7K dataset construction pipeline.
  • Figure 5: Examples of generated QA pairs across the nine spatial reasoning categories in STVQA-7K. Each category highlights distinct reasoning skills, ranging from relative spatial relations and depth ordering to distance, size, orientation, reach, location, count and existence.
  • ...and 2 more figures