SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
TL;DR
This work addresses the challenge of 3D spatial understanding in multimodal LLMs by introducing SpatialThinker, a 3D-aware MLLM that combines scene-graph-grounded reasoning with online reinforcement learning. It uses a dense, multi-objective reward (format, count, accuracy, and spatial grounding, combined with lexicographic gating) and trains on the compact STVQA-7K dataset (extensible to ~108K) to learn robust spatial reasoning from limited data. SpatialThinker-7B achieves strong performance across twelve spatial and real-world VQA benchmarks, often surpassing open-source baselines and approaching or exceeding GPT-4o on several tasks, while generalizing significantly better out of distribution than sparse RL or supervised fine-tuning baselines. The results underscore the value of structured scene grounding combined with dense reward signals for data-efficient, human-like 3D spatial reasoning in vision-language models, with practical implications for embodied AI and real-world perception tasks.
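To make the idea of a lexicographically gated, multi-objective reward more concrete, here is a minimal sketch. It assumes that the format check gates every other term and that answer correctness gates the spatial grounding term. The gate order, the weights, and all names (`RewardParts`, `gated_reward`, `WEIGHTS`) are illustrative assumptions, not values or code taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Per-response reward components (hypothetical container)."""
    format_ok: bool        # response follows the expected output template
    count_score: float     # in [0, 1]: agreement on object/relation counts
    answer_correct: bool   # final answer matches the ground truth
    spatial_score: float   # in [0, 1]: quality of the predicted grounding


# Illustrative weights only; assumed for this sketch, not from the source.
WEIGHTS = {"format": 0.1, "count": 0.2, "accuracy": 0.5, "spatial": 0.2}


def gated_reward(parts: RewardParts, weights=WEIGHTS) -> float:
    """Combine reward terms so lower-priority terms only count once
    higher-priority ones are satisfied (assumed gate order)."""
    # Gate 1: a malformed response earns nothing at all.
    if not parts.format_ok:
        return 0.0

    total = weights["format"]
    total += weights["count"] * parts.count_score

    # Gate 2: spatial grounding is rewarded only for correct answers,
    # so the policy cannot trade answer accuracy for grounding reward.
    if parts.answer_correct:
        total += weights["accuracy"]
        total += weights["spatial"] * parts.spatial_score

    return total


malformed = gated_reward(RewardParts(False, 1.0, True, 1.0))
wrong_answer = gated_reward(RewardParts(True, 1.0, False, 1.0))
perfect = gated_reward(RewardParts(True, 1.0, True, 1.0))
```

Under these assumed weights, a well-formatted response with a wrong answer keeps only the format and count terms, however good its grounding is. That is the intended effect of gating: the dense terms shape learning without offering a shortcut around correctness.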
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker makes two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward that enforces spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
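As a rough illustration of what a scene graph of task-relevant objects and spatial relations might look like in code, the sketch below stores objects with bounding boxes and relations as subject-predicate-object triples. The class names, field names, and box convention are assumptions made for this example. The abstract does not specify the representation the model actually emits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SceneObject:
    obj_id: str
    label: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
    bbox: Tuple[float, float, float, float]


@dataclass(frozen=True)
class Relation:
    subject: str    # obj_id of the subject
    predicate: str  # e.g. "on", "left of", "behind"
    object: str     # obj_id of the object


@dataclass
class SceneGraph:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.obj_id] = obj

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Reject relations that mention objects missing from the graph.
        for node in (subject, obj):
            if node not in self.objects:
                raise KeyError(f"unknown object id: {node}")
        self.relations.append(Relation(subject, predicate, obj))

    def relations_of(self, obj_id: str) -> List[Relation]:
        """Return every relation in which obj_id takes part."""
        return [r for r in self.relations if obj_id in (r.subject, r.object)]


# Hypothetical example: a cup on a table, with a chair left of the table.
graph = SceneGraph()
graph.add_object(SceneObject("cup_1", "cup", (120.0, 80.0, 160.0, 130.0)))
graph.add_object(SceneObject("table_1", "table", (40.0, 120.0, 400.0, 300.0)))
graph.add_object(SceneObject("chair_1", "chair", (0.0, 150.0, 60.0, 320.0)))
graph.add_relation("cup_1", "on", "table_1")
graph.add_relation("chair_1", "left of", "table_1")
```

Keeping only the objects a question refers to, rather than every object in the image, keeps the graph small and makes the grounding easier to check against reference annotations.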
