Thinking with Spatial Code for Physical-World Video Reasoning

Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille

TL;DR

This work proposes a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and further fine-tunes LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference.

Abstract

We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art. Code is available at https://github.com/Beckschen/spatialcode.
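
To make the idea of "structured spatial code" concrete, here is a minimal sketch of how per-object 3D oriented boxes and semantic labels might be serialized into text for an LLM. The field names (`name`, `label`, `center`, `size`, `yaw_deg`), the units, and the line format are all assumptions made for illustration; this page does not show the paper's actual spatial code format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialObject:
    """One tracked object. Every field name here is hypothetical."""
    name: str                           # instance identifier, e.g. "Sofa1"
    label: str                          # semantic label, e.g. "sofa"
    center: Tuple[float, float, float]  # box center (x, y, z), assumed meters
    size: Tuple[float, float, float]    # box extents along its own axes, assumed meters
    yaw_deg: float                      # heading about the vertical axis, in degrees


def to_spatial_code(objects: List[SpatialObject]) -> str:
    """Serialize objects into one text line each, suitable for an LLM prompt."""
    lines = []
    for obj in objects:
        cx, cy, cz = obj.center
        sx, sy, sz = obj.size
        lines.append(
            f"{obj.name} [{obj.label}] "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}) "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}) "
            f"yaw={obj.yaw_deg:.1f}deg"
        )
    return "\n".join(lines)


# Illustrative scene with made-up values, not data from the paper.
scene = [
    SpatialObject("Sofa1", "sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8), 90.0),
    SpatialObject("Table1", "table", (1.5, 1.0, 0.35), (1.2, 0.6, 0.7), 0.0),
]
spatial_code = to_spatial_code(scene)
print(spatial_code)
```

A text-only LLM would then receive `spatial_code` alongside the question, so every quantity it reasons over is an explicit variable rather than something inferred from pixels.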

Paper Structure (53 sections, 24 equations, 8 figures, 14 tables)


Figures (8)

  • Figure 1: Thinking with Spatial Code enables superior spatial reasoning from video. Left: Unlike current state-of-the-art multimodal LLMs (MLLMs) that reason directly over the raw RGB image or video, our approach first parses video into explicit 3D spatial codes, then prompts a text-only LLM to reason over these symbolic descriptions. Right: On VSI-Bench [yang2025thinking], our method fine-tuned on Qwen3-4B [yang2025qwen3] significantly outperforms leading MLLMs including GPT-5o, Gemini-2.5, and Qwen3-VL in video-spatial reasoning accuracy. Reinforcement learning with spatial rubric rewards further improves performance. Dot size indicates model scale (4B to 230B parameters; GPT and Gemini sizes are undisclosed). This demonstrates that the quality of 3D spatial representation, rather than model scale alone, is the key bottleneck for spatial reasoning.
  • Figure 2: Overview. (a) Encoding Video to Spatial Code: The Spatial Encoder processes video through a dual-encoder architecture. The SAM-2 [ravi2024sam] encoder extracts object-level features $F_{\text{sam}}$ with temporal attention, while the Depth Encoder (from Depth Anything 3 [lin2025depth]) extracts spatial features $F_{\text{dep}}$. Cross-attention fuses these representations into $F_{\text{ca}}$, which feeds into a 3D Head for predicting 3D object bounding boxes with 3D orientation and a Depth Head for dense geometric supervision. Outputs are structured into symbolic spatial codes encoding object categories, positions, sizes, and orientations. (b) Prompting LLMs with Spatial Code: The spatial codes serve as explicit, interpretable inputs to LLMs for spatial reasoning. Given a query requiring perspective-aware understanding (e.g., "Where is Table1 relative to the Sofa, from sofa's perspective?"), the LLM reasons directly over the structured 3D representations to produce geometrically grounded answers.
  • Figure 3: Comparison of Spatial Rubric Reward (c) against conventional SFT (a) and RL (b). Unlike traditional methods, our framework utilizes the 3D spatial codes as primary input. Applying a structured spatial rubric reward to model rollouts significantly improves the quality of spatial reasoning.
  • Figure 4: Qualitative comparison. We show three examples where Thinking with Spatial Code succeeds while Gemini 2.5 Pro fails. (a) Perspective-aware reasoning with detailed reasoning trace: The question requires the model to reason from a specific observer viewpoint. Video-based models confuse absolute positions with observer-relative directions, while our spatial codes enable step-by-step coordinate transformation with precise calculation and significantly improve reasoning accuracy. (b) Orientation-aware reasoning: The question requires understanding object orientation (yaw angles). MLLMs rely on visual appearance, while our spatial codes provide explicit orientation parameters for accurate inference. (c) 3D distance estimation: The task requires metric depth measurements. MLLMs use ambiguous 2D visual cues, while our spatial codes provide precise 3D coordinates for reliable distance calculation.
  • Figure 5: High-level prompt structure for spatial reasoning tasks.
  • ...and 3 more figures
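
The perspective-aware reasoning and 3D distance estimation described in the Figure 4 caption can be illustrated with a small sketch. It assumes a right-handed world frame with z up, yaw measured counter-clockwise from the +x axis in the ground plane, and an observer whose "forward" direction is its yaw heading. These conventions, and the function names `to_observer_frame`, `relative_direction`, and `center_distance`, are hypothetical and are not taken from the paper.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def to_observer_frame(observer_xy: Vec2, observer_yaw_deg: float,
                      target_xy: Vec2) -> Tuple[float, float]:
    """Express the target's ground-plane offset as (forward, left) for the observer."""
    yaw = math.radians(observer_yaw_deg)
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    # Project the world-frame offset onto the observer's forward and left axes.
    forward = dx * math.cos(yaw) + dy * math.sin(yaw)
    left = -dx * math.sin(yaw) + dy * math.cos(yaw)
    return forward, left


def relative_direction(observer_xy: Vec2, observer_yaw_deg: float,
                       target_xy: Vec2) -> str:
    """Coarse observer-relative direction such as 'front-right'."""
    forward, left = to_observer_frame(observer_xy, observer_yaw_deg, target_xy)
    depth = "front" if forward >= 0 else "back"
    side = "left" if left >= 0 else "right"
    return f"{depth}-{side}"


def center_distance(a: Vec3, b: Vec3) -> float:
    """Euclidean distance between two 3D box centers."""
    return math.dist(a, b)


# Made-up example: a sofa at the origin facing +y, and a table nearby.
sofa_center, sofa_yaw = (0.0, 0.0, 0.4), 90.0
table_center = (1.5, 1.0, 0.35)

direction = relative_direction(sofa_center[:2], sofa_yaw, table_center[:2])
distance = center_distance(sofa_center, table_center)
print(direction, round(distance, 2))
```

The same table sits at a different relative direction if the sofa's yaw changes, which is the confusion between absolute positions and observer-relative directions that the caption attributes to video-based models.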