SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

TL;DR

SIMS-V introduces a systematic, simulator-guided framework for generating spatially rich, labeled video data to train multimodal language models. Through systematic ablations of question types, data mixes, and data scale, it identifies a minimal 3Q mix (Absolute Distance, Relative Direction, Appearance Order) that achieves strong real-world transfer with only thousands of simulated examples: a 7B-parameter model fine-tuned on 25K simulated examples outperforms a 72B baseline and is competitive with proprietary models. The approach generalizes across general video understanding, embodied reasoning, and real-world spatial tasks, while highlighting architectural considerations and limitations such as potential forgetting when training exclusively on simulated data. Overall, SIMS-V offers a scalable pathway to improve spatial reasoning in video-language models through controlled synthetic data and targeted supervision, with released resources to catalyze further research in sim-to-real spatial intelligence.

Abstract

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

Paper Structure

This paper contains 40 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: SIMS-V enables learning real-world spatial concepts in simulation. We generate spatially-rich videos with dense spatial annotations via privileged simulator data, creating diverse question-answer pairs. Models trained on this simulated data transfer effectively to real-world spatial reasoning benchmarks.
  • Figure 2: The SIMS-V pipeline generates diverse spatial training data with perfect ground truth. We procedurally generate 3D scenes using AI2-THOR (Kolve et al., 2017), ProcTHOR (Deitke et al., 2022), and Objaverse (Deitke et al., 2023), capture agent navigation trajectories, extract dense annotations (global spatial layout and per-frame observations), and programmatically generate quality-controlled question-answer pairs spanning diverse spatial reasoning categories. This systematic pipeline enables controlled ablations of question types and data configurations, maintaining perfect spatial ground truth.
  • Figure 3: Examples of different question types used in our experiments. Each question is shown alongside its corresponding visual context from a simulated environment. The questions span diverse spatial reasoning capabilities including numerical measurement, relative positioning, and temporal tracking. Full details of all question formats are provided in the question-templates appendix.
  • Figure 4: Training on individual question types yields large on-task gains with localized cross-task effects. We fine-tune LLaVA-Video-7B on 5k simulated questions of each question type and format (rows), evaluating each model on all VSI-Bench question types (columns). Values are performance $\Delta$ vs. the pretrained baseline (positive is green, negative is red).
  • Figure 5: Minimal 3Q mix is more data-efficient than comprehensive coverage. Left: Both training mixes show rapid improvement on VSI-Bench, with 3Q consistently outperforming the full baseline mix despite using only three question types. At 5K examples, we surpass Gemini-1.5 Flash; at 25K, we approach Gemini-1.5 Pro. Right: Distribution of question types in VSI-Baseline mix, which mirrors the VSI-Bench test set composition.
  • ...and 1 more figures
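
To make the "programmatically generate question-answer pairs" step from the Figure 2 caption more concrete, here is a minimal sketch of how the three 3Q categories could be derived from privileged simulator annotations. It is not the authors' implementation. The `ObjectAnnotation` record, its field names, the question wording, and the simplification to 2D floor-plan centroids (x to the right, y forward, viewed from above) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ObjectAnnotation:
    """Hypothetical per-object record; field names are illustrative only."""
    name: str
    position: Tuple[float, float]  # assumed floor-plan centroid (x, y) in meters
    first_seen_frame: int          # assumed first frame in which the object is visible


def absolute_distance_qa(a: ObjectAnnotation, b: ObjectAnnotation) -> dict:
    """Metric measurement: Euclidean distance between two object centroids."""
    dist = math.dist(a.position, b.position)
    return {
        "type": "absolute_distance",
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": round(dist, 1),
    }


def relative_direction_qa(standing: ObjectAnnotation,
                          facing: ObjectAnnotation,
                          target: ObjectAnnotation) -> Optional[dict]:
    """Perspective-dependent reasoning: is `target` left or right of the viewer?"""
    fx = facing.position[0] - standing.position[0]
    fy = facing.position[1] - standing.position[1]
    tx = target.position[0] - standing.position[0]
    ty = target.position[1] - standing.position[1]
    # 2D cross product: positive means the target lies counter-clockwise
    # from the facing direction, i.e. to the viewer's left in this convention.
    cross = fx * ty - fy * tx
    if cross == 0:
        return None  # collinear case is ambiguous, so skip it as a quality filter
    return {
        "type": "relative_direction",
        "question": (f"Standing at the {standing.name} and facing the {facing.name}, "
                     f"is the {target.name} to your left or right?"),
        "answer": "left" if cross > 0 else "right",
    }


def appearance_order_qa(objects: List[ObjectAnnotation]) -> dict:
    """Temporal tracking: order objects by the frame where each first appears."""
    ordered = sorted(objects, key=lambda o: o.first_seen_frame)
    listed = ", ".join(o.name for o in objects)
    return {
        "type": "appearance_order",
        "question": f"In what order do these objects first appear: {listed}?",
        "answer": [o.name for o in ordered],
    }
```

Because every answer is computed directly from simulator state, no human labeling is involved, which is the property the caption refers to as perfect spatial ground truth. A real pipeline would need to map the simulator's own coordinate frame onto whatever left/right convention the questions use.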