SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie
TL;DR
SIMS-V introduces a systematic, simulator-guided framework for generating spatially rich, labeled video data to train multimodal language models. By ablating question types, data mixes, and data scale in isolation, it identifies a minimal 3Q mix (Absolute Distance, Relative Direction, Appearance Order) that achieves strong real-world transfer with only thousands of simulated examples, and even rivals larger proprietary models at modest data budgets. The approach demonstrates robust generalization across general video understanding, embodied reasoning, and real-world spatial tasks, while highlighting architectural considerations and limitations such as potential forgetting when training exclusively on simulated data. Overall, SIMS-V offers a scalable pathway to improve spatial reasoning in video-language models through controlled synthetic data and targeted supervision, with released resources to catalyze further research in sim-to-real spatial intelligence.
Abstract
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
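To make the idea of "privileged information" concrete, the following is a minimal sketch of how the three question categories could be derived from simulator metadata. It is an illustration under stated assumptions, not the authors' pipeline: the `SimObject` record, the 2D floor-plane coordinates, the question wording, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SimObject:
    """Hypothetical per-object metadata a simulator could expose."""
    name: str
    position: Tuple[float, float]  # assumed floor-plane (x, y) in metres
    first_seen_frame: int          # assumed first frame in which it is visible


def absolute_distance_qa(a: SimObject, b: SimObject) -> Dict[str, str]:
    """Metric measurement: Euclidean distance between two objects."""
    dist = math.dist(a.position, b.position)
    return {
        "question": f"How far apart are the {a.name} and the {b.name}, in metres?",
        "answer": f"{dist:.1f}",
    }


def relative_direction_qa(at: SimObject, facing: SimObject,
                          target: SimObject) -> Dict[str, str]:
    """Perspective-dependent reasoning: left/right of a viewing direction.

    Uses the sign of the 2D cross product of the facing vector and the
    vector to the target (x to the right, y forward, seen from above).
    """
    fx = facing.position[0] - at.position[0]
    fy = facing.position[1] - at.position[1]
    tx = target.position[0] - at.position[0]
    ty = target.position[1] - at.position[1]
    cross = fx * ty - fy * tx
    side = "left" if cross > 0 else "right"
    return {
        "question": (f"Standing at the {at.name} and facing the {facing.name}, "
                     f"is the {target.name} to your left or right?"),
        "answer": side,
    }


def appearance_order_qa(objects: List[SimObject]) -> Dict[str, str]:
    """Temporal tracking: order in which objects first appear in the video."""
    ordered = sorted(objects, key=lambda o: o.first_seen_frame)
    return {
        "question": "In what order do these objects first appear in the video?",
        "answer": ", ".join(o.name for o in ordered),
    }


# Toy scene with made-up values, for illustration only.
sofa = SimObject("sofa", (0.0, 0.0), first_seen_frame=10)
tv = SimObject("tv", (0.0, 3.0), first_seen_frame=40)
lamp = SimObject("lamp", (-2.0, 1.0), first_seen_frame=5)

distance_example = absolute_distance_qa(sofa, tv)
direction_example = relative_direction_qa(sofa, tv, lamp)
order_example = appearance_order_qa([sofa, tv, lamp])
```

Because every answer is computed from exact simulator state, no human annotation is needed, which is the property the abstract relies on when it contrasts simulated data with real-world footage.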
