MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha
TL;DR
The paper addresses the gap in physics-driven reasoning for vision-language systems by introducing MASS-Bench, a large-scale benchmark with dense spatiotemporal annotations, and MASS, a model-agnostic module that injects depth-based 3D encoding and motion-grounded cues into the language space. MASS combines entity-centric grounding, 3D motion tracking, and depth cues, serializes them into language-aligned representations, and is further enhanced through reinforcement fine-tuning (GRPO) to improve cross-modal reasoning. Empirical results show that MASS-refined models outperform strong baselines by up to 8.7% and 6.0%, approaching the performance of closed-source state-of-the-art models on physics reasoning and comprehension and supporting the value of explicit spatiotemporal grounding for physics-aware video understanding. The work highlights the importance of reasoning over grounded cues and points to future directions in long-range motion grounding, scalable tracking, and richer training data to further reduce hallucinations in physics-centered video reasoning.
Abstract
Vision-Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.
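To make the idea of injecting spatial-temporal signals into the language space concrete, the following is a minimal sketch of how per-entity 3D motion tracks might be serialized into text placed ahead of a question. It is not the authors' implementation: the `EntityTrack` container, the function names, the `(x, y, depth)` coordinate convention, and the output wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # assumed convention: (x, y, depth)


@dataclass
class EntityTrack:
    """Hypothetical container for one tracked entity (not from the paper)."""
    name: str
    times: List[float]      # timestamps in seconds, one per tracked point
    points: List[Point3D]   # 3D position at each timestamp


def average_velocity(track: EntityTrack) -> Point3D:
    """Mean velocity between the first and last tracked points."""
    (x0, y0, z0), (x1, y1, z1) = track.points[0], track.points[-1]
    dt = track.times[-1] - track.times[0]
    if dt <= 0:
        return (0.0, 0.0, 0.0)  # degenerate track: no elapsed time
    return ((x1 - x0) / dt, (y1 - y0) / dt, (z1 - z0) / dt)


def format_point(p: Point3D, precision: int = 2) -> str:
    """Render a 3D tuple as fixed-precision text."""
    return "(" + ", ".join(f"{v:.{precision}f}" for v in p) + ")"


def serialize_track(track: EntityTrack) -> str:
    """Turn one track into a single line of language-aligned text."""
    return (
        f"[{track.name}] t={track.times[0]:.1f}s to {track.times[-1]:.1f}s: "
        f"from {format_point(track.points[0])} to {format_point(track.points[-1])}, "
        f"mean velocity {format_point(average_velocity(track))}"
    )


def build_prompt(question: str, tracks: List[EntityTrack]) -> str:
    """Prepend serialized motion context to a free-form question."""
    lines = ["Motion context (x, y, depth):"]
    lines += [serialize_track(t) for t in tracks]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In this sketch, a track for a ball that moves from `(0, 0, 4)` to `(2, 0, 2)` over two seconds would be summarized with a mean velocity of `(1.00, 0.00, -1.00)`, giving the language model an explicit cue that the object is moving sideways while approaching the camera. A real pipeline would obtain the tracks from a detector, a depth estimator, and a motion tracker, none of which are modeled here.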
