MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha

TL;DR

The paper addresses the gap in physics-driven reasoning for vision-language systems by introducing MASS-Bench, a large-scale benchmark with dense spatiotemporal annotations, and MASS, a model-agnostic module that injects depth-based 3D encoding and motion-grounded cues into the language space. MASS combines entity-centric grounding, 3D motion tracking, and depth cues, serializes them into language-aligned representations, and is further enhanced through reinforcement fine-tuning (GRPO) to improve cross-modal reasoning. Empirical results show that MASS-refined models outperform strong baselines by up to 8.7% and 6.0%, approaching the performance of closed-source state-of-the-art models on physics reasoning and comprehension and supporting the value of explicit spatiotemporal grounding for physics-aware video understanding. The work highlights the importance of reasoning over grounded cues and points to future directions in long-range motion grounding, scalable tracking, and richer training data to further reduce hallucinations in physics-centered video reasoning.

Abstract

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.
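
As a rough illustration of how depth and 2D tracking could combine into a 3D motion trace, the sketch below lifts tracked pixel positions into camera coordinates with a standard pinhole back-projection and then takes the first-to-last displacement. The abstract does not specify the camera model or the fusion step, so the pinhole assumption, the intrinsics (`fx`, `fy`, `cx`, `cy`), and all function names here are hypothetical and not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift one pixel (u, v) with metric depth to camera-frame 3D coordinates.

    Assumes an ideal pinhole camera; this is an illustrative choice only.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def motion_trace(track_2d: Sequence[Tuple[float, float]],
                 depths: Sequence[float],
                 fx: float, fy: float, cx: float, cy: float) -> List[Vec3]:
    """Combine a per-frame 2D track with per-frame depths into a 3D trace."""
    if len(track_2d) != len(depths):
        raise ValueError("track_2d and depths must have the same length")
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(track_2d, depths)]


def displacement(trace: Sequence[Vec3]) -> Vec3:
    """3D displacement vector from the first to the last position."""
    first, last = trace[0], trace[-1]
    return tuple(b - a for a, b in zip(first, last))


# Example with made-up values: an entity moves 100 px to the right at 2 m depth.
trace = motion_trace([(320.0, 240.0), (420.0, 240.0)], [2.0, 2.0],
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(displacement(trace))
```

In practice, depth would come from a monocular depth estimator and the 2D track from a point or box tracker, and both carry noise that a real pipeline would need to smooth.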

Paper Structure

This paper contains 22 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Physics-Centric Video Question Answering. Physics-aware video comprehension is challenging, as VLMs must capture fine-grained spatial–temporal cues and integrate them for higher-level reasoning. MASS introduces a motion-aware spatial–temporal grounding module that explicitly encodes object motions and scene dynamics into the language space. By enriching VLMs with structured spatial, temporal, and semantic signals, MASS significantly improves downstream reasoning, including motion and action understanding, physical-process inference, and abnormality detection (e.g., identifying the counterfactual upward motion of a basketball). MASS outperforms strong SoTA models such as GPT-4o and Gemini-2.5-Flash, demonstrating robust physics comprehension and reasoning across diverse tasks.
  • Figure 2: Data Exhibition of MASS-Bench. MASS-Bench provides two question types (factual and critical-thinking) to evaluate physics-driven video understanding. For each video–question–answer pair, we supply rich motion-grounding annotations, including temporal segmentation, entity-level visual grounding, temporal profiles across the full video, and motion attributes such as first/last positions and 3D displacement vectors. These structured spatial–temporal cues transform complex physics-related perception into interpretable representations that support more reliable physical reasoning. Additional dataset details are provided in the Appendix.
  • Figure 3: Overview of MASS. We use a model-agnostic approach to enhance visual recognition with explicit spatial and motion awareness. Beyond standard visual transformer encoders that process video inputs (e.g., LLaVA-OneVision [li2024llava], Qwen2.5-VL [bai2025qwen2]), we introduce a visual grounding module to strengthen correlations between queried entities and corresponding visual cues. Depth estimation captures spatial geometry, while motion tracking encodes temporal dynamics across frames. These spatial and temporal signals are fused into motion traces for each entity and tokenized with grounding and temporal features to align them with the language domain. During post-training, we freeze the spatial–temporal encoders and apply reinforcement fine-tuning (RFT) to improve the LLM backbone’s comprehension of the additional multimodal information.
  • Figure 4: Prompt template used for motion-aware video question answering. The template first serializes entity-level motion grounding (positions, motion vectors, bounding boxes, and frame ranges) into text, then injects this context into a chain-of-thought style prompt that guides the VLM to reason in <think> tags and output its final prediction in standardized <answer> tags.
  • Figure 5: Prompt template used for automatic evaluation of model answers against ground-truth references. The template presents the question, ground truth, and model output provided for LLM-as-a-judge evaluation and instructs the evaluator to output one of three outcomes—Correct, Incorrect, or Unclear—ensuring reliable and consistent scoring across predictions.
  • ...and 5 more figures
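
The prompt flow described in the Figure 4 caption (serialize entity-level motion grounding into text, inject it into a chain-of-thought prompt, and read the prediction from `<answer>` tags) might look roughly like the sketch below. The exact template wording is not reproduced in the caption, so the `EntityTrack` container, the field layout, the bounding-box format, and the prompt text are all hypothetical stand-ins rather than the authors' template.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class EntityTrack:
    """Hypothetical container for one entity's motion grounding."""
    name: str
    first_pos: Vec3                    # first 3D position
    last_pos: Vec3                     # last 3D position
    bbox: Tuple[int, int, int, int]    # assumed (x1, y1, x2, y2) in pixels
    frame_range: Tuple[int, int]       # assumed (start, end), inclusive

    @property
    def motion_vector(self) -> Vec3:
        # Displacement from first to last position, rounded for readability.
        return tuple(round(b - a, 3)
                     for a, b in zip(self.first_pos, self.last_pos))


def serialize_track(t: EntityTrack) -> str:
    """Turn one entity's grounding into a single line of plain text."""
    return (f"- {t.name}: frames {t.frame_range[0]}-{t.frame_range[1]}, "
            f"bbox {t.bbox}, first position {t.first_pos}, "
            f"last position {t.last_pos}, motion vector {t.motion_vector}")


def build_prompt(question: str, tracks: List[EntityTrack]) -> str:
    """Inject the serialized grounding into a chain-of-thought style prompt."""
    context = "\n".join(serialize_track(t) for t in tracks)
    return ("Motion grounding:\n" + context + "\n\n"
            f"Question: {question}\n"
            "Reason step by step inside <think></think>, then give the "
            "final answer inside <answer></answer>.")


ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(model_output: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, or None if absent."""
    match = ANSWER_RE.search(model_output)
    return match.group(1).strip() if match else None


ball = EntityTrack("basketball", (0.0, 1.0, 2.0), (0.0, 2.5, 2.0),
                   (10, 20, 50, 60), (0, 30))
print(build_prompt("Is the ball's motion physically plausible?", [ball]))
```

Returning `None` when no `<answer>` tag is found lets a downstream scorer treat malformed outputs separately from wrong answers, which fits the three-way Correct / Incorrect / Unclear judgment described for Figure 5.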