Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models

Nanxi Li, Xiang Wang, Yuanjie Chen, Haode Zhang, Hong Li, Yong-Lu Li

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite this progress, current MLLMs still struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, and reveal substantial limitations in how current MLLMs understand the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving gains of up to 20.7% on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising, cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.

Paper Structure

This paper contains 23 sections, 6 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Existing benchmarks entangle multiple capabilities, leading to poor performance in SOTA MLLMs. To address this, we introduce two low-level tasks to assess intuitive physics understanding: Next Frame Selection and Temporal Coherence Verification. Our proposed Scene Dynamic Field (SDF) directly enhances MLLMs' dynamic understanding and shows strong generalization.
  • Figure 2: Illustration of our Scene Dynamic Field (SDF).
  • Figure 3: Our multitask framework integrates low-level tasks, a dynamic perception task, and an SDF-guided CoT reasoning task.
  • Figure 4: Performance of our SDF method across various evaluation scenarios. (A) shows results on the Fluid dataset for both NFS and TCV tasks. (B) and (C) present transfer results to cloth, smoke, and other particle-based objects on Qwen2-VL and GLM4.1V, respectively.
  • Figure 5: Stride ablation study on the NFS benchmark performance for Qwen2.5-VL and InternVL2.5.
  • ...and 9 more figures