Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Haoran Lu, Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

TL;DR

Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance.

Abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

Paper Structure (113 sections, 39 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 2: Overview of the Phys4D simulation framework. (a) Left: a large-scale asynchronous data collection pipeline. (b) Right: supported physical objects and phenomena, including rigid and articulated structures, garments, fluids, thermodynamics, deformables, inflatables, ropes, and granular materials.
  • Figure 3: Overview of the Phys4D training pipeline. Our three-stage paradigm progressively injects physics into a pretrained video diffusion model. Stage 1 (blue): The DiT backbone is frozen while depth and motion heads are trained on pseudo-labeled RGB videos. Stage 2 (green): The backbone is adapted via LoRA using physics simulation data with ground-truth annotations; a warp consistency loss couples depth and motion predictions. Stage 3 (orange): Generated RGB-D-motion outputs are lifted to 4D point clouds and compared against simulator ground-truth via 4D Chamfer Distance; PPO optimization uses this reward to correct residual physical violations. The output is a generated video with temporally consistent depth maps and motion fields.
  • Figure 4: Qualitative comparison on Physics-IQ scenarios. We compare the Wan2.2-5b baseline (middle) with Wan2.2-5b + Phys4D (right) across three physical interaction types: object placement on a rotating platform (top), ball rolling dynamics (middle), and fluid pouring (bottom). Phys4D produces more consistent object geometry, more physically plausible motion, and more stable temporal dynamics than the baseline, which exhibits shape distortion and incoherent physical behavior.
  • Figure 5: Qualitative Results on 4D Experiment. From left to right: the ground-truth 4D point cloud and the generated 4D point cloud at $1/4$ of the sequence; the generated 4D point cloud at $1/2$ of the sequence; the novel-time generated 4D point cloud at $3/4$ of the sequence; and the generated 4D point cloud at the final frame. This visualization highlights the model’s ability to maintain coherent geometry and object motion over long horizons, as well as to interpolate consistent 4D structure at unseen timestamps.
  • Figure 6: Qualitative result of a Phys4D-generated scene.
  • ...and 1 more figure
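
The Figure 3 caption says that generated RGB-D-motion outputs are lifted to 4D point clouds and scored against simulator ground truth with a 4D Chamfer Distance, which then serves as the PPO reward. The exact formulation is not given in this excerpt, so the following is only a minimal NumPy sketch of one plausible reading: each depth frame is unprojected with a pinhole camera model, a symmetric Chamfer distance on squared Euclidean distances is computed per frame, and the result is averaged over time. The function names, the `(fx, fy, cx, cy)` intrinsics tuple, the per-frame averaging, and the negated-distance reward are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud (pinhole model, assumed)."""
    h, w = depth.shape
    # v indexes rows (image y), u indexes columns (image x).
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Uses squared Euclidean distances; brute force, so only suitable for small clouds.
    """
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def chamfer_4d(pred_depths, gt_depths, intrinsics):
    """Average the per-frame Chamfer distance over depth sequences of shape (T, H, W).

    Treating "4D" as a time average of per-frame 3D distances is an assumption.
    """
    fx, fy, cx, cy = intrinsics
    per_frame = [
        chamfer_distance(
            unproject_depth(d_pred, fx, fy, cx, cy),
            unproject_depth(d_gt, fx, fy, cx, cy),
        )
        for d_pred, d_gt in zip(pred_depths, gt_depths)
    ]
    return float(np.mean(per_frame))


def reward_from_chamfer(cd):
    """Hypothetical reward shaping: lower distance gives higher reward."""
    return -cd
```

As a usage example, `chamfer_4d(pred, gt, (fx, fy, cx, cy))` takes two arrays of shape `(T, H, W)` and returns a single non-negative scalar that is zero when the two sequences are identical. The brute-force pairwise matrix costs O(NM) memory per frame, so a practical version would likely subsample points or use a nearest-neighbor structure.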