MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu, Jin Gao

Abstract

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
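The two zero-initialized control links mentioned above are analogous to ControlNet-style connections: cross-branch projections that start as no-ops and gradually learn to exchange pixel-aligned information between the RGB and perception branches. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name `ZeroInitControlLink`, the feature shapes, and the choice of a linear projection are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ZeroInitControlLink(nn.Module):
    """Hypothetical sketch of one zero-initialized control link.

    Features from a source branch (e.g., perception) are projected and
    added to the target branch (e.g., RGB). Because the projection starts
    at zero, the link is a no-op at initialization and only gradually
    injects cross-modal information as training proceeds.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init: no effect at step 0
        nn.init.zeros_(self.proj.bias)

    def forward(self, target_feat: torch.Tensor, source_feat: torch.Tensor) -> torch.Tensor:
        return target_feat + self.proj(source_feat)


# Bidirectional use between the two parallel branches (shapes are assumed):
rgb_feat = torch.randn(2, 1024, 768)      # (batch, tokens, channels)
percep_feat = torch.randn(2, 1024, 768)

percep_to_rgb = ZeroInitControlLink(768)  # perception branch -> RGB branch
rgb_to_percep = ZeroInitControlLink(768)  # RGB branch -> perception branch

rgb_out = percep_to_rgb(rgb_feat, percep_feat)
percep_out = rgb_to_percep(percep_feat, rgb_feat)
```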

Paper Structure

This paper contains 30 sections, 8 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Overall framework of MMPhysVideo. Left: Our two-stage training framework, which first trains a teacher model with parallel branches for joint modeling and then distills a single-stream student model through representation alignment. Right: Our data engine, MMPhysPipe, for physics data curation and multimodal annotation.
  • Figure 2: Architecture comparison. Left: Channel-wise fusion used in [ommivdiff, videojam]. Middle: Spatial-wise fusion used in [unityvideo, 4dnex]. Right: Our decoupled design with pixel-wise fusion.
  • Figure 3: Overview of MMPhysVideo. Stage I: A dual-stream teacher model with parallel branches is first trained to handle RGB and perception modalities concurrently. Then, we use bidirectional control links to enable pixel-wise alignment. Stage II: For inference efficiency, we distill a single-stream student model through representation alignment.
  • Figure 4: Overview of MMPhysPipe. We employ a VLM, Qwen3-VL [qwen3vl], to curate videos with rich physical interactions and to generate physical subject descriptions following our chain-of-visual-evidence (CoVE) rule. Subsequently, expert perception models [sam3, vggt, spatialtrackerv2] are leveraged to produce multi-granular annotations (a minimal sketch of this annotation loop follows the figure list).
  • Figure 5: Qualitative results. We compare MMPhysVideo with its backbones, CogVideoX (Cog) and Wan2.1 (Wan), and with the advanced physics-aware method VideoREPA (VR).
  • ...and 14 more figures
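
Figure 4 describes MMPhysPipe as a two-step loop: a VLM curates clips and identifies physical subjects via the chain-of-visual-evidence rule, then expert perception models produce semantic, geometric, and trajectory annotations for those subjects. The outline below is a hypothetical sketch of that loop; every helper function is a placeholder stub standing in for the VLM and the expert models, not a real API.

```python
# Hypothetical outline of an MMPhysPipe-style annotation loop.
# All helpers below are placeholder stubs, not real APIs: in the paper they
# correspond to a VLM (curation + chain-of-visual-evidence subject
# descriptions) and expert perception models (segmentation, geometry,
# point tracking).

def curate_with_vlm(video):
    """Placeholder: decide whether the clip shows rich physical interaction
    and describe its physical subjects via chain-of-visual-evidence."""
    return True, ["placeholder subject"]

def annotate_semantics(video, subjects):
    return {"masks": None}   # placeholder for segmentation masks

def annotate_geometry(video):
    return {"depth": None}   # placeholder for depth / point maps

def annotate_trajectory(video, subjects):
    return {"tracks": None}  # placeholder for point trajectories

def build_physics_dataset(videos):
    dataset = []
    for video in videos:
        # 1) VLM curation and chain-of-visual-evidence subject description.
        keep, subjects = curate_with_vlm(video)
        if not keep:
            continue
        # 2) Multi-granular perceptual annotations from expert models.
        dataset.append({
            "video": video,
            "subjects": subjects,
            "semantics": annotate_semantics(video, subjects),
            "geometry": annotate_geometry(video),
            "trajectory": annotate_trajectory(video, subjects),
        })
    return dataset

print(build_physics_dataset(["clip_0"]))
```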