Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei

TL;DR

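This framework treats physically plausible video generation as producing a sequence of causally connected events. Its first module decomposes the physical phenomena described in prompts into multiple elementary event units via chain-of-thought reasoning, embedding physical formulas as constraints to impose deterministic causal dependencies during reasoning. Its second module converts those event units into temporally aligned vision-language prompts.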

Abstract

Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage the commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning (PECR). This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events through interactive editing. Comprehensive experiments on the PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
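
To make the idea of a formula-constrained event chain concrete, here is a minimal Python sketch. It is not the authors' implementation, and their code has not been released. The `Event` class, the `build_event_chain` function, and the choice of a density comparison (Archimedes' principle) as the constraint are all assumptions made for illustration. The sketch only shows how a physical formula can decide which causally ordered events are valid, so the chain is deterministic rather than ambiguous.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One elementary event: a text description plus measurable parameters."""
    description: str
    params: dict = field(default_factory=dict)


def build_event_chain(obj: str, rho_obj: float, rho_fluid: float) -> list[Event]:
    """Return a causally ordered event list for dropping `obj` into a fluid.

    Hypothetical helper: the density comparison (Archimedes' principle)
    acts as the deterministic constraint that selects the valid branch.
    Densities are in kg/m^3.
    """
    densities = {"rho_obj": rho_obj, "rho_fluid": rho_fluid}
    chain = [
        Event(f"The {obj} is released above the liquid surface.", densities),
        Event(f"The {obj} hits the surface and displaces liquid.", densities),
    ]
    if rho_obj > rho_fluid:
        # Weight exceeds the maximum buoyant force, so the object must sink.
        chain.append(Event(f"The {obj} sinks steadily through the liquid.", densities))
        chain.append(Event(f"The {obj} comes to rest on the bottom.", densities))
    else:
        # Buoyancy balances weight; submerged fraction = rho_obj / rho_fluid.
        # Equal densities are treated as floating here for simplicity.
        submerged = rho_obj / rho_fluid
        chain.append(
            Event(
                f"The {obj} bobs and settles, floating in the liquid.",
                {**densities, "submerged_fraction": submerged},
            )
        )
    return chain


if __name__ == "__main__":
    # Glass (about 2500 kg/m^3) in water (about 1000 kg/m^3).
    for step, event in enumerate(build_event_chain("glass ball", 2500.0, 1000.0), 1):
        print(step, event.description)
```

In this sketch each event carries both a description and numeric parameters, loosely mirroring the "semantic descriptions and measurable physical parameters" mentioned in the Figure 2 caption. A real system would derive the events with a language model rather than hard-coded branches.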

Paper Structure (12 sections, 11 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Overview of our physically plausible video generation framework. We first decompose complex physical phenomena into a sequence of elementary events guided by physical formulas (Sec. 3.1), and then map the logically ordered events to a holistic description and a set of keyframes, both of which are causally coherent (Sec. 3.2). Our inferred vision-language prompts enable off-the-shelf diffusion frameworks to generate videos capturing the causal progression of physical phenomena.
  • Figure 2: Overview of our PECR module (Sec. 3.1). This module conceptualizes physical phenomena in user-provided descriptions as a series of causally ordered events governed by real-world physical formulas, where each event encompasses semantic descriptions and measurable physical parameters for key objects. This characterizes the underlying scene changes induced by such phenomena.
  • Figure 3: Overview of our TCP module (Sec. 3.2). This module generates semantic and visual prompts for each event based on its metadata. Semantic prompts are inferred by our proposed progressive narrative revision and serve as guidance during the denoising steps. Visual prompts are obtained by our proposed interactive keyframe synthesis and replace the original noise to provide physics-aware priors.
  • Figure 4: Visualization of physics-aware video generation results across four physical domains. Compared with the baseline CogVideo-5B [yang2024cogvideox], our approach yields causally coherent progressions of physical phenomena, e.g., the glass ball sinks, the bottom shadow extends in the direction of the light, the ice melts gradually, and fire spreads through the paper. All prompts are sourced from PhyGenBench [meng2024towards].
  • Figure 5: Visualization of physics-aware video generation results across various physical interactions between objects. Compared with the baseline CogVideo-5B [yang2024cogvideox], our approach demonstrates clearer causal progression, e.g., butter spreads along the knife's movement, honey flows in continuously with a rising level, and the spring compresses monotonically. All prompts are sourced from VideoPhy [bansal2024videophy].
  • ...and 2 more figures