Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu

TL;DR

Diffusion-based video models frequently violate basic physics; this work introduces a training-free framework that first reasons about implausible physics with a physics-aware reasoning (PAR) module to generate targeted counterfactual prompts, then guides generation with Synchronized Decoupled Guidance (SDG) to suppress implausible content early and consistently. PAR enriches prompts with explicit physical context, while SDG uses synchronized directional normalization and trajectory-decoupled denoising to overcome lagged suppression and cumulative trajectory bias. Empirical results on PhyGenBench and VideoPhy across mechanics, fluids, optics, and thermodynamics show consistent improvements in physical plausibility while preserving photorealism, without retraining the diffusion models. Ablation studies confirm that both PAR and the two SDG designs are necessary and complementary, establishing a plug-and-play, inference-time physics-aware paradigm for video generation. Overall, the approach offers a scalable, training-free path to more physically plausible video synthesis applicable to diverse domains and backbones.

Abstract

Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the two SDG designs, synchronized directional normalization and trajectory-decoupled denoising, are each individually validated as critical to suppressing implausible content and to the overall gains in physical plausibility. This establishes a new plug-and-play physics-aware paradigm for video generation.
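
The abstract does not state the SDG update rule, so the following is only a generic negative-prompt guidance sketch of the idea of steering generation away from a counterfactual prompt. The function name `guided_noise`, the weights `w_user` and `w_cf`, and the norm-matching rescale are all assumptions made for illustration. The rescale is one possible reading of "directional normalization", not the paper's formula, and trajectory-decoupled denoising is not modeled here.

```python
import math
from typing import List

Vector = List[float]


def _norm(v: Vector) -> float:
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))


def guided_noise(
    eps_uncond: Vector,
    eps_user: Vector,
    eps_cf: Vector,
    w_user: float = 6.0,   # hypothetical weight toward the user prompt
    w_cf: float = 3.0,     # hypothetical weight away from the counterfactual prompt
    tiny: float = 1e-8,
) -> Vector:
    """Combine three noise estimates from one denoising step.

    eps_uncond: estimate with no prompt
    eps_user:   estimate conditioned on the user prompt
    eps_cf:     estimate conditioned on the counterfactual (physics-violating) prompt
    """
    # Guidance directions relative to the unconditional estimate.
    d_user = [u - b for u, b in zip(eps_user, eps_uncond)]
    d_cf = [c - b for c, b in zip(eps_cf, eps_uncond)]

    # ASSUMPTION: rescale the counterfactual direction to the magnitude of the
    # user direction, so the push away from it has comparable strength at
    # every step instead of depending on the raw norm of d_cf.
    scale = _norm(d_user) / (_norm(d_cf) + tiny)

    # Move toward the user prompt and away from the counterfactual prompt.
    return [
        b + w_user * du - w_cf * scale * dc
        for b, du, dc in zip(eps_uncond, d_user, d_cf)
    ]


# Toy two-dimensional "noise estimates" standing in for model outputs.
combined = guided_noise([0.0, 0.0], [1.0, 0.0], [0.0, 2.0])
```

In a real sampler the three estimates would come from the video diffusion backbone at each timestep, and the combined estimate would be passed to the scheduler in place of the usual classifier-free guidance output.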

Paper Structure

This paper contains 33 sections, 13 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Overall framework. Left: Physics-Aware Reasoning (PAR). Given a user prompt, an LLM identifies entities, interactions, and scene conditions to produce a structured analysis of the underlying physical process. Based on this reasoning, it constructs counterfactual prompts that preserve the same entities and scenes but deliberately violate the governing physical law, yielding targeted physics-aware negatives. Right: Synchronized Decoupled Guidance (SDG). During denoising, we evolve two branches conditioned on the user prompt and the counterfactual prompt, respectively. Their noise estimates are combined with directional normalization and trajectory decoupling, ensuring that implausible structures are suppressed immediately and consistently throughout generation.
  • Figure 2: Qualitative comparison with Wan2.1. Prompt: "A vibrant, elastic tennis ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact." Baseline: The tennis ball’s motion is inconsistent with gravity-driven dynamics, with limited deformation on impact and abrupt transitions across frames. The bounce lacks elasticity. Ours: Our result shows a more natural downward trajectory, visible compression upon impact, and a smoother rebound trajectory, yielding a closer match to expected mechanics.
  • Figure 3: Qualitative comparison with CogVideoX. Prompt: "A yellow highlighter is used to mark on the rough, brown surface of a cardboard, showcasing the interaction between the highlighter and the cardboard surface." Baseline: Generates inconsistent strokes, with the yellow mark appearing flat and disconnected from the cardboard’s texture. The contact point with the highlighter is visually unconvincing. Ours: Produces a stroke that properly adheres to the surface, with the ink visibly blending with the cardboard texture. The highlighter-surface interaction is sharper and more consistent.
  • Figure 4: Qualitative comparison with Wan2.1. Prompt: "A silver spoon is slowly inserted into a glass of crystal-clear water, revealing the fascinating visual changes and reflections as the spoon interacts with the liquid." Baseline: The generated sequence struggles to capture realistic refraction and liquid interaction. The spoon appears disconnected from the water surface, and the reflections lack physical plausibility. Ours: Our method produces a coherent depiction of the spoon entering the water, with realistic ripples, refraction, and surface reflections. This creates a more physically faithful impression of object-fluid interaction.
  • Figure 5: Qualitative comparison with Wan2.1. Prompt: "A timelapse captures the transformation of water in a pot as the temperature rapidly rises above 100°C." Baseline: The sequence unrealistically depicts explosive splashes, ignoring the gradual bubbling and vapor release expected from water heating above 100°C. Ours: Our method captures progressive bubbling and the formation of rising vapor clouds, consistent with the condensation process. This produces a more physically plausible thermal interaction.
  • ...and 9 more figures
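
The PAR stage described in the Figure 1 caption (identify entities, interactions, and scene conditions, then ask for a prompt that keeps the scene but violates the governing law) could be organized roughly as below. This is a minimal sketch under stated assumptions: the `PhysicsAnalysis` fields mirror the caption, but the class, the `counterfactual_request` helper, and the wording of the instruction are hypothetical, and the analysis values filled in for the Figure 2 tennis-ball prompt are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhysicsAnalysis:
    """Hypothetical container for the structured analysis an LLM might return."""
    entities: List[str]
    interactions: List[str]
    scene_conditions: List[str]
    governing_law: str


def counterfactual_request(user_prompt: str, analysis: PhysicsAnalysis) -> str:
    """Build an instruction asking an LLM for a physics-violating variant.

    The instruction text is an assumption for illustration; the paper's
    actual prompting template is not given in this section.
    """
    return (
        f"Original prompt: {user_prompt}\n"
        f"Entities: {', '.join(analysis.entities)}\n"
        f"Interactions: {', '.join(analysis.interactions)}\n"
        f"Scene conditions: {', '.join(analysis.scene_conditions)}\n"
        f"Governing law: {analysis.governing_law}\n"
        "Task: rewrite the prompt so that the same entities and scene are kept "
        "but the governing law is clearly violated."
    )


# Illustrative analysis for the tennis-ball prompt quoted in Figure 2.
analysis = PhysicsAnalysis(
    entities=["tennis ball", "ground"],
    interactions=["ball is thrown downward", "ball hits the surface"],
    scene_conditions=["hard ground"],
    governing_law="gravity and elastic collision",
)
request = counterfactual_request(
    "A vibrant, elastic tennis ball is thrown forcefully towards the ground.",
    analysis,
)
```

The resulting string would be sent to whichever LLM is used for reasoning, and its answer would serve as the counterfactual prompt for the guidance stage.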