PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection

Peiyao Wang, Weining Wang, Qi Li

TL;DR

PhysCorr targets the core gap in text-to-video generation: physical impossibilities in dynamic scenes. It introduces PhysicsRM, a lightweight, dual-component reward model combining subject-consistency and mechanics verification, and PhyDPO, a physics-aware, reweighted Direct Preference Optimization that prioritizes high-impact physical corrections. By distilling knowledge to a compact reward model and employing a score-driven reweighting scheme, PhysCorr improves physical plausibility across leading backbones without sacrificing semantic alignment or visual quality. Extensive experiments on VBench and VBench2 demonstrate broad improvements in temporal stability, interaction fidelity, and motion realism, establishing a practical pathway toward trustworthy physics-aware video generation.

Abstract

Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility, manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.

Paper Structure

This paper contains 20 sections, 12 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: The videos generated by VideoCrafter2 using (a) "Big waves splashing on rocky cliffs" and (b) "Preparing meat for barbecue".
  • Figure 2: PhysCorr pipeline. (a) PhysicsRM integrates a subject-consistency module and a mechanics module to quantify physical plausibility (bottom). For each prompt $p$, we generate $N$ videos using the target video diffusion model and compute their PhysicsRM-derived PhyScores. The highest-scoring video (physically plausible) and the lowest-scoring video (physically implausible) form a preference pair for training. (b) During PhyDPO training, preference pairs are reweighted based on their PhyScore difference. Pairs with a larger PhyScore difference (highlighting severe physical errors) receive higher weights, forcing the model to prioritize correcting egregious physical inaccuracies.
  • Figure 3: Analysis of PhyScore. The histogram of PhyScore (left) and the histogram of the difference in PhyScore between the best and the worst samples in a preference pair (right), showing significant sample differences, which are beneficial for training.
  • Figure 4: Comparison of key metrics before and after PhysCorr on VBench and VBench2 for VideoCrafter2 and Wan2.1. We divide all metrics into two categories. Technical Fidelity Metrics (left) evaluate the low-level execution quality of generated videos, focusing on stability, perceptual accuracy, and localized consistency. Semantic Coherence Metrics (right) assess high-level semantic logic and narrative integrity.
  • Figure 5: The impact of $\alpha$ on five key metrics of VBench and VBench2: aesthetic quality, mechanics, thermotics, imaging quality, and scene.
  • ...and 1 more figure
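
The pair selection and reweighting described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption only says that pairs with a larger PhyScore difference receive higher weights, so the linear weight `1 + gain * gap`, the `gain` and `beta` parameters, and the names `select_pair`, `pair_weight`, and `weighted_dpo_loss` are all assumptions introduced here. The `margin` argument stands in for the usual DPO implicit-reward margin between the chosen and rejected sample, however a given backbone computes it.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: int    # index of the highest-PhyScore video
    rejected: int  # index of the lowest-PhyScore video
    gap: float     # PhyScore(chosen) - PhyScore(rejected), always >= 0


def select_pair(prompt: str, phy_scores: Sequence[float]) -> PreferencePair:
    """Pick the best and worst of N candidate videos for one prompt."""
    if len(phy_scores) < 2:
        raise ValueError("need at least two candidate videos per prompt")
    indices = range(len(phy_scores))
    chosen = max(indices, key=phy_scores.__getitem__)
    rejected = min(indices, key=phy_scores.__getitem__)
    # A zero gap means the candidates are indistinguishable by PhyScore;
    # callers may want to drop such pairs.
    gap = phy_scores[chosen] - phy_scores[rejected]
    return PreferencePair(prompt, chosen, rejected, gap)


def pair_weight(gap: float, gain: float = 1.0) -> float:
    """Hypothetical weight: grows with the PhyScore gap (assumed linear form)."""
    return 1.0 + gain * gap


def weighted_dpo_loss(margin: float, gap: float,
                      beta: float = 0.1, gain: float = 1.0) -> float:
    """Standard DPO term -log(sigmoid(beta * margin)), scaled by the pair weight."""
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(-z)).
    nll = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return pair_weight(gap, gain) * nll


if __name__ == "__main__":
    pair = select_pair("Big waves splashing on rocky cliffs", [0.2, 0.9, 0.5])
    print(pair.chosen, pair.rejected)
    print(weighted_dpo_loss(margin=0.0, gap=pair.gap))
```

With a fixed margin, the loss for a pair scales with its PhyScore gap, so gradient updates are dominated by the pairs where the rejected video is most physically implausible relative to the chosen one. Any monotone increasing weight would have the same qualitative effect; the linear form is used here only for simplicity.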