
DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim

Abstract

Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.


Paper Structure

This paper contains 37 sections, 1 theorem, 13 equations, 8 figures, 8 tables.

Key Result

Proposition 1

Let $\boldsymbol{\delta} = \mathbf{u}^{+} - \mathbf{u}^{-}$ denote the velocity gap between positive and negative targets. The inner product of the flow-matching and contrastive gradient directions satisfies

Figures (8)

  • Figure 1: Comparison of zero-shot Wan-2.1-1.3B against the same model trained using DiReCT on a few prompts from VideoPhy. In the top example, the baseline drives the car backward, violating forward kinematics, while DiReCT produces a consistent forward trajectory. In the middle example, the baseline's surfer progressively merges with the sail rig, losing bodily structure, whereas DiReCT maintains the surfer's integrity and mass throughout. In the bottom example, the baseline keeps the wood stationary and upright despite the flowing current, while DiReCT generates realistic downstream motion consistent with buoyancy and flow dynamics.
  • Figure 2: Overview of the DiReCT framework. The contrastive objective is decomposed into two complementary scales: a macro-contrastive term draws random negatives from semantically distant clusters (MaNS), providing clean global trajectory separation, while a micro-contrastive term uses physics-perturbed hard negatives (MiNS) that share scene semantics but violate a targeted physical dimension, enabling fine-grained physics discrimination.
  • Figure 3: Macro-contrastive negative sampling (MaNS). Prompts are encoded via the text encoder of the video generative model, globally pooled, and partitioned into $K$ semantic regions. For each positive, negatives are drawn exclusively from different partitions, ensuring minimal velocity-field overlap and preventing the gradient conflict identified in Proposition 1 (see the appendix on gradient conflict).
  • Figure 4: Micro-contrastive hard negative generation (MiNS). For each anchor prompt, an LLM (Qwen2.5-7B-Instruct) perturbs a single physics dimension while preserving scene semantics. The perturbed prompt is rendered by the base model to produce a hard negative video whose velocity trajectory is physically inconsistent with the anchor.
  • Figure 5: Comparison of DiReCT with CogVideoX-2B and Allegro on a prompt from VideoPhy (top) and one from WorldModelBench (bottom).
  • ...and 3 more figures
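The MaNS procedure in Figure 3 (pool prompt embeddings, partition into $K$ semantic regions, draw negatives only from other partitions) can be sketched as below. The k-means partitioner and the toy embeddings are assumptions for illustration; the paper uses the video model's own text encoder and does not specify the clustering algorithm here.

```python
import numpy as np

rng = np.random.default_rng(1)

def partition_prompts(embeddings, k, iters=20):
    """Toy k-means over globally pooled prompt embeddings (the partitioning
    method is assumed; any clustering into K semantic regions would do)."""
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    return labels

def sample_macro_negative(labels, anchor_idx):
    """Partition-exclusive sampling: the negative never shares the
    anchor's semantic region, keeping velocity-field overlap minimal."""
    pool = np.flatnonzero(labels != labels[anchor_idx])
    return int(rng.choice(pool))

# Toy corpus: three well-separated semantic clusters in embedding space.
emb = np.concatenate([rng.normal(c, 0.1, size=(50, 8)) for c in (0.0, 5.0, 10.0)])
labels = partition_prompts(emb, k=3)
neg = sample_macro_negative(labels, anchor_idx=0)
print(labels[0], labels[neg])  # the two labels always differ
```

Restricting negatives to other partitions is what makes the macro term "interference-free": by construction, the sampled negative's condition lies far from the anchor in embedding space.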

Theorems & Definitions (3)

  • Proposition 1: Gradient conflict under semantic proximity
  • Proof
  • Remark 1: Why physics-concentrated $\boldsymbol{\delta}$ avoids conflict