VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng

TL;DR

The paper addresses the gap in physical plausibility for text-to-video generation by transferring physics understanding from video foundation models to diffusion-based video generators. It introduces Token Relation Distillation (TRD), a relational alignment objective that distills spatio-temporal token relationships to guide generation without explicit physics simulators. Empirical results on the VideoPhy and VideoPhy2 benchmarks show substantial gains over baselines such as CogVideoX and WISA, including generalization to open-domain data. The approach offers a practical pathway to more physically coherent video synthesis and motivates further exploration of physics-informed pre-training for diffusion models.

Abstract

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvements on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
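
To make the idea of aligning token-level relations more concrete, the following is a minimal sketch of a relational alignment loss. It is not the paper's implementation: the function names (`pairwise_cosine`, `token_relation_loss`), the use of cosine similarity, and the mean absolute difference are all assumptions chosen for illustration, and the exact TRD formulation may differ. The sketch compares the pairwise similarity matrix of student tokens (standing in for intermediate T2V features) with that of teacher tokens (standing in for video SSL encoder features).

```python
import numpy as np


def pairwise_cosine(tokens: np.ndarray) -> np.ndarray:
    """Return the (N, N) cosine-similarity matrix of an (N, D) token array."""
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-8)  # guard against zero-norm tokens
    return unit @ unit.T


def token_relation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Hypothetical relational alignment loss (illustrative only).

    student: (F, P, D_s) features, F latent frames with P tokens each.
    teacher: (F, P, D_t) features on the same token grid.

    Flattening to F * P tokens means one similarity matrix covers both
    spatial pairs (same frame) and temporal pairs (different frames).
    """
    frames, patches, _ = student.shape
    assert teacher.shape[:2] == (frames, patches), "token grids must match"
    s_rel = pairwise_cosine(student.reshape(frames * patches, -1))
    t_rel = pairwise_cosine(teacher.reshape(frames * patches, -1))
    # Assumption: mean absolute difference between the two relation matrices.
    return float(np.abs(s_rel - t_rel).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student_feats = rng.normal(size=(4, 6, 32))  # toy shapes, not from the paper
    teacher_feats = rng.normal(size=(4, 6, 64))
    print(token_relation_loss(student_feats, teacher_feats))
```

Because only the similarity matrices are compared, the student and teacher feature dimensions do not need to match in this sketch, and any transformation of the student features that preserves pairwise similarities leaves the loss unchanged. This is one way to read the "soft guidance" described above: the loss constrains relations between tokens rather than the token features themselves.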

Paper Structure

This paper contains 19 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Left: Visual comparison of video generation results from CogVideoX (baseline), CogVideoX finetuned with REPA (Yu et al., 2024), and our proposed VideoREPA. Red rectangles mark phenomena that violate physical commonsense. Our VideoREPA generates videos that most closely adhere to real-world physical laws. Right: Evaluation of physics understanding on the Object Contact Prediction (OCP) task within the Physion benchmark. The plots illustrate a significant gap in physics understanding between the SSL video encoder VideoMAEv2 and the T2V model CogVideoX. The proposed VideoREPA substantially narrows this understanding gap.
  • Figure 2: Overview of VideoREPA. Our VideoREPA enhances physics in T2V models by distilling physics knowledge from pre-trained SSL video encoders. We apply Token Relation Distillation (TRD) loss to align pairwise token similarities between video SSL representations and intermediate features in diffusion transformer blocks. Within each representation, tokens form spatial relations with other tokens in the same latent frame and temporal relations with tokens in other latent frames.
  • Figure 3: Qualitative comparison of HunyuanVideo (HY), CogVideoX (Cog), and VideoREPA (Ours), showing the enhanced physics commonsense of VideoREPA.
  • Figure 4: Ablation on REPA loss.
  • Figure 5: Left: Effect of alignment depth. Right: Effect of $\lambda$. PC score on VideoPhy is reported.
  • ...and 5 more figures