A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction

Yongfan Chen, Xiuwen Zhu, Tianyu Li

TL;DR

This work addresses the need for evaluating physical coherence in video generation by introducing PhyCoBench, a benchmark with 120 prompts across seven physical principles, paired with human rankings of four state-of-the-art T2V models. It proposes PhyCoPredictor, a two-stage latent diffusion framework guided by optical flow to predict future motion and frames, enabling automatic evaluation via flow and video consistency. Quantitative and qualitative results show that PhyCoPredictor rankings align with human judgments and improve over baselines, demonstrating its utility for benchmarking and guiding physical-coherence improvements in video generation. The authors release PhyCoBench prompts, PhyCoPredictor, and associated data on GitHub to facilitate ongoing research in physically plausible video synthesis.

Abstract

Recent advances in video generation models demonstrate their potential as world simulators, but the videos they produce often violate physical laws, a key concern overlooked by most text-to-video (T2V) benchmarks. We introduce PhyCoBench, a benchmark designed specifically to assess the physical coherence of generated videos. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We evaluated four state-of-the-art (SoTA) T2V models on PhyCoBench and conducted manual assessments. Additionally, we propose an automated evaluation model, PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascaded manner. A consistency evaluation comparing automated and manual rankings shows that PhyCoPredictor currently aligns most closely with human evaluation. It can therefore effectively evaluate the physical coherence of videos, providing insights for future model optimization. Our benchmark, including physical coherence prompts, the automatic evaluation tool PhyCoPredictor, and the generated video dataset, has been released on GitHub at https://github.com/Jeckinchen/PhyCoBench.
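
The consistency evaluation described in the abstract compares a ranking of models produced by an automated metric with a ranking produced by human annotators. As a rough illustration, the sketch below computes two common rank-agreement measures, Spearman's rho and Kendall's tau, for rankings without ties. Which coefficient the authors actually report is not stated in this excerpt, and the model names and rank values are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings given as {name: rank} dicts."""
    n = len(rank_a)
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings given as {name: rank} dicts."""
    concordant = discordant = 0
    for x, y in combinations(sorted(rank_a), 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of four T2V models (1 = most physically coherent).
human_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
auto_rank = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}

print(f"Spearman rho: {spearman_rho(human_rank, auto_rank):.3f}")
print(f"Kendall tau:  {kendall_tau(human_rank, auto_rank):.3f}")
```

With only four models, a single swapped pair already moves both coefficients noticeably, so agreement is more informative when it is also computed per prompt or per physical category rather than only on the overall ranking.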

Paper Structure

This paper contains 28 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of Our Benchmark. We propose PhyCoBench—a benchmark specifically designed to evaluate text-to-video (T2V) models in generating physically coherent videos. We categorize common physical scenarios into seven types and create a comprehensive set of prompts. With these prompts, we generate test set videos with four T2V models and conduct human rankings. We also introduce PhyCoPredictor, an optical flow-guided frame prediction model designed for automatic evaluation. Using the first frame and corresponding prompt of each video in the test set as input, we generate reference optical flow and videos with PhyCoPredictor. These are compared with the test set videos and their computed optical flows, producing scores to rank model performance. Correlation analysis shows that our automated evaluation results are closely aligned with human preferences.
  • Figure 2: The proportion of text prompts. Our prompts are grouped into seven types.
  • Figure 3: Generated video examples of T2V models. The videos generated by these four models do not consistently adhere to physical coherence, with varying levels of quality.
  • Figure 4: Overall ranking result from manual evaluation.
  • Figure 5: Category-specific ranking results from manual evaluation.
  • ...and 3 more figures
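
The Figure 1 caption describes comparing a reference optical flow and reference video, predicted from the first frame and the prompt, against each test video and its computed flow, and then turning the comparison into scores that rank the models. The sketch below shows one possible shape for that comparison step. It assumes mean endpoint error for flows and mean squared error for frames, which are stand-ins chosen for illustration; the paper's actual scoring formulas are not given in this excerpt. All function names, array shapes, and values are hypothetical.

```python
import numpy as np


def endpoint_error(flow_ref, flow_test):
    """Mean Euclidean distance between two flow fields of shape (T, H, W, 2)."""
    return float(np.linalg.norm(flow_ref - flow_test, axis=-1).mean())


def frame_mse(video_ref, video_test):
    """Mean squared pixel error between two videos of identical shape."""
    diff = video_ref.astype(np.float64) - video_test.astype(np.float64)
    return float(np.mean(diff ** 2))


def rank_models(mean_errors):
    """Return model names ordered best-first (lowest mean error first)."""
    return sorted(mean_errors, key=mean_errors.get)


# Synthetic stand-ins for a reference prediction and one generated test video.
flow_ref = np.zeros((2, 4, 4, 2))
flow_test = np.broadcast_to(np.array([3.0, 4.0]), (2, 4, 4, 2))
video_ref = np.zeros((3, 4, 4, 3))
video_test = np.full((3, 4, 4, 3), 2.0)

print("flow error: ", endpoint_error(flow_ref, flow_test))
print("frame error:", frame_mse(video_ref, video_test))

# Hypothetical per-model errors averaged over all prompts.
print(rank_models({"model_a": 0.9, "model_b": 0.4, "model_c": 0.6}))
```

Keeping the flow term and the frame term separate makes it possible to see whether a model loses rank because of implausible motion or because of appearance drift.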