Exploring the Evolution of Physics Cognition in Video Generation: A Survey

Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang

TL;DR

This survey addresses the gap in physically faithful video generation by proposing a cognitive-science–inspired three-tier taxonomy that spans from basic schema perception to passive cognition of physical knowledge and active cognition for world simulation. It catalogs architectural designs, physics simulators, and LLM-enabled reasoning across 2D/3D/4D video tasks, emphasizing interpretable, controllable, and physically consistent generation. Key contributions include a structured evolutionary framework, comprehensive coverage of benchmarks and metrics, and a forward-looking discussion of large physics foundation models, multi-sensor integration, and efficient physics-aware world modeling. The work aims to move video generation from mere visual realism toward human-like physical understanding and reasoning, with implications for robotics, autonomous systems, and AGI development.

Abstract

Video generation has advanced significantly in recent years, driven especially by the rapid development of diffusion models. Even so, the deficiencies of these models in physical cognition have drawn growing attention: generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers have increasingly recognized the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenarios. Given the lack of a systematic overview of this field, this survey provides a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective and propose a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. We then highlight the key challenges inherent in this domain and outline potential pathways for future research, with the aim of advancing discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".

Paper Structure

This paper contains 36 sections, 7 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Video cases generated by three typical state-of-the-art generative video models (Sora; CogVideoX, Yang et al., 2024; Cosmos, Agarwal et al., 2025). These advanced models still struggle to produce satisfying videos that strictly conform to physical laws.
  • Figure 2: Cognitive evolution processes of individuals and generation systems.
  • Figure 3: The taxonomy of the PhysGenBench benchmark (Meng et al., 2024), including 4 categories of physical commonsense and 27 physical laws.
  • Figure 4: Introduction to mainstream generative models: GANs (Goodfellow et al., 2020), Diffusion Models (Ho et al., 2020), NeRF (Mildenhall et al., 2021), and Gaussian Splatting (Kerbl et al.).
  • Figure 5: Overview of the evolution of physical cognition in video generation. Please note that the typical methods listed here cover only a subset of the relevant literature and do not represent all existing studies.
  • ...and 8 more figures