WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang

TL;DR

WISA addresses the mismatch between abstract physical laws and visual video generation by decomposing physics into textual descriptions, qualitative categories, and quantitative properties. It introduces Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier to inject physics guidance into diffusion-based T2V models, coupled with a new WISA-32K dataset of 32K clips across dynamics, thermodynamics, and optics. Empirical results show improved physical law consistency and semantic fidelity with only a small runtime and parameter overhead, validated on VideoCon-Physics, PhyGenBench, and qualitative comparisons. The work provides a scalable framework for physics-aware video synthesis and offers resources and methods to push toward robust world simulators.

Abstract

Recent rapid advancements in text-to-video (T2V) generation, such as Sora and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and to generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information, due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model's physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. Visual examples of WISA and WISA-32K are available at https://360cvgroup.github.io/WISA/.
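
The three-part decomposition described in the abstract (textual description, qualitative categories, quantitative properties) can be pictured as a small record attached to each clip. The following is a minimal sketch under stated assumptions, not the authors' annotation schema: the class name `PhysicalAnnotation`, its field names, the property key `temperature_celsius`, and the example caption are all hypothetical. Only the three domain names come from the text, and the 17 finer-grained phenomena are not listed here, so the encoding stays at the domain level.

```python
from dataclasses import dataclass, field

# The three physics domains named in the abstract. The 17 finer-grained
# phenomena are not enumerated in this text, so they are not modeled here.
DOMAINS = ("dynamics", "thermodynamics", "optics")


@dataclass
class PhysicalAnnotation:
    """Hypothetical container for one clip's structured physical annotation."""

    description: str                 # textual physical description
    categories: tuple[str, ...]      # qualitative physical categories
    properties: dict[str, float] = field(default_factory=dict)  # quantitative

    def category_vector(self) -> list[int]:
        """Multi-hot encoding of the qualitative categories over DOMAINS."""
        unknown = set(self.categories) - set(DOMAINS)
        if unknown:
            raise ValueError(f"unknown categories: {sorted(unknown)}")
        return [int(d in self.categories) for d in DOMAINS]


# Illustrative values only; not taken from WISA-32K.
example = PhysicalAnnotation(
    description="An ice cube melts on a warm plate and the water spreads.",
    categories=("thermodynamics", "dynamics"),
    properties={"temperature_celsius": 25.0},
)
print(example.category_vector())  # [1, 1, 0]
```

A multi-hot vector is used because a single clip may involve more than one category at once; the abstract itself notes that phenomena often co-occur.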

Paper Structure

This paper contains 29 sections, 3 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Overview of our physical dataset WISA-32K. (Left) Examples of 17 physical phenomena across 3 physics categories in WISA-32K. (Top right) WISA-32K contains approximately 32,000 video clips, with 47% related to Dynamics, 24% to Thermodynamics, and 29% to Optics. (Bottom right) Distribution of frame counts across all videos in WISA-32K.
  • Figure 2: Comparison between general scene videos in Koala-36M and videos with distinct physical phenomena in WISA-32K.
  • Figure 3: Pipeline of WISA-32K. We first define 17 common physical phenomena and, based on these, manually collect 32,000 video samples that clearly illustrate them. Then, we perform shot detection and aesthetic filtering on the raw videos. Text descriptions are extracted using Qwen2-VL, and detailed physical annotations are generated with GPT-4o mini.
  • Figure 4: Overview of the proposed WISA. WISA introduces the Physical Module and Physical Classifier, which leverage structured physical annotations to guide and assist T2V models in generating physics-aware videos. Specifically, for qualitative physical categories, WISA constructs a Mixture-of-Physical-Experts Attention within the Physical Module, where each attention head corresponds to a specific physical phenomenon. The relevant physical expert is activated by the input qualitative physical category. The Physical Classifier predicts the physical categories relevant to the video and is supervised by the input categories, helping the model understand abstract physical principles.
  • Figure 5: Qualitative comparison between WISA and existing T2V methods. WISA exhibits better alignment with real-world physical laws.
  • ...and 10 more figures
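
The Figure 4 caption describes attention heads that each correspond to a physical phenomenon and are activated by the input category. One way to picture this is a multi-head attention layer whose per-head outputs are gated by a 0/1 category vector. The sketch below is an illustration under stated assumptions, not the paper's MoPA implementation: the function name `gated_multi_head_attention`, the tensor shapes, the use of three heads, and the hard zero/one gating are all hypothetical, and the real module may integrate with the backbone differently.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_multi_head_attention(x, wq, wk, wv, head_mask):
    """Self-attention whose heads are switched on or off by a category mask.

    x:         (tokens, dim) input features
    wq/wk/wv:  (heads, dim, head_dim) per-head projection weights
    head_mask: (heads,) 0/1 vector, one entry per physical category (assumed)
    returns:   (tokens, heads * head_dim)
    """
    q = np.einsum("td,hde->hte", x, wq)
    k = np.einsum("td,hde->hte", x, wk)
    v = np.einsum("td,hde->hte", x, wv)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(q.shape[-1])
    out = np.einsum("hts,hse->hte", softmax(scores), v)  # (heads, tokens, head_dim)
    out = out * head_mask[:, None, None]                 # zero out inactive experts
    # Concatenate heads: head h fills columns [h * head_dim, (h + 1) * head_dim).
    return out.transpose(1, 0, 2).reshape(x.shape[0], -1)


rng = np.random.default_rng(0)
tokens, dim, heads, head_dim = 4, 8, 3, 2
x = rng.normal(size=(tokens, dim))
wq, wk, wv = (rng.normal(size=(heads, dim, head_dim)) for _ in range(3))
mask = np.array([1.0, 0.0, 1.0])  # first and third experts active (illustrative)
y = gated_multi_head_attention(x, wq, wk, wv, mask)
print(y.shape)  # (4, 6)
```

With this mask, the two output columns belonging to the second head are exactly zero, so only the experts matching the input categories contribute to the result.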