Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame

Qinglong Cao, Xirui Li, Ding Wang, Chao Ma, Yuntian Chen, Xiaokang Yang

TL;DR

This work tackles the challenge of generating scientifically plausible videos of phenomena such as fluid flows and typhoon dynamics from a single initial frame. It introduces a latent knowledge grounding approach that decouples static structure (via a masked autoencoder) from dynamic evolution (via optical flow), then converts these cues into pseudo-language prompts using a quaternion-based projection aligned with CLIP in both the spatial and frequency domains. The prompts condition a video diffusion model through LoRA-based fine-tuning, enabling physically consistent generation without language annotations. Experiments on CFD simulations and real-world typhoon data show improved fidelity and physical consistency over strong baselines, highlighting the method's potential to bridge generative video models and scientific phenomena.
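For readers unfamiliar with the LoRA-based fine-tuning mentioned above, the following is a minimal NumPy sketch of the generic LoRA idea: a frozen weight matrix plus a trainable low-rank update. It is not the authors' implementation; the function name `lora_forward`, the dimensions, the rank, and the `alpha` scaling value are all assumptions chosen for illustration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Apply a frozen weight W plus a low-rank LoRA update to x.

    x: (batch, d_in) inputs
    W: (d_out, d_in) frozen pretrained weight
    A: (r, d_in) and B: (d_out, r) are the trainable low-rank factors
    alpha: scaling hyperparameter (illustrative value, an assumption)
    """
    r = A.shape[0]
    scale = alpha / r
    # Equivalent to x @ (W + scale * B @ A).T, without forming the full update.
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2                 # hypothetical sizes
W = rng.normal(size=(d_out, d_in))       # frozen weight
A = rng.normal(size=(r, d_in)) * 0.01    # small random init
B = np.zeros((d_out, r))                 # zero init: the update starts at zero
x = rng.normal(size=(4, d_in))

y0 = lora_forward(x, W, A, B)            # identical to the frozen model at init
```

Only `A` and `B` would be trained, which is why this style of fine-tuning suits the data-constrained setting the summary describes.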

Abstract

Video diffusion models have achieved impressive results in natural scene generation, yet they struggle to generalize to scientific phenomena such as fluid simulations and meteorological processes, where the underlying dynamics are governed by scientific laws. These tasks pose unique challenges, including severe domain gaps, limited training data, and the lack of descriptive language annotations. To address these challenges, we extract latent knowledge of scientific phenomena and propose a new framework that teaches video diffusion models to generate scientific phenomena from a single initial frame. Specifically, static knowledge is extracted via pre-trained masked autoencoders, while dynamic knowledge is derived from pre-trained optical flow prediction. Subsequently, based on the aligned spatial relations between the CLIP vision and language encoders, the visual embeddings of scientific phenomena, guided by the latent knowledge, are projected to generate pseudo-language prompt embeddings in both the spatial and frequency domains. By incorporating these prompts and fine-tuning the video diffusion model, we enable the generation of videos that better adhere to scientific laws. Extensive experiments on both computational fluid dynamics simulations and real-world typhoon observations demonstrate the effectiveness of our approach, which achieves superior fidelity and consistency across diverse scientific scenarios.
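The abstract does not specify how the projection operates in the two domains. As a purely hypothetical illustration of what processing an embedding "in both the spatial and frequency domains" can mean, the sketch below combines a linear map on the raw embedding with a per-frequency gain applied through a real FFT. The names `pseudo_prompt`, `W_spatial`, and `freq_gain` are assumptions, and the paper's quaternion-based projection and knowledge guidance are not modeled here.

```python
import numpy as np

def pseudo_prompt(v, W_spatial, freq_gain):
    """Map a visual embedding v of shape (d,) to a pseudo-prompt embedding of shape (d,).

    W_spatial: (d, d) linear map applied directly to the embedding
    freq_gain: (d // 2 + 1,) per-frequency gains applied to the rFFT of v
    Both parameter sets are hypothetical stand-ins for learned weights.
    """
    spatial = W_spatial @ v                        # spatial-domain branch
    spectrum = np.fft.rfft(v)                      # frequency-domain branch
    filtered = np.fft.irfft(spectrum * freq_gain, n=v.shape[0])
    return spatial + filtered                      # fuse the two branches by summation

rng = np.random.default_rng(0)
d = 16                                             # hypothetical embedding size
v = rng.normal(size=d)
W_spatial = rng.normal(size=(d, d)) / np.sqrt(d)
freq_gain = np.ones(d // 2 + 1)
out = pseudo_prompt(v, W_spatial, freq_gain)
```

Summation is only one possible way to fuse the branches; concatenation or gating would be equally valid choices in a sketch like this.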

Paper Structure

This paper contains 16 sections, 28 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our approach integrates latent scientific phenomenon knowledge into video diffusion models via parameter-efficient fine-tuning, enabling more consistent and plausible generation under data-constrained scenarios.
  • Figure 2: Overview of our proposed method. An MAE and an optical flow predictor extract latent physical phenomenon knowledge. CLIP vision features, guided by this knowledge, are projected to obtain pseudo-language prompt embeddings. These embeddings are then incorporated to generate more physically plausible phenomena.
  • Figure 3: Qualitative results on the fluid simulation datasets. Our method, guided by latent physical knowledge, produces phenomena more consistent with physical laws.
  • Figure 4: Qualitative comparisons on the real-world typhoon dataset. The red box marks hallucinated content.
  • Figure 5: Qualitative comparisons on the fluid simulation dataset. By incorporating physical phenomenon knowledge, our method generates plausible phenomena that align better with physical laws.
  • ...and 1 more figure