Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame
Qinglong Cao, Xirui Li, Ding Wang, Chao Ma, Yuntian Chen, Xiaokang Yang
TL;DR
This work tackles the challenge of generating scientifically plausible videos of phenomena such as fluid flows and typhoon dynamics from a single initial frame. It introduces a latent knowledge grounding approach that decouples static structure (via a masked autoencoder) from dynamic evolution (via optical flow), then converts these cues into pseudo-language prompts using a quaternion-based projection aligned with CLIP in both the spatial and frequency domains. The prompts condition a video diffusion model through LoRA-based fine-tuning, enabling physically consistent generation without language annotations. Extensive experiments on CFD simulations and real-world typhoon data show improved fidelity and physical consistency over strong baselines, highlighting the method's potential to bridge generative video models and scientific phenomena.
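For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of the general low-rank adaptation idea for a single linear layer, not this paper's implementation. The function name `lora_linear`, the layer sizes, the rank, and the scaling value `alpha` are all illustrative assumptions.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    Generic LoRA formulation; only A and B would be trained.
    """
    r = A.shape[0]  # rank of the update
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4
W = rng.standard_normal((d_out, d_in))       # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialized
x = rng.standard_normal((2, d_in))

y = lora_linear(x, W, A, B, alpha=8.0)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning only has to learn the `r * (d_in + d_out)` entries of `A` and `B`.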
Abstract
Video diffusion models have achieved impressive results in natural scene generation, yet they struggle to generalize to scientific phenomena such as fluid simulations and meteorological processes, where the underlying dynamics are governed by scientific laws. These tasks pose unique challenges, including severe domain gaps, limited training data, and a lack of descriptive language annotations. To address these challenges, we extract latent knowledge of scientific phenomena and propose a new framework that teaches video diffusion models to generate scientific phenomena from a single initial frame. Specifically, static knowledge is extracted with pre-trained masked autoencoders, while dynamic knowledge is derived from pre-trained optical flow prediction. Then, based on the aligned spatial relations between the CLIP vision and language encoders, the visual embeddings of scientific phenomena, guided by this latent knowledge, are projected into pseudo-language prompt embeddings in both the spatial and frequency domains. By incorporating these prompts and fine-tuning the video diffusion model, we enable the generation of videos that better adhere to scientific laws. Extensive experiments on both computational fluid dynamics simulations and real-world typhoon observations demonstrate the effectiveness of our approach, which achieves superior fidelity and consistency across diverse scientific scenarios.
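As a rough illustration of projecting a visual embedding into prompt embeddings using both spatial and frequency information, the sketch below combines a linear map of the embedding with a linear map of its Fourier spectrum. It is an assumption-laden toy: it uses plain linear maps with random weights rather than the quaternion-based projection mentioned in the TL;DR, and the function `pseudo_prompt`, the dimensions, and the choice to sum the two branches are all hypothetical.

```python
import numpy as np

def pseudo_prompt(v, W_spatial, W_freq, n_tokens):
    """Map a visual embedding v to n_tokens unit-norm pseudo-prompt vectors."""
    # Spatial branch: linear projection of the raw embedding.
    spatial = W_spatial @ v
    # Frequency branch: project the real and imaginary parts of the spectrum.
    spectrum = np.fft.rfft(v)
    freq_feat = np.concatenate([spectrum.real, spectrum.imag])
    frequency = W_freq @ freq_feat
    # Assumption: the two branches are simply summed, then split into tokens.
    tokens = (spatial + frequency).reshape(n_tokens, -1)
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)

# Hypothetical sizes: a 32-d visual embedding mapped to 4 tokens of width 16.
rng = np.random.default_rng(0)
d_v, n_tokens, d_t = 32, 4, 16
n_freq = 2 * (d_v // 2 + 1)                  # real + imaginary rfft coefficients
v = rng.standard_normal(d_v)
W_spatial = rng.standard_normal((n_tokens * d_t, d_v))
W_freq = rng.standard_normal((n_tokens * d_t, n_freq))

prompt = pseudo_prompt(v, W_spatial, W_freq, n_tokens)
```

In a trained system the projection weights would be learned so that the resulting tokens can stand in for text-encoder outputs; here they are random and serve only to show the data flow and shapes.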
