Video Diffusion Models: A Survey
Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter
TL;DR
Video diffusion models have emerged as a versatile tool for synthetic video creation and editing, but they face challenges in temporal consistency, long-horizon generation, and data efficiency. This survey systematically reviews formulations (DDPM and score-based), architectures (UNet, ViT, LDM, CDM), temporal dynamics, and training and evaluation practices, along with a broad spectrum of applications including text-to-video, image-conditioned generation, video editing, and multimodal synthesis, while highlighting benchmarks and ethical considerations. It also discusses long-video generation strategies, post-editing methods, and potential directions such as flow-matching models and diffusion-based decision-making. The work aims to serve researchers and practitioners by consolidating state-of-the-art developments and outlining practical challenges and future opportunities.
Abstract
Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models
