Video Diffusion Models: A Survey

Andrew Melnik; Michal Ljubljanac; Cong Lu; Qi Yan; Weiming Ren; Helge Ritter

TL;DR

Video diffusion models have emerged as a versatile tool for synthetic video creation and editing, but they face challenges in temporal consistency, long-horizon generation, and data efficiency. This survey systematically reviews formulations (DDPM and score-based), architectures (UNet, ViT, LDM, CDM), temporal dynamics, training and evaluation practices, and a broad spectrum of applications, including text-to-video, image-conditioned generation, video editing, and multimodal synthesis, while highlighting benchmarks and ethical considerations. It also discusses long-video generation strategies, post-editing methods, and potential directions such as flow-matching models and diffusion-based decision-making. The work aims to serve researchers and practitioners by consolidating state-of-the-art developments and outlining practical challenges and future opportunities.

Abstract

Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models

Paper Structure (48 sections, 18 equations, 13 figures, 4 tables)

Introduction
Taxonomy of Applications
Mathematical Formulation
Denoising Diffusion Probabilistic Model (DDPM) Formulation
Score-based Model Formulation
Architecture
UNet
Vision Transformer
Cascaded Diffusion Models
Latent Diffusion Models
Temporal Dynamics
Spatio-Temporal Attention Mechanisms
Temporal Upsampling
Structure Preservation
Training & Evaluation
...and 33 more sections

Figures (13)

Figure 1: Overview of the key aspects of video diffusion models that we cover in this survey.
Figure 2: Applications of video diffusion models. Bounding boxes are clickable links to relevant chapters. Example images taken from the following papers (top to bottom): blattmann2023align, ho2022imagen, singer2022make, lu2023vdt, yin2023nuwa, lee2023aadiff, stypulkowski2023diffused, wu2022tune, xing2023make, ma2023follow, liu2023color.
Figure 3: Diffusion models add noise to the observed data in the forward process and are trained to learn how to reverse the process. Denoising diffusion probabilistic model (DDPM) and score-based model (SBM) are popular approaches that are mathematically equivalent but provide different perspectives.
Figure 4: The denoising UNet architecture typically used in text-to-image diffusion models. The model iteratively predicts a denoised version of the noisy input image. The image is processed through a number of encoding layers and the same number of decoding layers that are linked through residual connections. Each layer consists of ResNet blocks implementing convolutions, as well as Vision Transformer self-attention and cross-attention blocks. Self-attention shares information across image patches, while cross-attention conditions the denoising process on text prompts.
Figure 5: Limitations of text-to-video diffusion models for generating consistent videos. (Top) When using only a text prompt ("Michael Jordan running"), both the appearance and position of objects change wildly between video frames. (Bottom) Conditioning on spatial information from a reference video can produce consistent movement, but the appearance of objects and the background still fluctuate between video frames.
...and 8 more figures
