Video Diffusion Models: A Survey

Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter

TL;DR

Video diffusion models have emerged as a versatile tool for synthetic video creation and editing, but they still face challenges in temporal consistency, long-horizon generation, and data efficiency. This survey systematically reviews formulations (DDPM and score-based models), architectures (UNet, ViT, latent diffusion models, and cascaded diffusion models), temporal dynamics, and training and evaluation practices. It covers a broad spectrum of applications, including text-to-video, image-conditioned generation, video editing, and multimodal synthesis, and highlights benchmarks and ethical considerations. It also discusses long-video generation strategies, post-editing methods, and potential directions such as flow-matching models and diffusion-based decision-making. The work aims to serve researchers and practitioners by consolidating state-of-the-art developments and outlining practical challenges and future opportunities.

Abstract

Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models

Paper Structure

This paper contains 48 sections, 18 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview of the key aspects of video diffusion models that we cover in this survey.
  • Figure 2: Applications of video diffusion models. Bounding boxes are clickable links to relevant chapters. Example images taken from the following papers (top to bottom): blattmann2023align, ho2022imagen, singer2022make, lu2023vdt, yin2023nuwa, lee2023aadiff, stypulkowski2023diffused, wu2022tune, xing2023make, ma2023follow, liu2023color
  • Figure 3: Diffusion models add noise to the observed data in the forward process and are trained to learn how to reverse the process. Denoising diffusion probabilistic models (DDPMs) and score-based models (SBMs) are popular approaches that are mathematically equivalent but provide different perspectives.
  • Figure 4: The denoising UNet architecture typically used in text-to-image diffusion models. The model iteratively predicts a denoised version of the noisy input image. The image is processed through a number of encoding layers and the same number of decoding layers that are linked through residual connections. Each layer consists of ResNet blocks implementing convolutions, as well as Vision Transformer self-attention and cross-attention blocks. Self-attention shares information across image patches, while cross-attention conditions the denoising process on text prompts.
  • Figure 5: Limitations of text-to-video diffusion models for generating consistent videos. (Top) When using only a text prompt ("Michael Jordan running"), both the appearance and position of objects change wildly between video frames. (Bottom) Conditioning on spatial information from a reference video can produce consistent movement, but the appearance of objects and the background still fluctuate between video frames.
  • ...and 8 more figures
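
The forward process described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and a linear noise schedule; the schedule endpoints, the step count, the toy clip size, and all function names are illustrative assumptions and are not taken from the survey.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in one step using the closed form."""
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
alpha_bar = cumulative_alpha(linear_beta_schedule(1000))

# Toy "video": 4 frames of 8x8 grayscale pixels in [-1, 1], flattened to a list.
clip = [rng.uniform(-1.0, 1.0) for _ in range(4 * 8 * 8)]
x_t, eps = forward_noise(clip, 100, alpha_bar, rng)
```

In this sketch every frame of the clip is noised with the same timestep, and a denoising network would be trained to recover `eps` (or an equivalent target) from `x_t`. As `t` approaches the final step, `alpha_bar[t]` approaches zero and `x_t` becomes nearly pure Gaussian noise.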