Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu

TL;DR

This survey comprehensively analyzes diffusion-based video generation, tracing its evolution from GANs and autoregressive approaches to modern diffusion techniques. It organizes foundational methods (DDPM, DDIM, EDM), guidance strategies, and architectural frameworks (UNet, DiT, VAEs) within a unified taxonomy, and catalogs implementations, datasets, evaluation benchmarks, and industry solutions. The paper highlights key applications across conditioning modalities, enhancement tasks, personalization, and 3D/4D generation, while also addressing ethical considerations and long-term challenges such as computational efficiency and safety. Overall, it positions diffusion-based video generation as a rapidly evolving field with broad practical impact, offering a foundational resource for researchers and practitioners and guiding future research toward efficient, physically grounded, and responsibly deployed systems.

Abstract

Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial network (GAN)-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c), which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more up-to-date, and more fine-grained perspective on diffusion-based approaches, with dedicated sections on evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of the related works covered in this survey is also available at https://github.com/Eyeline-Research/Survey-Video-Diffusion.

Paper Structure

This paper contains 66 sections, 15 equations, 26 figures, 2 tables, and 4 algorithms.

Figures (26)

  • Figure 1: Overview of video generation methods. In general, the input conditions can be noise, images, videos, audio, text, and 3D point clouds. The architectures (UNet, VAE, and/or Transformers) are trained as GANs or diffusion models, under different paradigms, on real-world or synthetic data. Applications span several areas, e.g., video personalization, consistency-aware generation, and long video generalization. Conversely, video generation models can also benefit other video tasks, such as video retrieval, understanding, and representation learning.
  • Figure 2: The directed graphical model for DDPM (Ho et al., 2020).
  • Figure 3: An example of optical flow usage in the video setting: Go-with-the-Flow (Burgert et al., 2025), a novel framework that predicts 3D scene dynamics across sequential frames, outperforming conventional single-frame approaches. By integrating multi-frame spatial-temporal relationships, it improves depth accuracy and visual fidelity in scene reconstruction.
  • Figure 4: A pipeline for diffusion-based visual content generation. A pre-trained variational autoencoder (VAE), such as a 2D VAE, 3D VAE, VQGAN, or TAE, encodes input images or videos into a lower-dimensional latent representation. Within this latent space, diffusion models (e.g., DDPM, DDIM, EDM) iteratively introduce noise and employ neural architectures, such as U-Net or Transformer-based models, to learn a denoising process that reconstructs high-fidelity outputs. User-provided textual prompts are refined by large language models (e.g., T5, CLIP, GPT) before being mapped into an embedding space via a trained text encoder. This embedding serves as a conditioning mechanism, guiding the diffusion process to ensure semantic coherence with the input prompt. Furthermore, the framework integrates optical flow estimation methods (e.g., FlowNet, RAFT) to enhance motion consistency in generated video sequences and incorporates human feedback mechanisms (e.g., VIDEORM) to iteratively improve generation quality.
  • Figure 5: Classic temporal attention design from Blattmann et al. (2023), showing how temporal attention mechanisms improve video generation quality by maintaining temporal coherence across frames.
  • ...and 21 more figures
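
The noising step that the Figure 2 and Figure 4 captions refer to can be sketched in a few lines. This is a minimal illustration, not the survey's implementation. It assumes the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with the linear beta schedule from Ho et al. (2020). The function names are hypothetical, and a flat list of floats stands in for a VAE latent.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (defaults follow Ho et al., 2020)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, eps)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * x + sigma * e for x, e in zip(latent, eps)]
    return noisy, eps


def predict_clean(noisy, eps_hat, t, alpha_bars):
    """Recover an estimate of x_0 from x_t and a noise estimate eps_hat."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - sigma * e) / signal for x, e in zip(noisy, eps_hat)]


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    rng = random.Random(0)
    x0 = [0.5, -1.0, 2.0]  # toy stand-in for a VAE latent
    xt, eps = add_noise(x0, 500, alpha_bars, rng)
    # With the true noise, inversion returns x0 up to floating-point error.
    print(predict_clean(xt, eps, 500, alpha_bars))
```

In a trained system, `eps_hat` would come from the denoising network (a U-Net or Transformer conditioned on the text embedding) rather than from the true noise. The sketch passes the true noise only to show that the forward step is invertible when the noise is known.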