Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

Zhiyuan Ma, Yuzhu Zhang, Guoli Jia, Liangliang Zhao, Yichao Ma, Mingjie Ma, Gaofeng Liu, Kaiyan Zhang, Jianjun Li, Bowen Zhou

TL;DR

This survey offers an efficiency-oriented perspective on existing work on diffusion models, focusing on the principles and efficient practices in architecture design, model training, fast inference, and reliable deployment, to guide further theoretical research, algorithm migration, and model application in new scenarios in a reader-friendly way.

Abstract

As one of the most popular and sought-after classes of generative models in recent years, diffusion models have attracted the interest of many researchers and have steadily shown strong results in generative tasks such as image synthesis, video generation, molecule design, 3D scene rendering, and multimodal generation, relying on their dense theoretical principles and reliable application practices. The success of these recent efforts comes largely from progressive design principles and efficient methodologies for architecture, training, inference, and deployment. However, no comprehensive, in-depth review has yet summarized these principles and practices to support the rapid understanding and application of diffusion models. In this survey, we provide a new efficiency-oriented perspective on these existing efforts, focusing on the principles and efficient practices in architecture design, model training, fast inference, and reliable deployment, to guide further theoretical research, algorithm migration, and model application in new scenarios in a reader-friendly way. https://github.com/ponyzym/Efficient-DMs-Survey

Paper Structure

This paper contains 39 sections, 19 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: The timeline of efficient DMs.
  • Figure 2: Organization of efficient diffusion models advancements.
  • Figure 3: A universal pipeline of diffusion-based models for visual content generation. A pre-trained VAE (with encoder and decoder structures) compresses the input image or video into a latent space. Diffusion models add noise to the latent features and train a neural network (e.g., U-Net or Transformer) for denoising. User-input text instructions are refined by a large language model and then encoded by a trained text encoder into an embedding space, which is injected into the diffusion model to control content generation.
  • Figure 4: A standard encoder-decoder architecture of 3D Variational Autoencoders (VAEs) used for video compression.
  • Figure 5: The mainstream neural network backbones serving as denoisers in diffusion models, including U-shaped denoising networks (U-Net-based and U-ViT-based) and F-shaped denoising networks (DiT-based and SSM-based).
  • ...and 12 more figures