Image Generation Models: A Technical History

Rouzbeh Shirvani

TL;DR

This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods.

Abstract

Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including its underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also review recent developments in video generation and present the research that made it possible to move from still frames to high-quality video. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.

Paper Structure (52 sections, 140 equations, 101 figures, 15 tables)

Figures (101)

  • Figure 1: Comparison of AE latent density and VAE sampled latent $z$. VAE has a smooth and favorable latent space compared to the latent space in AE. Image source: kingma2013auto
  • Figure 2: Diagram of the end-to-end VAE model architecture with the reparameterization trick
  • Figure 3: PixelCNN and conditioning on prior pixels. Image adapted from daCosta_autoencoder2021
  • Figure 4: PixelVAE utilizing PixelCNN to generate sharper images. Image adapted from gulrajani2016pixelvae
  • Figure 5: Conditional VAE: You can think of $X$ as the condition upon which the output is being generated.
  • ...and 96 more figures