Image Generation Models: A Technical History

Rouzbeh Shirvani

TL;DR

This paper surveys breakthrough image generation models: variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods.

Abstract

Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, covering its underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also review recent developments in video generation and the research that made it possible to move from still frames to high-quality video. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.

Paper Structure (52 sections, 140 equations, 101 figures, 15 tables)

Introduction
Variational Autoencoders
How VAEs work?
Reparameterization Trick
KL Collapse: when the model ignores z
Blurry Reconstructions in VAEs
Conditional VAEs
Other Variants of VAEs
DRAW: Deep Attention Recurrent Writer
Vector Quantized VAE
Deep Hierarchical VAEs
Conclusion
Generative Adversarial Networks
Training and Objective Function
Conditional GANs
...and 37 more sections

Figures (101)

Figure 1: Comparison of AE latent density and VAE sampled latent $z$. VAE has a smooth and favorable latent space compared to the latent space in AE. Image source: kingma2013auto
Figure 2: Diagram of the end to end VAE model architecture with reparameterization trick
Figure 3: PixelCNN and conditioning on prior pixels. Image adapted from daCosta_autoencoder2021
Figure 4: PixelVAE utilizing PixelCNN to generate sharper images. Image adapted from gulrajani2016pixelvae
Figure 5: Conditional VAE: You can think of $X$ as the condition upon which the output is being generated.
...and 96 more figures
