DiffEnc: Variational Diffusion with a Learned Encoder

Beatrix M. G. Nielsen; Anders Christensen; Andrea Dittadi; Ole Winther

TL;DR

DiffEnc extends variational diffusion models with a time-dependent encoder that modifies the mean of the forward diffusion process during training while leaving sampling time unchanged. The authors give a continuous-time analysis showing that the ELBO is well-defined only when the forward and generative variances are equal, and they reinterpret the finite-depth loss as a weighted diffusion loss that also allows the noise schedule to be optimized. They propose two encoder parameterizations (trainable and non-trainable) and a v-prediction loss formulation, with closed-form expressions for the diffusion losses. Empirically, DiffEnc improves likelihood on CIFAR-10, achieves competitive results on other datasets, and learns nontrivial, timestep-dependent transformations in the encoder. The work suggests a flexible framework for incorporating learned encoders into diffusion models, one that could be combined with other diffusion enhancements in the future.
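To make the idea of an encoder-modified forward mean concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The specific forms used below are assumptions that do not appear in the text above: a variance-preserving schedule with $\alpha_t^2 = \mathrm{sigmoid}(\lambda_t)$ and $\sigma_t^2 = \mathrm{sigmoid}(-\lambda_t)$, a non-trainable encoding $\mathbf{x}_t = \alpha_t^2 \mathbf{x}$, and a trainable encoding $\mathbf{x}_t = \alpha_t^2 \mathbf{x} + \sigma_t^2 \mathbf{y}_\phi(\mathbf{x}, \lambda_t)$. The names `encode`, `sample_z`, and `toy_y_phi` are hypothetical, and `toy_y_phi` is only a stand-in for a neural network.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def encode(x, lam, y_phi=None):
    """Depth-dependent encoding x_t of a flat list of pixel values x.

    Assumed forms (illustrative, not taken from the text above):
      non-trainable: x_t = alpha_t^2 * x
      trainable:     x_t = alpha_t^2 * x + sigma_t^2 * y_phi(x, lam)
    where lam is the log signal-to-noise ratio at time t.
    """
    alpha2 = sigmoid(lam)
    sigma2 = sigmoid(-lam)
    if y_phi is None:
        return [alpha2 * xi for xi in x]
    y = y_phi(x, lam)
    return [alpha2 * xi + sigma2 * yi for xi, yi in zip(x, y)]


def sample_z(x, lam, rng, y_phi=None):
    """Draw z_t ~ N(alpha_t * x_t, sigma_t^2 I), with x_t the encoded input."""
    alpha = math.sqrt(sigmoid(lam))
    sigma = math.sqrt(sigmoid(-lam))
    x_t = encode(x, lam, y_phi)
    return [alpha * xt + sigma * rng.gauss(0.0, 1.0) for xt in x_t]


def toy_y_phi(x, lam):
    """Hypothetical stand-in for a network: shifts pixels toward zero mean."""
    mean = sum(x) / len(x)
    return [-mean for _ in x]


rng = random.Random(0)
x = [1.0, 0.5, -0.25]
z_plain = sample_z(x, lam=0.0, rng=rng)                      # non-trainable encoder
z_learned = sample_z(x, lam=0.0, rng=rng, y_phi=toy_y_phi)   # "trainable" encoder
```

Under these assumptions the encoding is close to the identity at high log-SNR (small $t$) and departs from it as noise grows, and the encoder is only needed to build training targets, not to run the sampler.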

Abstract

Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: For a finite depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.
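The role of the variance ratio can be illustrated with the standard KL divergence between two isotropic Gaussians. The notation here is an assumption made for illustration and is not taken from the abstract: $q = \mathcal{N}(\boldsymbol{\mu}_Q, \sigma_Q^2 \mathbf{I})$ stands for one reverse step of the encoder process, $p = \mathcal{N}(\boldsymbol{\mu}_P, \sigma_P^2 \mathbf{I})$ for the matching generative step, $d$ for the data dimension, and $w = \sigma_P^2 / \sigma_Q^2$ for the weight.

$$
\mathrm{KL}(q \,\|\, p) = \frac{d}{2}\left(\frac{1}{w} - 1 + \ln w\right) + \frac{1}{2 w \sigma_Q^2}\,\lVert \boldsymbol{\mu}_Q - \boldsymbol{\mu}_P \rVert^2
$$

In this sketch the second term is the usual squared-error diffusion loss rescaled by $1/w$, which is one way to read the "weighted diffusion loss" for a finite hierarchy. The first term is non-negative and vanishes only at $w = 1$. With a fixed $w \neq 1$, a hierarchy of $T$ steps contributes $T$ such constants, so their sum grows without bound as $T \to \infty$, which is consistent with the requirement that the weight be 1 in the infinite-depth case.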

Paper Structure (47 sections, 142 equations, 6 figures, 8 tables)

Introduction
Preliminaries on Variational Diffusion Models
DiffEnc
Parameterization of the Encoder and Generative Model
$\mathbf{v}$-parameterization.
Experiments
Experimental Setup.
Results.
Related Work
Limitations and Future Work
Conclusion
Appendix
Overview of diffusion model with and without encoder
Proof that z_t given x has the correct form
Proof that the reverse process has the correct form
...and 32 more sections

Figures (6)

Figure 1: Overview of DiffEnc compared to standard diffusion models. The effect of the encoding has been amplified 5x for the sake of illustration.
Figure 2: Changes induced by the encoder on the encoded image at different timesteps: $(\mathbf{x}_t - \mathbf{x}_s)/(t-s)$ for $t = 0.4, 0.6, 0.8, 1.0$ and $s = t - 0.1$. Changes have been summed over the channels, with red and blue denoting positive and negative changes, respectively. For $t \to 1$, global properties such as the approximate position of objects are encoded, whereas for smaller $t$ changes are more fine-grained and tend to enhance high contrast within objects and/or between object and background.
Figure 3: Comparison of unconditional samples from the models. The small model struggles to produce realistic images, while the large models are significantly better, as expected. For some images, details differ between the two large models; for others, the models disagree on the main element of the image. For example, in column 9 the two models produce two different cars, and in column 7 DiffEnc-32-4 produces a car while VDMv-32 produces a frog.
Figure 4: Encoded MNIST images from DiffEnc-8-2. Encoded images are close to the identity up to $t = 0.7$. From $t = 0.8$ to $t = 0.9$ the encoder slightly blurs the numbers, and from $t = 0.9$ it makes the background lighter, but keeps the high contrast in the middle of the image. Intuitively, the encoder improves the latent loss by bringing the average pixel value close to 0.
Figure 5: 100 unconditional samples from a DiffEnc-32-4 (above) and VDMv-32 (below) after 8 million training steps.
...and 1 more figure

Theorems & Definitions (1)

proof
