DiffEnc: Variational Diffusion with a Learned Encoder
Beatrix M. G. Nielsen, Anders Christensen, Andrea Dittadi, Ole Winther
TL;DR
DiffEnc extends variational diffusion models with a trainable, time-dependent encoder that modifies the mean of the forward diffusion process during training while leaving sampling time unchanged. The authors give a continuous-time analysis showing that the ELBO is well-defined only when the forward and generative variances are equal, and they reinterpret the finite-depth loss as a weighted diffusion loss that can also be used to optimize the noise schedule. They propose two encoder parameterizations (trainable and non-trainable) and a v-prediction loss formulation, with closed-form expressions for the diffusion losses. Empirically, DiffEnc improves likelihood on CIFAR-10, the encoder learns nontrivial, timestep-dependent transformations, and results on other datasets are competitive. The work suggests a flexible framework for incorporating learned encoders into diffusion models, one that could be combined with other diffusion enhancements in the future.
Abstract
Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditional distributions in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two changes to the diffusion model that retain these advantages while adding flexibility to the model. Firstly, we introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10. Secondly, we let the ratio of the noise variances of the reverse encoder process and the generative process be a free weight parameter rather than being fixed to 1. This leads to theoretical insights: for a finite-depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be 1 to have a well-defined ELBO.
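The first change, a data- and depth-dependent mean in the diffusion process, can be illustrated with a minimal sketch. It assumes a variance-preserving cosine schedule and uses a hand-written toy function in place of a learned encoder. Neither is taken from the paper, and the names alpha_sigma, toy_encoder, forward_mean, and sample_z_t are hypothetical. The sketch only shows where an encoder would enter the forward process. It is not the authors' parameterization.

```python
import math
import random


def alpha_sigma(t):
    """Assumed variance-preserving cosine schedule: alpha_t^2 + sigma_t^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def toy_encoder(x, t):
    """Hypothetical stand-in for a learned encoder x_t = enc(x, t).

    It is the identity at t = 0 and shrinks the data towards zero as t grows.
    A real encoder would be a trained network; this is for illustration only.
    """
    return [(1.0 - 0.5 * t) * xi for xi in x]


def forward_mean(x, t, encoder=None):
    """Mean of q(z_t | x).

    Without an encoder this is alpha_t * x (standard diffusion); with an
    encoder it is alpha_t * enc(x, t), so it depends on both data and depth.
    """
    alpha, _ = alpha_sigma(t)
    x_t = x if encoder is None else encoder(x, t)
    return [alpha * xi for xi in x_t]


def sample_z_t(x, t, rng, encoder=None):
    """Draw z_t = mean + sigma_t * eps with eps ~ N(0, I)."""
    _, sigma = alpha_sigma(t)
    mean = forward_mean(x, t, encoder)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


x = [1.0, -2.0, 0.5]
rng = random.Random(0)
z_plain = sample_z_t(x, 0.5, rng)                       # standard forward process
z_encoded = sample_z_t(x, 0.5, rng, encoder=toy_encoder)  # encoder-modified mean
```

In this sketch only the forward mean changes. The noise scale sigma_t is the same with and without the encoder, which is what makes the encoder an addition to the diffusion process rather than a replacement for the noise schedule.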
