Generalized Denoising Auto-Encoders as Generative Models

Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent

TL;DR

The paper addresses the problem of linking regularized auto-encoder training to the underlying data-generating distribution beyond Gaussian, continuous settings. It introduces a generalized denoising auto-encoder framework that learns $P(X|\tilde{X})$ under arbitrary corruption ${\cal C}(\tilde{X}|X)$ and reconstruction loss, enabling a Markov-chain-based sampler whose stationary distribution estimates ${\cal P}(X)$. A key contribution is the consistency analysis showing convergence to the true data distribution under ergodicity and estimator consistency, along with the insight that local corruption yields simpler conditional densities amenable to energy-based interpretations. The work also introduces walkback training to mitigate spurious modes, supported by experiments on synthetic data and MNIST that demonstrate improved sampling quality and density estimates compared with prior approaches.

Abstract

Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).

Paper Structure

This paper contains 9 sections, 3 theorems, 5 equations, 3 figures, 2 algorithms.

Key Result

Theorem 1

If $P_{\theta_n}(X | \tilde{X})$ is a consistent estimator of the true conditional distribution ${\cal P}(X | \tilde{X})$ and $T_n$ defines an ergodic Markov chain, then as the number of examples $n\rightarrow \infty$, the asymptotic distribution $\pi_n(X)$ of the generated samples converges to the data-generating distribution ${\cal P}(X)$.
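
The statement of Theorem 1 can be checked numerically on a small discrete example. The sketch below is not the paper's code: it assumes a hypothetical 4-state distribution and a hypothetical ring-shaped local corruption, and it replaces the learned $P_{\theta_n}(X|\tilde{X})$ with the exact conditional obtained by Bayes' rule, which stands in for the $n\rightarrow\infty$ limit of a consistent estimator. All function and variable names are illustrative. The chain alternates corruption and denoising, and its stationary distribution is compared with ${\cal P}(X)$.

```python
import numpy as np

# Toy data-generating distribution P(X) over 4 states arranged on a ring.
# The values are illustrative assumptions, not taken from the paper.
p_x = np.array([0.50, 0.10, 0.05, 0.35])
K = len(p_x)

# Local corruption C(X_tilde | X): stay with prob. 0.6, move to either
# ring neighbour with prob. 0.2. corruption[i, j] = C(X_tilde=j | X=i).
corruption = np.zeros((K, K))
for i in range(K):
    corruption[i, i] = 0.6
    corruption[i, (i - 1) % K] = 0.2
    corruption[i, (i + 1) % K] = 0.2


def denoising_posterior(p_x, corruption):
    """Exact P(X | X_tilde) via Bayes' rule; result[i, j] = P(X=i | X_tilde=j)."""
    joint = p_x[:, None] * corruption      # joint[i, j] = P(X=i) C(j | i)
    p_xtilde = joint.sum(axis=0)           # marginal P(X_tilde=j)
    return joint / p_xtilde[None, :]


def chain_transition(corruption, posterior):
    """One step X -> X_tilde -> X': T[i, k] = sum_j C(j | i) P(X=k | X_tilde=j)."""
    return corruption @ posterior.T


def stationary(T, n_iter=2000):
    """Stationary distribution of T by repeated application from a uniform start."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        pi = pi @ T
    return pi


posterior = denoising_posterior(p_x, corruption)
T = chain_transition(corruption, posterior)
pi = stationary(T)

print("P(X)       :", p_x)
print("stationary :", pi)
print("max abs gap:", np.abs(pi - p_x).max())
```

In this toy setting every entry of the transition matrix is positive, so the chain is ergodic and the stationary distribution is unique. With a learned conditional instead of the exact one, the match would only be approximate and would depend on how well the estimator fits.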

Figures (3)

  • Figure 1: Although ${\cal P}(X)$ may be complex and multi-modal, ${\cal P}(X|\tilde{X})$ is often simple and approximately unimodal (e.g., multivariate Gaussian, pink oval) for most values of $\tilde{X}$ when ${\cal C}(\tilde{X}|X)$ is a local corruption. ${\cal P}(X)$ can be seen as an infinite mixture of these local distributions (weighted by ${\cal P}(\tilde{X})$).
  • Figure 2: Top left: histogram of a data-generating distribution (true, blue), the empirical distribution (red), and the estimated distribution using a denoising maximum likelihood estimator. Other figures: pairs of variables (out of 10) showing the training samples and the model-generated samples.
  • Figure 3: Successive samples generated by the Markov chain associated with the trained DAEs according to the plain sampling scheme (left) and the walkback sampling scheme (right). There are fewer "spurious" samples with the walkback algorithm.
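
The mixture view mentioned in the Figure 1 caption can be written out explicitly. The identities below follow from the definitions of the corruption ${\cal C}(\tilde{X}|X)$ and the marginal ${\cal P}(\tilde{X})$ alone; for discrete variables the integrals become sums.

$$
{\cal P}(\tilde{X}) = \int {\cal C}(\tilde{X}|X)\,{\cal P}(X)\,dX,
\qquad
{\cal P}(X|\tilde{X}) = \frac{{\cal C}(\tilde{X}|X)\,{\cal P}(X)}{{\cal P}(\tilde{X})},
\qquad
{\cal P}(X) = \int {\cal P}(X|\tilde{X})\,{\cal P}(\tilde{X})\,d\tilde{X}.
$$

The last identity is the "infinite mixture": each component ${\cal P}(X|\tilde{X})$ is a local distribution and ${\cal P}(\tilde{X})$ supplies the mixture weights.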

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Proposition 1
  • proof