Reweighted Wake-Sleep

Jörg Bornschein, Yoshua Bengio

TL;DR

The paper tackles the difficulty of training deep directed graphical models with many hidden layers by reframing the wake-sleep algorithm as an importance-sampling-based procedure. It introduces Reweighted Wake-Sleep (RWS), which draws $K$ samples from an approximate inference network $q({\boldsymbol h}|{\boldsymbol x})$ to form an importance-weighted estimator of the log-likelihood gradient; the bias and variance of this estimator shrink as $K$ grows. It demonstrates that more powerful layer models for the inference network, such as AR-SBN and NADE, yield substantially better generative performance than traditional SBNs, with autoregressive structures enabling better posterior estimation. Experiments on MNIST and CalTech 101 Silhouettes show that RWS with a small number of samples (around 5) achieves near state-of-the-art log-likelihoods and benefits from i.i.d. sampling of latent variables, avoiding the MCMC mixing issues commonly faced by alternative training methods.
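
As a sketch of the importance-weighted gradient estimator mentioned above, written in standard self-normalized importance-sampling form (the symbols $\omega_k$ and $\tilde{\omega}_k$ are notation assumed here for illustration): with $K$ samples ${\boldsymbol h}^{(k)} \sim q({\boldsymbol h}|{\boldsymbol x})$,

$$
\omega_k = \frac{p_\theta({\boldsymbol x}, {\boldsymbol h}^{(k)})}{q({\boldsymbol h}^{(k)}|{\boldsymbol x})},
\qquad
\tilde{\omega}_k = \frac{\omega_k}{\sum_{k'=1}^{K} \omega_{k'}},
$$

$$
\nabla_\theta \log p_\theta({\boldsymbol x})
= \mathbb{E}_{{\boldsymbol h} \sim p_\theta({\boldsymbol h}|{\boldsymbol x})}\!\left[ \nabla_\theta \log p_\theta({\boldsymbol x}, {\boldsymbol h}) \right]
\approx \sum_{k=1}^{K} \tilde{\omega}_k \, \nabla_\theta \log p_\theta({\boldsymbol x}, {\boldsymbol h}^{(k)}).
$$

Because the weights are normalized by their own sum, this estimator is biased for finite $K$ and becomes exact only in the limit of many samples. For $K=1$ the single weight is $\tilde{\omega}_1 = 1$, which gives an update of the same form as the wake phase of the original wake-sleep algorithm.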

Abstract

Training deep directed graphical models with many hidden variables and performing inference remains a major challenge. Helmholtz machines and deep belief networks are such models, and the wake-sleep algorithm has been proposed to train them. The wake-sleep algorithm relies on training not just the directed generative model but also a conditional generative model (the inference network) that runs backward from visible to latent, estimating the posterior distribution of latent given visible. We propose a novel interpretation of the wake-sleep algorithm which suggests that better estimators of the gradient can be obtained by sampling latent variables multiple times from the inference network. This view is based on importance sampling as an estimator of the likelihood, with the approximate inference network as a proposal distribution. This interpretation is confirmed experimentally, showing that better likelihood can be achieved with this reweighted wake-sleep procedure. Based on this interpretation, we propose that a sigmoidal belief network is not sufficiently powerful for the layers of the inference network in order to recover a good estimator of the posterior distribution of latent variables. Our experiments show that using a more powerful layer model, such as NADE, yields substantially better generative models.
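
The importance-sampling view of the likelihood can be illustrated with a minimal Python sketch. It assumes a toy single-layer model with four binary latent units, so that the exact $\log p({\boldsymbol x})$ can be computed by enumeration for comparison. All names (`log_joint`, `log_q`, `is_log_px`, `exact_log_px`), the layer sizes, and the random parameters are hypothetical and are not taken from the paper's code.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_bernoulli(v, p):
    # Sum of elementwise Bernoulli log-probabilities over the last axis.
    return np.sum(v * np.log(p) + (1 - v) * np.log(1 - p), axis=-1)

# Hypothetical toy parameters (small scale keeps the proposal reasonable).
_init = np.random.default_rng(0)
n_h, n_x = 4, 6
prior = sigmoid(0.5 * _init.normal(size=n_h))      # p(h)
W = 0.5 * _init.normal(size=(n_h, n_x))            # generative weights, p(x | h)
b = 0.5 * _init.normal(size=n_x)
V = 0.5 * _init.normal(size=(n_x, n_h))            # inference weights, q(h | x)
c = 0.5 * _init.normal(size=n_h)

def log_joint(x, h):
    # log p(x, h) = log p(h) + log p(x | h); h may be a batch of shape (K, n_h).
    return log_bernoulli(h, prior) + log_bernoulli(x, sigmoid(h @ W + b))

def log_q(h, x):
    # log q(h | x) for the factorized inference network used as proposal.
    return log_bernoulli(h, sigmoid(x @ V + c))

def exact_log_px(x):
    # Enumerate all 2**n_h latent configurations (only feasible for toy sizes).
    all_h = np.array(list(itertools.product([0, 1], repeat=n_h)), dtype=float)
    return np.logaddexp.reduce(log_joint(x, all_h))

def is_log_px(x, K, rng):
    # Draw K i.i.d. latent samples from the proposal q(h | x).
    h = (rng.random((K, n_h)) < sigmoid(x @ V + c)).astype(float)
    # Log importance weights: log p(x, h) - log q(h | x).
    log_w = log_joint(x, h) - log_q(h, x)
    log_sum_w = np.logaddexp.reduce(log_w)
    # Log of the K-sample average weight, and the normalized weights.
    return log_sum_w - np.log(K), np.exp(log_w - log_sum_w)

if __name__ == "__main__":
    x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
    rng = np.random.default_rng(1)
    print("exact log p(x):", exact_log_px(x))
    for K in (1, 5, 50, 5000):
        estimate, _ = is_log_px(x, K, rng)
        print(f"K={K:5d}  estimated log p(x): {estimate:.4f}")
```

The average of the weights is an unbiased estimate of $p({\boldsymbol x})$, but its logarithm is biased downward on average for small $K$; the normalized weights returned by `is_log_px` are the quantities that would reweight per-sample gradients in a reweighted update.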

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: (A) Final log-likelihood estimate w.r.t. the number of samples used during training. (B) $L_2$-norm of the bias and standard deviation of the low-sample estimated $p_\theta$ gradient relative to a high-sample ($K=5{,}000$) estimate.
  • Figure 2: (A) Final log-likelihood estimate w.r.t. the number of test samples used. (B) Samples from the SBN/SBN 10-200-200 generative model. (C) Samples from the NADE/NADE 250 generative model. (We show the probabilities from which each pixel is sampled.)
  • Figure 3: CalTech 101 Silhouettes: (A) Random selection of training data points. (B) Random samples from the SBN/SBN 10-50-100-300 generative network. (C) Random samples from the NADE-150 generative network. (We show the probabilities from which each pixel is sampled.)
  • Figure 4: Learning curves for various MNIST experiments.
  • Figure 5: Bias and standard deviation of the low-sample estimate of $\log(p(x))$ (bootstrapping with $K=5{,}000$ primary samples from an SBN/SBN 10-200-200 network trained on MNIST).
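
The bootstrapping mentioned in the Figure 5 caption can be sketched as follows. This is only an illustration of one plausible procedure, not the authors' code: it assumes that a pool of primary importance weights is resampled with replacement into subsets of size `k`, and that the full-pool estimate serves as the reference. The log-weights are synthetic (Gaussian), and the helper names `logmeanexp` and `bootstrap_bias_std` are hypothetical.

```python
import numpy as np

def logmeanexp(a, axis=-1):
    # Numerically stable log of the mean of exp(a) along `axis`.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def bootstrap_bias_std(log_w, k, n_boot, rng):
    # Reference value: the estimate built from all primary samples.
    reference = logmeanexp(log_w)
    # Resample n_boot subsets of size k (with replacement) from the primary pool.
    idx = rng.integers(0, log_w.shape[0], size=(n_boot, k))
    estimates = logmeanexp(log_w[idx], axis=1)
    # Bias relative to the reference, and spread of the k-sample estimates.
    return estimates.mean() - reference, estimates.std()

# Synthetic stand-in for the log importance weights of 5,000 primary samples.
log_w = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=5000)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for k in (1, 10, 100):
        bias, std = bootstrap_bias_std(log_w, k, n_boot=2000, rng=rng)
        print(f"k={k:4d}  bias={bias:+.3f}  std={std:.3f}")
```

With these synthetic weights, both the magnitude of the bias and the standard deviation shrink as `k` increases, which is the qualitative behaviour the caption describes for the low-sample estimate.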