MADE: Masked Autoencoder for Distribution Estimation

Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle

TL;DR

MADE transforms autoencoders into tractable distribution estimators by enforcing autoregressive conditioning via masks on connections, so that each output models $p(x_d | x_{<d})$ and $p(x)$ is the product of these conditionals. The approach extends to deep networks (Deep MADE) and supports order- and connectivity-agnostic training, and it evaluates all conditionals in a single forward pass. Empirical results on UCI binary datasets and binarized MNIST show competitive log-likelihoods with substantial speed advantages over prior autoregressive methods. The method’s GPU-friendly design and tunable masking strategies make it a practical alternative for high-dimensional distribution estimation.

Abstract

There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
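The masking idea in the abstract can be illustrated with a minimal, dependency-free sketch of a single-hidden-layer masked autoencoder over binary inputs. It assumes the natural ordering $x_1, \dots, x_D$, a ReLU hidden layer, sigmoid outputs, and randomly drawn (untrained) weights; the names `build_masks` and `log_prob` are hypothetical and not taken from the authors' code. Because each output is reachable only from earlier inputs, the product of the output conditionals should sum to 1 over all $2^D$ binary vectors, which the last lines check.

```python
import itertools
import math
import random


def build_masks(D, K, rng):
    """Masks for a one-hidden-layer MADE under the natural ordering x_1..x_D."""
    # m[k] in {1, ..., D-1}: how many leading inputs hidden unit k may see.
    m = [rng.randint(1, D - 1) for _ in range(K)]
    # Input -> hidden: unit k is connected to input d (1-based) iff m[k] >= d.
    mask_w = [[1.0 if m[k] >= d else 0.0 for d in range(1, D + 1)]
              for k in range(K)]
    # Hidden -> output: output d is connected to unit k iff d > m[k] (strict),
    # so output d can only be reached from inputs x_1..x_{d-1}.
    mask_v = [[1.0 if d > m[k] else 0.0 for k in range(K)]
              for d in range(1, D + 1)]
    return mask_w, mask_v


def log_prob(x, params, masks):
    """log p(x) as a sum of Bernoulli log-conditionals, in one forward pass."""
    W, b, V, c = params
    mask_w, mask_v = masks
    K, D = len(W), len(x)
    # Hidden layer: ReLU(b + (W * mask_w) x). ReLU is an assumption of this sketch.
    h = [max(0.0, b[k] + sum(W[k][d] * mask_w[k][d] * x[d] for d in range(D)))
         for k in range(K)]
    total = 0.0
    for d in range(D):
        a = c[d] + sum(V[d][k] * mask_v[d][k] * h[k] for k in range(K))
        p = 1.0 / (1.0 + math.exp(-a))  # p(x_d = 1 | x_<d)
        total += math.log(p if x[d] == 1 else 1.0 - p)
    return total


D, K = 4, 8  # illustrative sizes only
rng = random.Random(0)
masks = build_masks(D, K, rng)
params = (
    [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(K)],  # W
    [rng.gauss(0, 0.5) for _ in range(K)],                      # b
    [[rng.gauss(0, 0.5) for _ in range(K)] for _ in range(D)],  # V
    [rng.gauss(0, 0.5) for _ in range(D)],                      # c
)

# A valid autoregressive model is normalized: the probabilities of all
# 2^D binary vectors sum to 1 (up to floating-point error).
total_mass = sum(math.exp(log_prob(x, params, masks))
                 for x in itertools.product([0, 1], repeat=D))
print(total_mass)
```

Plain Python lists keep the sketch self-contained; a practical implementation would apply the same masks elementwise to weight matrices in a vectorized array library, which is what makes the single-pass evaluation fast.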

Paper Structure

This paper contains 12 sections, 12 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Left: Conventional three hidden layer autoencoder. Input in the bottom is passed through fully connected layers and point-wise nonlinearities. In the final top layer, a reconstruction specified as a probability distribution over inputs is produced. As this distribution depends on the input itself, a standard autoencoder cannot predict or sample new data. Right: MADE. The network has the same structure as the autoencoder, but a set of connections is removed such that each input unit is only predicted from the previous ones, using multiplicative binary masks (${\bf M}^{{\bf W}^1}$, ${\bf M}^{{\bf W}^2}$, ${\bf M}^{{\bf V}}$). In this example, the ordering of the input is changed from 1,2,3 to 3,1,2. This change is explained in the section on order-agnostic training, but is not necessary for understanding the basic principle. The numbers in the hidden units indicate the maximum number of inputs on which the $k^{\rm th}$ unit of layer $l$ depends. The masks are constructed based on these numbers (see the hidden-layer and output mask equations for deep networks). These masks ensure that MADE satisfies the autoregressive property, allowing it to form a probabilistic model, in this example $p({\bf x}) = p(x_2)\, p(x_3|x_2)\, p(x_1|x_2,x_3)$. Connections in light gray correspond to paths that depend only on 1 input, while the dark gray connections depend on 2 inputs.
  • Figure 2: Impact of the number of masks used with a single hidden layer, 500 hidden units network, on binarized MNIST.
  • Figure 3: Left: Samples from a 2 hidden layer MADE. Right: Nearest neighbour in binarized MNIST.
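The mask construction described in the Figure 1 caption can be sketched for the three-input ordering 3,1,2. The sketch assumes the MADE rule for deep networks: a unit numbered $m$ connects to a unit in the layer below numbered $m'$ when $m \ge m'$, and an output with order number $d$ connects to a last-hidden-layer unit numbered $m$ when $d > m$. The hidden-unit numbers in `m1` and `m2` are illustrative choices, not values read from the figure, and the helpers `mask` and `matmul` are hypothetical names.

```python
def mask(m_out, m_in, strict=False):
    """Binary mask with entry [i][j] = 1 iff unit i may connect to unit j below."""
    return [[1 if (mo > mi if strict else mo >= mi) else 0 for mi in m_in]
            for mo in m_out]


def matmul(A, B):
    """Plain-Python matrix product, used here to count connection paths."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Order numbers of the inputs x_1, x_2, x_3 (the ordering 3,1,2 from the caption).
m0 = [3, 1, 2]
# Illustrative hidden-unit numbers for two hidden layers (assumed, not from the figure).
m1 = [2, 1, 2, 2]
m2 = [1, 2, 2, 1]

mask_w1 = mask(m1, m0)              # input    -> hidden 1 (non-strict)
mask_w2 = mask(m2, m1)              # hidden 1 -> hidden 2 (non-strict)
mask_v = mask(m0, m2, strict=True)  # hidden 2 -> output   (strict)

# paths[i][j] counts the unmasked paths from input x_{j+1} to output x_{i+1}.
paths = matmul(mask_v, matmul(mask_w2, mask_w1))

# For each output, list the inputs it can depend on (1-based indices).
depends_on = {i + 1: [j + 1 for j, n in enumerate(row) if n > 0]
              for i, row in enumerate(paths)}
print(depends_on)  # {1: [2, 3], 2: [], 3: [2]}
```

With these numbers, the output for $x_2$ depends on no input, the output for $x_3$ depends only on $x_2$, and the output for $x_1$ depends on $x_2$ and $x_3$, which matches the factorization $p(x_2)\, p(x_3|x_2)\, p(x_1|x_2,x_3)$ given in the caption.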