Neural Autoregressive Distribution Estimation

Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, Hugo Larochelle

TL;DR

This work introduces Neural Autoregressive Distribution Estimation (NADE), a tractable neural autoregressive density model that factors p(x) into conditionals with shared parameters for efficient learning. It extends NADE to real-valued data (RNADE), develops orderless/deep variants (DeepNADE) with ensemble techniques, and adapts the approach to 2D data via ConvNADE to exploit image structure. Empirical results across binary vectors, binarized images, and real-valued domains show competitive or superior performance to traditional directed/undirected models, with ensembles and convolutional architectures providing substantial gains. The framework unifies tractable likelihoods, exact sampling, and flexible conditioning, enabling robust unsupervised density estimation across diverse data types.

Abstract

We present Neural Autoregressive Distribution Estimation (NADE) models, which are neural network architectures applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight-sharing scheme inspired by restricted Boltzmann machines to yield an estimator that is both tractable and has good generalization performance. We discuss how they achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE.
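To make the product-rule factorization and the weight sharing concrete, here is a minimal sketch of a binary NADE in plain Python. It is an illustration under assumptions, not the authors' implementation: the parameter names `W`, `V`, `b`, `c`, the sigmoid hidden units, the random initialization, and the helper names `init_params`, `nade_log_prob`, and `nade_sample` are all chosen here for the example. The point it shows is that the hidden pre-activation for dimension d+1 can be obtained from the one for dimension d by adding a single column of the shared matrix `W`, so the full log-likelihood costs O(DH) rather than O(D²H).

```python
import math
import random
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(D, H, seed=0):
    """Random toy parameters (assumed shapes: W is H x D, V is D x H)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.5) for _ in range(D)] for _ in range(H)]
    V = [[rng.gauss(0.0, 0.5) for _ in range(H)] for _ in range(D)]
    b = [0.0] * D  # output biases
    c = [0.0] * H  # hidden biases
    return W, V, b, c


def _conditional(d, a, V, b):
    """p(x_d = 1 | x_<d), given the running hidden pre-activation a."""
    h = [sigmoid(a_k) for a_k in a]
    return sigmoid(b[d] + sum(v * h_k for v, h_k in zip(V[d], h)))


def nade_log_prob(x, params):
    """log p(x) = sum_d log p(x_d | x_<d) for a binary vector x."""
    W, V, b, c = params
    a = list(c)  # pre-activation when nothing has been observed yet
    log_p = 0.0
    for d, x_d in enumerate(x):
        p = _conditional(d, a, V, b)
        log_p += math.log(p if x_d == 1 else 1.0 - p)
        if x_d == 1:
            # Weight sharing: reuse a and add only column d of W.
            for k in range(len(a)):
                a[k] += W[k][d]
    return log_p


def nade_sample(params, rng):
    """Ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    W, V, b, c = params
    a = list(c)
    x = []
    for d in range(len(b)):
        x_d = 1 if rng.random() < _conditional(d, a, V, b) else 0
        x.append(x_d)
        if x_d == 1:
            for k in range(len(a)):
                a[k] += W[k][d]
    return x


if __name__ == "__main__":
    params = init_params(D=4, H=3, seed=1)
    # Tractability check: probabilities of all 2^D vectors sum to one.
    total = sum(math.exp(nade_log_prob(list(x), params))
                for x in product([0, 1], repeat=4))
    print("sum of p(x) over all x:", total)
    print("one sample:", nade_sample(params, random.Random(0)))
```

Because every conditional is a properly normalized Bernoulli, the joint is normalized by construction, which is what makes both the exact likelihood and exact sampling available without approximation.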

Paper Structure

This paper contains 17 sections, 26 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Illustration of a NADE model. In this example, in the input layer, units with value 0 are shown in black while units with value 1 are shown in white. The dashed border represents a layer pre-activation. The outputs $\hat{\boldsymbol{x}}_O$ give predictive probabilities for each dimension of a vector $\boldsymbol{x}_O$, given elements earlier in some ordering. There is no path of connections between an output and the value being predicted, or elements of $\boldsymbol{x}_O$ later in the ordering. Arrows connected together correspond to connections with shared (tied) parameters.
  • Figure 2: Illustration of a DeepNADE model with two hidden layers. The dashed border represents a layer pre-activation. A mask $\boldsymbol{m}_{o_{<d}}$ specifies a subset of variables to condition on. A conditional or predictive probability of the remaining variables is given in the final layer. Note that the output units with a corresponding input mask of value 1 (shown with dotted contour) are effectively not involved in DeepNADE's training loss (Equation \ref{eq:deepnade:binary:loss}).
  • Figure 3: Illustration of a ConvNADE that combines a convolutional neural network with three hidden layers and a fully connected feed-forward neural network with two hidden layers. The dashed border represents a layer pre-activation. Units with a dotted contour are not valid conditionals since they depend on themselves, i.e., they were given in the input.
  • Figure 4: (Left): samples from NADE trained on binarized MNIST. (Right): probabilities from which each pixel was sampled. Ancestral sampling was used with the same fixed ordering used during training.
  • Figure 5: Network architectures for binarized MNIST. (a) ConvNADE with 8 convolutional layers (depicted in blue). For each layer, the number before the "@" symbol is the number of feature maps, followed by the filter size; the type of convolution is specified in parentheses. (b) The same ConvNADE combined with a DeepNADE consisting of three fully connected layers of 500, 500, and 784 units, respectively.
  • ...and 5 more figures
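
The masking scheme described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: `sample_mask`, `masked_loss`, and the stand-in predictor `uniform_predict` are hypothetical names, the network is replaced by an arbitrary callable, and the rescaling factor D / (number of predicted dimensions) is an assumption about how the order-agnostic loss is normalized. What the sketch does show is the part stated in the caption: the network receives the masked input together with the mask, and only outputs whose mask value is 0 contribute to the loss.

```python
import math
import random


def sample_mask(D, rng):
    """Pick how many dimensions are observed (0..D-1), then which ones.

    Mask value 1 means "given as input"; 0 means "to be predicted".
    """
    n_given = rng.randrange(D)
    given = set(rng.sample(range(D), n_given))
    return [1 if i in given else 0 for i in range(D)]


def masked_loss(x, mask, predict):
    """Negative log-likelihood of the unobserved dimensions of binary x.

    predict: hypothetical callable mapping the 2D-long input
             (masked x concatenated with the mask) to D probabilities.
    """
    D = len(x)
    n_pred = D - sum(mask)
    if n_pred == 0:
        raise ValueError("mask must leave at least one dimension to predict")
    # Hide the values that are not conditioned on.
    masked_x = [x_i * m_i for x_i, m_i in zip(x, mask)]
    probs = predict(masked_x + list(mask))
    nll = 0.0
    for x_i, m_i, p_i in zip(x, mask, probs):
        if m_i == 0:  # outputs with mask value 1 are skipped
            nll -= math.log(p_i if x_i == 1 else 1.0 - p_i)
    # Assumed rescaling so the loss is comparable across mask sizes.
    return D / n_pred * nll


def uniform_predict(inputs):
    """Stand-in for a trained network: always predicts 0.5."""
    D = len(inputs) // 2
    return [0.5] * D


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1, 0, 1, 1, 0]
    mask = sample_mask(len(x), rng)
    print("mask:", mask)
    print("loss:", masked_loss(x, mask, uniform_predict))
```

Sampling a fresh mask for every training example is what lets a single set of parameters serve many orderings, which in turn makes ensembles over orderings cheap to form at test time.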