A Deep and Tractable Density Estimator

Benigno Uria, Iain Murray, Hugo Larochelle

TL;DR

The paper addresses a limitation of NADE and RNADE: both require a fixed ordering of the data dimensions. It introduces an order-agnostic training procedure that shares parameters across all possible orderings. This enables training deep autoregressive density estimators with linear-cost scaling and makes ensembles available on the fly, by averaging over orderings without extra training. Empirically, order-agnostic NADEs are competitive with fixed-order models on binary and real-valued datasets, ensembles provide consistent gains, and deep architectures deliver state-of-the-art results on tasks such as BSDS300 image patches and binarized MNIST. Because the variables of interest can be placed wherever is most convenient in the ordering, the approach keeps marginalization and sampling exact and tractable, without resorting to MCMC methods.

Abstract

The Neural Autoregressive Distribution Estimator (NADE) and its real-valued version RNADE are competitive density models of multidimensional data across a variety of domains. These models use a fixed, arbitrary ordering of the data dimensions. One can easily condition on variables at the beginning of the ordering and marginalize out variables at the end of the ordering; however, other inference tasks require approximate inference. In this work we introduce an efficient procedure to simultaneously train a NADE model for each possible ordering of the variables, by sharing parameters across all these models. We can thus use the most convenient model for each inference task at hand, and ensembles of such models with different orderings are immediately available. Moreover, unlike the original NADE, our training procedure scales to deep models. Empirically, ensembles of Deep NADE models obtain state-of-the-art density estimation performance.
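
The idea of one set of parameters serving every ordering can be illustrated with a small sketch. It is not the authors' implementation: the single ReLU hidden layer, the toy sizes, the random (untrained) weights, and all function names are assumptions made for illustration. The network receives the observed values together with a binary mask saying which dimensions are observed, and outputs a conditional for every dimension. A fixed ordering then just determines the sequence in which the mask is filled in, and an ensemble averages the probabilities given by several orderings. The `training_loss` function shows one plausible stochastic objective (random ordering, random split point, rescaled sum over the unobserved dimensions); treat it as an assumption rather than the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 16  # toy sizes (assumption): 6 binary inputs, 16 hidden units

# Shared parameters, reused for every ordering (random here, i.e. untrained).
W = rng.normal(scale=0.1, size=(H, 2 * D))  # input is [masked x, mask]
b = np.zeros(H)
V = rng.normal(scale=0.1, size=(D, H))
c = np.zeros(D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conditionals(x, mask):
    """Return p(x_d = 1 | observed entries of x) for every dimension d."""
    h = np.maximum(0.0, W @ np.concatenate([x * mask, mask]) + b)
    return sigmoid(V @ h + c)

def log_prob(x, ordering):
    """log p(x) under the autoregressive model defined by one ordering."""
    mask = np.zeros(D)
    total = 0.0
    for d in ordering:
        p = conditionals(x, mask)[d]
        total += np.log(p if x[d] == 1 else 1.0 - p)
        mask[d] = 1.0  # x_d counts as observed from now on
    return total

def ensemble_log_prob(x, orderings):
    """Log of the average probability assigned by several orderings."""
    lps = np.array([log_prob(x, o) for o in orderings])
    return np.logaddexp.reduce(lps) - np.log(len(lps))

def training_loss(x):
    """One-sample loss estimate: random ordering, random number of observed dims."""
    o = rng.permutation(D)
    k = rng.integers(0, D)  # k observed dimensions, 0 <= k < D
    mask = np.zeros(D)
    mask[o[:k]] = 1.0
    p = conditionals(x, mask)
    ll = x * np.log(p) + (1 - x) * np.log(1 - p)
    # Sum over the unobserved dims, rescaled so it estimates a full NLL.
    return -(D / (D - k)) * np.sum(ll * (1 - mask))
```

Since each conditional depends only on the entries already marked as observed, every ordering defines a properly normalized distribution, and so does their average. No retraining is needed to switch orderings because `W`, `b`, `V`, and `c` never change.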

Paper Structure

This paper contains 10 sections, 9 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Test-set average log-likelihood per datapoint for NADEs trained with our new procedure on binarized images of digits.
  • Figure 2: Top: 50 examples from binarized-MNIST ordered by decreasing likelihood under a 2-hidden-layer NADE. Bottom: 50 samples from a 2-hidden-layer NADE, also ordered by decreasing likelihood under the model.
  • Figure 3: Example of marginalization and sampling. The first column shows five examples from the test set of the MNIST dataset. The second column shows the density of these examples when a random 10 by 10 pixel region is marginalized out. The right-most five columns show samples for the hollowed-out region. Both tasks can be done easily with a NADE in which the pixels to marginalize are at the end of the ordering.
  • Figure 4: Top: 50 receptive fields (columns of $\boldsymbol{W}$) with the biggest L2 norm. Bottom: Receptive fields associated with the input masks.
  • Figure 5: Top: 50 examples of $8\!\times\!8$ patches in the BSDS300 dataset ordered by decreasing likelihood under a 6-hidden-layer NADE. Bottom: 50 samples from a 6-hidden-layer NADE.
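
The marginalization and sampling described for Figure 3 can be sketched on a toy problem. Everything below is a hypothetical stand-in rather than the model from the paper: `cond_prob` is a small logistic conditional with random weights, and there are only five binary variables instead of MNIST pixels. The point is the mechanism. If the variables to marginalize sit at the end of the ordering, the marginal of the observed variables is the product of the leading conditionals, and the hidden variables can be filled in by ancestral sampling.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
D = 5
A = rng.normal(size=(D, D))  # toy stand-in weights, not a trained NADE
bias = rng.normal(size=D)

def cond_prob(x, mask, d):
    """Toy p(x_d = 1 | observed entries of x); mask marks what is observed."""
    return 1.0 / (1.0 + np.exp(-(bias[d] + A[d] @ (x * mask))))

def log_marginal(x, observed):
    """log p(x_observed), using an ordering that visits `observed` first.

    The remaining variables are at the end of the ordering; their
    conditionals each sum to one, so the product is simply stopped early.
    """
    mask = np.zeros(D)
    total = 0.0
    for d in observed:
        p = cond_prob(x, mask, d)
        total += np.log(p if x[d] == 1 else 1.0 - p)
        mask[d] = 1.0
    return total

def sample_hidden(x, observed, hidden, rng):
    """Fill in the hidden entries by ancestral sampling given the observed ones."""
    x = x.copy()
    mask = np.zeros(D)
    mask[list(observed)] = 1.0
    for d in hidden:
        x[d] = float(rng.random() < cond_prob(x, mask, d))
        mask[d] = 1.0  # the sampled value conditions the next draw
    return x
```

For example, `log_marginal(x, [0, 2, 4])` ignores whatever values `x` holds at indices 1 and 3, and `sample_hidden(x, [0, 2, 4], [3, 1], rng)` returns a copy of `x` with those two entries redrawn.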