Contrastive Entropy Bounds for Density and Conditional Density Decomposition

Bo Hu, Jose C. Principe

TL;DR

The paper develops contrastive entropy bounds for density and conditional-density decomposition under a Bayesian Gaussian framework, introducing two main approaches: (i) a nuclear-norm objective for MDNs and (ii) a Hilbert-space inner-product bound for both MDNs and an encoder-mixture-decoder. It analyzes these bounds through Gaussian Gram matrices and Nyström-style decompositions, showing that the nuclear-norm and normalized inner-product objectives yield more diverse samples than traditional KL-based training. It further introduces an encoder-mixture-decoder architecture to enable one-to-many mappings, with theoretical bounds on conditional densities and practical algorithms to train and evaluate them. Experiments on toy datasets and image datasets (MNIST/CelebA) demonstrate improved generation diversity and better alignment with data distributions, while providing a framework to quantify bound tightness and the relation to ELBO-like objectives.

Abstract

This paper studies the interpretability of neural network features from a Bayesian Gaussian view, where optimizing a cost is reaching a probabilistic bound; learning a model approximates a density that makes the bound tight and the cost optimal, often with a Gaussian mixture density. The two examples are Mixture Density Networks (MDNs) using the bound for the marginal and autoencoders using the conditional bound. It is a known result, not only for autoencoders, that minimizing the error between inputs and outputs maximizes the dependence between inputs and the middle. We use Hilbert space and decomposition to address cases where a multiple-output network produces multiple centers defining a Gaussian mixture. Our first finding is that an autoencoder's objective is equivalent to maximizing the trace of a Gaussian operator, the sum of eigenvalues under bases orthonormal w.r.t. the data and model distributions. This suggests that, when a one-to-one correspondence as needed in autoencoders is unnecessary, we can instead maximize the nuclear norm of this operator, the sum of singular values, to maximize overall rank rather than trace. Thus the trace of a Gaussian operator can be used to train autoencoders, and its nuclear norm can be used as divergence to train MDNs. Our second test uses inner products and norms in a Hilbert space to define bounds and costs. Such bounds often have an extra norm compared to KL-based bounds, which increases sample diversity and prevents the trivial solution where a multiple-output network produces the same constant, at the cost of requiring a sample batch to estimate and optimize. We propose an encoder-mixture-decoder architecture whose decoder is multiple-output, producing multiple centers per sample, potentially tightening the bound. Assuming the data are small-variance Gaussian mixtures, this upper bound can be tracked and analyzed quantitatively.
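The contrast the abstract draws between the trace and the nuclear norm can be illustrated with a small NumPy sketch. It is not the paper's estimator: the unnormalized Gaussian cross Gram matrix, the bandwidth `sigma`, the toy data, and all function names are assumptions made for illustration. The trace rewards center i sitting on sample i, while the sum of singular values is unchanged when the same centers are assigned in a different order, and it drops when all centers collapse to one constant.

```python
import numpy as np

def gaussian_cross_gram(x, c, sigma=1.0):
    """K[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2)) for samples x and centers c."""
    sq_dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def trace_objective(k):
    # Sum of diagonal entries: only the pairing (sample i, center i) counts.
    return float(np.trace(k))

def nuclear_norm_objective(k):
    # Sum of singular values: unaffected by reordering the centers.
    return float(np.linalg.svd(k, compute_uv=False).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2)) * 3.0            # hypothetical data samples

k_paired = gaussian_cross_gram(x, x)                          # center i equals sample i
k_shuffled = gaussian_cross_gram(x, np.roll(x, 1, axis=0))    # same centers, shifted pairing
k_collapsed = gaussian_cross_gram(x, np.tile(x.mean(axis=0), (6, 1)))  # one constant center

for name, k in [("paired", k_paired), ("shuffled", k_shuffled), ("collapsed", k_collapsed)]:
    print(f"{name:9s} trace={trace_objective(k):.3f} nuclear={nuclear_norm_objective(k):.3f}")
```

In this toy setting the shuffled case keeps the nuclear norm of the paired case but loses trace, which matches the abstract's motivation for using the nuclear norm when a one-to-one correspondence is unnecessary.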

Paper Structure

This paper contains 32 sections, 3 theorems, 66 equations, 22 figures, 2 tables, 4 algorithms.

Key Result

Corollary 3.1

The norm defined with the polynomial of any order of a Gaussian mixture, regardless of discrete or continuous prior, has a closed form.
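For the order-two case, the closed form follows from the standard identity $\int \mathcal{N}(x; \mu_1, \sigma_1^2)\,\mathcal{N}(x; \mu_2, \sigma_2^2)\,dx = \mathcal{N}(\mu_1 - \mu_2; 0, \sigma_1^2 + \sigma_2^2)$, applied to every pair of mixture components. The sketch below checks this numerically for two one-dimensional discrete mixtures and forms a normalized inner product from it. The mixture parameters, the restriction to one dimension and order two, and all function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density N(x; mean, var); broadcasts over arrays."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_inner_product(p, q):
    """Closed-form integral of p(x) q(x) dx for mixtures given as (weights, means, variances)."""
    w_p, mu_p, var_p = p
    w_q, mu_q, var_q = q
    # Pairwise term: N(mu_i - mu_j; 0, var_i + var_j).
    pair = gaussian_pdf(mu_p[:, None], mu_q[None, :], var_p[:, None] + var_q[None, :])
    return float(w_p @ pair @ w_q)

def normalized_inner_product(p, q):
    """Cosine similarity between two mixture densities in L2."""
    return mixture_inner_product(p, q) / np.sqrt(
        mixture_inner_product(p, p) * mixture_inner_product(q, q)
    )

def mixture_density(x, mixture):
    w, mu, var = mixture
    return (w[None, :] * gaussian_pdf(x[:, None], mu[None, :], var[None, :])).sum(axis=1)

# Hypothetical mixtures used only for the check.
p = (np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([0.5, 0.3]))
q = (np.array([0.7, 0.3]), np.array([0.0, 2.5]), np.array([0.2, 0.4]))

# Brute-force Riemann sum on a fine grid, to compare against the closed form.
grid = np.linspace(-20.0, 20.0, 200001)
dx = grid[1] - grid[0]
numeric = float((mixture_density(grid, p) * mixture_density(grid, q)).sum() * dx)
closed = mixture_inner_product(p, q)

print(numeric, closed, normalized_inner_product(p, q))
```

Because every term is a Gaussian evaluated at a difference of means, no numerical integration is needed once the component parameters are known.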

Figures (22)

  • Figure 2: A discrete equivalent of an autoencoder is to feed a diagonalized density vector of $p(X)$ through an encoder and a decoder matrix, both Markovian. The mean-squared error is the average of the errors between a sample's reconstruction and the sample itself, so it is the trace of the final matrix after the matrix product. As neural networks are deterministic, the Markov matrices are very sparse, for example with just one positive entry in each row.
  • Figure 3: We are only allowed to independently sample $X$ from the data distribution $p(X)$ and $Y$ from a chosen prior distribution $q(Y)$. The goal is to find deterministic mappings between them. The standard procedure is to create an encoder for $p(X,Y) = p(X)p(Y|X)$ and a decoder for $q(X,Y) = q(Y)q(X|Y)$, then minimize their divergence.
  • Figure 4: Side-by-side comparisons of generation quality between the KL-based cost, the normalized inner product between pdfs in the Hilbert space, and the nuclear norm (the sum of singular values) of the Gaussian cross Gram matrix $\boldsymbol{\mathit{K}}_{XX'}$, for the datasets MNIST and CIFAR10. The nuclear norm works best, followed by the normalized inner product, then the KL-based cost. We observed that the normalization improves sample diversity.
  • Figure 5: Learning curves of KL-based costs (KL), normalized inner products (NIP), and the nuclear norm (NucNorm) on datasets MNIST and CelebA.
  • Figure 6: ELBO variation of the nuclear norm cost. We project MNIST into a 2D feature space while also minimizing the divergence between the encoder joint $p(X,Y) = p(X)p(Y|X)$ and the decoder joint $q(X,Y) = q(Y)q(X|Y)$. We tried four different priors and found that the features are well separated and semantically meaningful.
  • ...and 17 more figures
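
The discrete picture in the Figure 2 caption can be sketched with small matrices. The three-state distribution and the one-hot encoder and decoder below are hypothetical values chosen for illustration. Under our reading of the caption, the trace of the product is the probability mass that returns to its original state after encoding and decoding.

```python
import numpy as np

# Hypothetical data distribution over three discrete states.
p_x = np.array([0.5, 0.3, 0.2])

# Deterministic (one-hot) Markov matrices: each row has one positive entry.
encoder = np.array([[1.0, 0.0],      # state 0 -> code 0
                    [0.0, 1.0],      # state 1 -> code 1
                    [0.0, 1.0]])     # state 2 -> code 1 (shares a code with state 1)
decoder = np.array([[1.0, 0.0, 0.0],   # code 0 -> state 0
                    [0.0, 1.0, 0.0]])  # code 1 -> state 1

# round_trip[x, x'] = p(x) * sum_y p(y | x) q(x' | y)
round_trip = np.diag(p_x) @ encoder @ decoder

# Diagonal mass: how often a state is reconstructed as itself.
recovered_mass = float(np.trace(round_trip))
print(round_trip)
print(recovered_mass)  # 0.8: state 2 (mass 0.2) is decoded as state 1
```

With only two codes for three states, one state cannot be recovered, so the trace falls short of 1 by exactly that state's probability.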

Theorems & Definitions (3)

  • Corollary 3.1
  • Lemma 4
  • Proposition 6