Contrastive Entropy Bounds for Density and Conditional Density Decomposition
Bo Hu, Jose C. Principe
TL;DR
The paper develops contrastive entropy bounds for density and conditional-density decomposition under a Bayesian Gaussian framework, introducing two main approaches: (i) a nuclear-norm objective for MDNs and (ii) a Hilbert-space inner-product bound for both MDNs and an encoder-mixture-decoder architecture. It analyzes these bounds through Gaussian Gram matrices and Nyström-style decompositions, showing that the nuclear-norm and normalized inner-product bounds yield tighter bounds and more diverse samples than traditional KL-based training. The encoder-mixture-decoder architecture enables one-to-many mappings, and comes with theoretical bounds on conditional densities and practical algorithms to train and evaluate them. Experiments on toy datasets and image datasets (MNIST/CelebA) demonstrate improved generation diversity and better alignment with data distributions, while providing a framework to quantify bound tightness and the relation to ELBO-like objectives.
Abstract
This paper studies the interpretability of neural network features from a Bayesian Gaussian view, where optimizing a cost amounts to reaching a probabilistic bound; learning a model approximates a density that makes the bound tight and the cost optimal, often with a Gaussian mixture density. The two examples are Mixture Density Networks (MDNs), which use the bound for the marginal, and autoencoders, which use the conditional bound. It is a known result, not only for autoencoders, that minimizing the error between inputs and outputs maximizes the dependence between inputs and the middle layer. We use Hilbert space and decomposition to address cases where a multiple-output network produces multiple centers defining a Gaussian mixture. Our first finding is that an autoencoder's objective is equivalent to maximizing the trace of a Gaussian operator, the sum of eigenvalues under bases orthonormal w.r.t. the data and model distributions. This suggests that, when a one-to-one correspondence as needed in autoencoders is unnecessary, we can instead maximize the nuclear norm of this operator, the sum of singular values, to maximize overall rank rather than trace. Thus the trace of a Gaussian operator can be used to train autoencoders, and its nuclear norm can be used as a divergence to train MDNs. Our second finding uses inner products and norms in a Hilbert space to define bounds and costs. Such bounds often have an extra norm term compared to KL-based bounds, which increases sample diversity and prevents the trivial solution where a multiple-output network produces the same constant, at the cost of requiring a sample batch to estimate and optimize. We propose an encoder-mixture-decoder architecture whose decoder is multiple-output, producing multiple centers per sample, potentially tightening the bound. Assuming the data are small-variance Gaussian mixtures, this upper bound can be tracked and analyzed quantitatively.
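
The contrast between the trace and the nuclear norm can be illustrated with a small numerical sketch. It is not the paper's operator: the abstract describes an operator expressed in bases orthonormal w.r.t. the data and model distributions, whereas the code below uses a plain, unnormalized Gaussian cross Gram matrix between a few data points and model centers. The function name `gaussian_gram`, the bandwidth `sigma`, and the toy points are all assumptions made for illustration. The sketch only shows that the trace rewards a one-to-one pairing between samples and centers, while the nuclear norm does not change when the centers are reordered.

```python
import numpy as np

def gaussian_gram(x, c, sigma=1.0):
    """Cross Gram matrix K[i, j] = exp(-||x_i - c_j||^2 / (2 * sigma^2)).

    Hypothetical helper for illustration; not taken from the paper.
    """
    sq_dist = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy data: six well-separated points on a line (assumed, for illustration).
x = np.stack([2.0 * np.arange(6.0), np.zeros(6)], axis=1)

# Centers paired one-to-one with the data, as an autoencoder would produce.
c_paired = x + 0.05

# The same centers in a different order: the one-to-one pairing is broken,
# but the set of centers (the mixture) is unchanged.
c_shuffled = np.roll(c_paired, 1, axis=0)

K_paired = gaussian_gram(x, c_paired)
K_shuffled = gaussian_gram(x, c_shuffled)

# Trace: sum of diagonal entries, sensitive to which center matches which sample.
trace_paired = np.trace(K_paired)
trace_shuffled = np.trace(K_shuffled)

# Nuclear norm: sum of singular values, invariant to reordering the columns.
nuc_paired = np.linalg.norm(K_paired, ord="nuc")
nuc_shuffled = np.linalg.norm(K_shuffled, ord="nuc")

print("trace  (paired, shuffled):", trace_paired, trace_shuffled)
print("nuclear(paired, shuffled):", nuc_paired, nuc_shuffled)
```

In this toy setting the trace drops once the pairing is shuffled, while the nuclear norm stays the same, which is consistent with using the trace when a one-to-one correspondence is required and the nuclear norm when only the overall match between two sets of centers matters.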
