Comparing the information content of probabilistic representation spaces

Kieran A. Murphy; Sam Dillavou; Dani S. Bassett

TL;DR

This work addresses the problem of comparing probabilistic representation spaces by embedding them in an information-theoretic framework. It generalizes classic clustering-based measures (NMI and VI) to soft, distributional embeddings through a replacement of entropy terms with mutual informations between copies of the space, enabling comparisons across discrete and continuous representations. A fast Bhattacharyya fingerprint estimator provides scalable estimation of the information content, and an OPTICS-based procedure identifies consistently learned information fragments across model ensembles, with a differentiable formulation enabling model fusion. The experiments demonstrate that the proposed measures reveal stable information content across datasets and methods, uncover structured fragmentation of information in latent channels, and enable synthesis of weak learners into coherent representations, highlighting the practical impact for disentanglement evaluation and representation-alignment tasks.
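As a reading aid, the substitution described above can be written out. The hard-clustering definitions are the standard ones. The soft versions are one reading of this summary, not formulas quoted from the paper: $U'$ is assumed to denote a second sample from the same space for the same data point (conditionally independent of $U$ given $X$), and the geometric-mean normalization of NMI is an assumption, since other normalizations are also in common use.

```latex
\begin{align}
\text{hard clusterings:}\quad
\mathrm{VI}(U,V) &= H(U) + H(V) - 2\,I(U;V), &
\mathrm{NMI}(U,V) &= \frac{I(U;V)}{\sqrt{H(U)\,H(V)}} \\
\text{soft (assumed form):}\quad
\mathrm{VI}(U,V) &= I(U;U') + I(V;V') - 2\,I(U;V), &
\mathrm{NMI}(U,V) &= \frac{I(U;V)}{\sqrt{I(U;U')\,I(V;V')}}
\end{align}
```

For a hard clustering, $U$ is a deterministic function of $X$, so $I(U;U') = H(U)$ and the soft forms reduce to the classic ones.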

Abstract

Probabilistic representation spaces convey information about a dataset and are shaped by factors such as the training data, network architecture, and loss function. Comparing the information content of such spaces is crucial for understanding the learning process, yet most existing methods assume point-based representations, neglecting the distributional nature of probabilistic spaces. To address this gap, we propose two information-theoretic measures to compare general probabilistic representation spaces, obtained by extending classic methods for comparing the information content of hard clustering assignments. Additionally, we introduce a lightweight method of estimation that is based on fingerprinting a representation space with a sample of the dataset, designed for scenarios where the communicated information is limited to a few bits. We demonstrate the utility of these measures in three case studies. First, in the context of unsupervised disentanglement, we identify recurring information fragments within individual latent dimensions of VAE and InfoGAN ensembles. Second, we compare the full latent spaces of models and reveal consistent information content across datasets and methods, despite variability during training. Finally, we leverage the differentiability of our measures to perform model fusion, synthesizing the information content of weak learners into a single, coherent representation. Across these applications, the direct comparison of information content offers a natural basis for characterizing the processing of information.
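The fingerprint-based estimation mentioned above can be illustrated with a minimal sketch. It assumes Gaussian posteriors with diagonal covariance and a uniform distribution over the sampled data points. The Bhattacharyya coefficient formula is the standard closed form for Gaussians; the information estimate is the pairwise-distance mixture bound, which may differ in detail from the paper's estimator. All function names are hypothetical, and the NMI normalization is an assumption.

```python
import math

def bhattacharyya_coefficient(mu1, sigma1, mu2, sigma2):
    """BC between two diagonal Gaussians given lists of means and std devs."""
    dist = 0.0  # Bhattacharyya distance, summed over dimensions
    for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2):
        var_sum = s1 * s1 + s2 * s2
        dist += (m1 - m2) ** 2 / (4.0 * var_sum)
        dist += 0.5 * math.log(var_sum / (2.0 * s1 * s2))
    return math.exp(-dist)

def fingerprint(means, sigmas):
    """N x N matrix of pairwise BCs between the posteriors of N data points."""
    n = len(means)
    return [[bhattacharyya_coefficient(means[i], sigmas[i], means[j], sigmas[j])
             for j in range(n)] for i in range(n)]

def info_from_fingerprint(bc):
    """Estimate I(U;X) in nats, assuming uniform p(x) over the N points."""
    n = len(bc)
    return -sum(math.log(sum(row) / n) for row in bc) / n

def joint_fingerprint(bc_u, bc_v):
    """Fingerprint of the joint space (U, V): BCs multiply when U and V are
    conditionally independent given x."""
    return [[a * b for a, b in zip(row_u, row_v)]
            for row_u, row_v in zip(bc_u, bc_v)]

def mutual_info(bc_u, bc_v):
    """I(U;V) = I(U;X) + I(V;X) - I(U,V;X), with each term estimated above."""
    joint = joint_fingerprint(bc_u, bc_v)
    return (info_from_fingerprint(bc_u) + info_from_fingerprint(bc_v)
            - info_from_fingerprint(joint))

def nmi(bc_u, bc_v):
    """Assumed normalization: geometric mean of the self-informations.
    Undefined (division by zero) if either space carries no information."""
    denom = math.sqrt(mutual_info(bc_u, bc_u) * mutual_info(bc_v, bc_v))
    return mutual_info(bc_u, bc_v) / denom

def vi(bc_u, bc_v):
    """Variation of information with entropies replaced by I(U;U'), I(V;V')."""
    return (mutual_info(bc_u, bc_u) + mutual_info(bc_v, bc_v)
            - 2.0 * mutual_info(bc_u, bc_v))

# Toy example: four data points, two one-dimensional spaces that each
# split the points into two tight groups, but along different partitions.
sig = [[0.1]] * 4
bc_u = fingerprint([[0.0], [0.0], [10.0], [10.0]], sig)
bc_v = fingerprint([[0.0], [10.0], [0.0], [10.0]], sig)
print(info_from_fingerprint(bc_u), nmi(bc_u, bc_u), nmi(bc_u, bc_v))
```

Once the two fingerprints are computed, every quantity is a function of those N x N matrices alone, so the dimensionalities of the two spaces never need to match. In the toy example each space conveys one bit (log 2 nats) and the two partitions are independent, so a space is maximally similar to itself and shares no information with the other.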

Paper Structure (19 sections, 12 equations, 16 figures)

Introduction
Related work
Method
Comparing representation spaces as soft clusterings
Routes to estimation
Discovering consistently learned information fragments via OPTICS clustering of latent dimensions
Model fusion
Experiments
Comparison of related methods on synthetic spaces
Unsupervised detection of structure: channel similarity
Assessing the content of the full latent space
Model fusion in a toy example
Discussion
Appendix: Extended channel similarity results
Appendix: Extended results on synthesized latent space comparison (Sec. 4.1)
...and 4 more sections

Figures (16)

Figure 1: Similarity of representation spaces. In this work, we generalize measures to compare the information content of clustering assignments to apply to probabilistic representation spaces. (a) A hard clustering assignment, such as the living/non-living distinction conveyed by clustering $V$, communicates certain information about the dataset (here, CIFAR-10 images). Comparing the information content of different clustering assignments enables comparative analyses between algorithms, model fusion, and benchmarking. (b) We generalize measures for comparing hard clustering assignments to be applicable to probabilistic representation spaces, by recognizing the latter as soft clustering assignments. When cast in terms of information content, there is no requirement for the dimensionality of the spaces to match, and hard clusterings (e.g. labels or annotations) can be compared to probabilistic spaces.
Figure 2: Comparing similarity measures for synthetic embedding spaces. (a) A dataset of 64 points, $x=1, ..., 64$, is transformed into nine representation spaces marked i-ix. Each posterior distribution $p(u|x)$ is a Gaussian with diagonal covariance matrix and standard deviations indicated by the colored ellipses. (b) We trained a classification head on top of the latent spaces of panel a to predict the input $x$, given a sample from the posterior distribution $p(u|x)$. The predicted probability distributions $p(\hat{x}_j|x_i)$ are displayed as a matrix, with each row corresponding to an input $x_i$. (c) The pairwise distinguishability of data points $x_i$ and $x_j$, as computed by the Bhattacharyya coefficient, serves as a fingerprint of the information content of the latent space. (d) The pairwise similarity of the representation spaces in panel a, found by a variety of methods. Runtimes to calculate the full matrix are shown above each method, except for Jensen-Shannon divergence (JSD) because it required training an additional classification network on top of each latent space. The stochastic shape metric requires the dimensionality of the compared spaces to match; undefined entries are grayed out. The Spearman rank correlation between similarity measures is shown in the bottom right.
Figure 3: Assessing the consistency of channel information in ensembles of models. We used NMI as a similarity measure for OPTICS to detect fragments of information that are consistently stored in individual channels in an ensemble of trained models. (a) The channel consistency of models trained on dsprites, for $\beta$-VAE and InfoGAN-CR. The information with respect to generative factors is shown on the left of each similarity matrix. The $\beta$-VAE with $\beta=4$ fragmented information inconsistently compared to the other two ensembles. (b) For a $\beta$-VAE ensemble trained on cars3d, the information content of channels was highly consistent, with seven distinct combinations of the three generative factors. Latent traversals for a representative channel from each grouping visualize the information content. (c) We compare the information content of the representatives from panel b to that of channels in $\beta$-TCVAE and FactorVAE ensembles. (d,e) Channel similarity and latent traversals for $\beta$-VAE ensembles trained on fashion-mnist and celebA. Additional channel similarity analyses and latent traversals can be found in the appendix on extended channel similarity results.
Figure 4: Comparing full latent spaces. (a) We compare trained models across several methods and six hyperparameters each, all from Locatello et al. (2019). (b) We compare $\beta$-VAE models over the course of training. $I(U;X)/H(X)$ is the fraction of total information about the dataset contained in the latent space; a value of one means all data points are well-separated in the latent space. $\langle NMI \rangle$ and $\langle VI \rangle$ denote the average pairwise NMI and VI values over five models in an ensemble. All mutual information terms were estimated via Monte Carlo, and the displayed error bars are the standard error after accounting for the uncertainty on the constituent mutual information terms.
Figure 5: Fusing weak representation spaces. (a) Example of a one-dimensional latent space of a $\beta$-VAE trained on a dataset generated from a single periodic factor (color hue), which has SO(2) symmetry. The latent space exhibits flaws where similar values of the generative factor are mapped to dissimilar representations, as seen in the posterior distributions (left) and the distinguishability matrix of Bhattacharyya coefficients between posteriors, $\text{BC}_{ij}$ (right). (b) We optimized a synthesis representation space to maximize similarity with an ensemble of such one-dimensional latent spaces. The continuity of statistical distances between neighboring points, an assessment of the fidelity of the global structure of the generative factor, improved as the ensemble size grew. Error bars show the standard deviation over five experiments, and values are offset horizontally for visibility. (c) Synthesized two-dimensional representation spaces (posterior means shown as points; covariances as shaded ellipses) and their corresponding distinguishability matrices. Panels compare results when maximizing average NMI (left, middle) and mutual information (right).
...and 11 more figures
