Information decomposition in complex systems via machine learning

Kieran A. Murphy; Dani S. Bassett

TL;DR

The paper addresses the problem of identifying the microscale variation most predictive of macroscale behavior in complex systems, using mutual information to relate observables across scales. It introduces a practical distributed information bottleneck (DIB) framework that learns a lossy encoding $U_i$ of each input $X_i$ and minimizes $\mathcal{L}_\textnormal{DIB} = \beta \sum_{i=1}^N I(U_i; X_i) - I(\boldsymbol{U}; Y)$, with neural encoders and variational bounds used to estimate the mutual information terms. The authors demonstrate the approach on a Boolean circuit and on an amorphous material under shear, showing that the method identifies which microvariables carry macroscale relevance and yields a spectrum of compression schemes that reveals the structure of information flow. In the circuit, the most informative inputs emerge in a predictable order; in the glass, the most informative density measurements concentrate in inner radial shells, with predictive accuracy improving from ~72% with one bit to over 90% with ~20 bits. Overall, the work provides a scalable, interpretable, information-theoretic tool for connecting microstructure to macroscopic behavior in complex systems, complementing partial information decomposition (PID) with a tractable, ML-driven decomposition that scales to hundreds of inputs.
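To make the objective concrete, the following is a minimal sketch of how a loss of the form $\beta \sum_i I(U_i; X_i) - I(\boldsymbol{U}; Y)$ is commonly bounded in practice. It assumes diagonal Gaussian encoders with a standard normal prior, so that a KL divergence upper-bounds each $I(U_i; X_i)$, and a classifier whose cross-entropy bounds $-I(\boldsymbol{U}; Y)$ up to the constant $H(Y)$. These choices, and the names `dib_loss` and `gaussian_kl_to_standard_normal`, are illustrative assumptions rather than details taken from the paper.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats for one encoded sample."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def dib_loss(encodings, predictions, labels, beta):
    """Variational surrogate for the DIB loss (assumed form, see lead-in).

    encodings:   per sample, a list over inputs of (mu, log_var) pairs
    predictions: per sample, predicted class probabilities for Y
    labels:      per sample, the integer class of Y
    """
    n = len(labels)
    num_inputs = len(encodings[0])
    # Batch-averaged KL per input: an upper bound on I(U_i; X_i) in nats.
    kl_per_input = [0.0] * num_inputs
    for sample in encodings:
        for i, (mu, log_var) in enumerate(sample):
            kl_per_input[i] += gaussian_kl_to_standard_normal(mu, log_var) / n
    # Cross-entropy: I(U; Y) >= H(Y) - CE, so CE stands in for -I(U; Y).
    ce = sum(-math.log(p[y]) for p, y in zip(predictions, labels)) / n
    return beta * sum(kl_per_input) + ce, kl_per_input, ce

# Toy batch: two samples, two inputs, one latent dimension per input.
encodings = [
    [([0.0], [0.0]), ([1.0], [0.0])],
    [([0.0], [0.0]), ([-1.0], [0.0])],
]
predictions = [[0.5, 0.5], [0.5, 0.5]]
labels = [0, 1]
loss, kl_per_input, ce = dib_loss(encodings, predictions, labels, beta=0.1)
```

In this toy batch the first input's encoding equals the prior and therefore contributes no information cost, while the second input's means vary with the sample and incur a nonzero KL term. Returning the per-input KL values alongside the total loss is what allows an information allocation across inputs to be read off.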

Abstract

One of the fundamental steps toward understanding a complex system is identifying variation at the scale of the system's components that is most relevant to behavior on a macroscopic scale. Mutual information provides a natural means of linking variation across scales of a system due to its independence of functional relationship between observables. However, characterizing the manner in which information is distributed across a set of observables is computationally challenging and generally infeasible beyond a handful of measurements. Here we propose a practical and general methodology that uses machine learning to decompose the information contained in a set of measurements by jointly optimizing a lossy compression of each measurement. Guided by the distributed information bottleneck as a learning objective, the information decomposition identifies the variation in the measurements of the system state most relevant to specified macroscale behavior. We focus our analysis on two paradigmatic complex systems: a Boolean circuit and an amorphous material undergoing plastic deformation. In both examples, the large amount of entropy of the system state is decomposed, bit by bit, in terms of what is most related to macroscale behavior. The identification of meaningful variation in data, with the full generality brought by information theory, is made practical for studying the connection between micro- and macroscale structure in complex systems.
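The combinatorial difficulty mentioned above can be seen in a brute-force sketch that computes the exact mutual information $I(X_S; Y)$ for every subset $S$ of inputs to a small Boolean circuit. The three-input circuit below is hypothetical (it is not one of the paper's circuits), and the inputs are assumed to be uniform and independent. The number of subsets grows as $2^N$, so the same enumeration already requires 1024 subsets for ten inputs and is out of reach for hundreds.

```python
import itertools
import math
from collections import Counter

def circuit(x):
    # Hypothetical circuit for illustration only: Y = (X1 AND X2) XOR X3.
    return (x[0] & x[1]) ^ x[2]

def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def subset_information(n_inputs, fn):
    """Exact I(X_S; Y) in bits for every subset S, assuming uniform inputs."""
    states = list(itertools.product([0, 1], repeat=n_inputs))
    outputs = [fn(x) for x in states]
    h_y = entropy_bits(Counter(outputs))
    info = {}
    for k in range(n_inputs + 1):
        for subset in itertools.combinations(range(n_inputs), k):
            xs = [tuple(x[i] for i in subset) for x in states]
            h_xs = entropy_bits(Counter(xs))
            h_joint = entropy_bits(Counter(zip(xs, outputs)))
            # I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y)
            info[subset] = h_xs + h_y - h_joint
    return info

info = subset_information(3, circuit)
for subset, bits in sorted(info.items(), key=lambda kv: kv[1]):
    print(subset, round(bits, 4))
```

For this circuit, `X1` alone carries no information about `Y`, yet the pair `(X1, X3)` carries half a bit. This kind of synergy is why inputs cannot simply be ranked one at a time.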

Paper Structure (3 sections, 8 equations, 8 figures)

Methods
Results
Discussion

Figures (8)

Figure 1: Decomposing the information contained in the inputs of a Boolean circuit. (a) Ten binary inputs $\boldsymbol{X}=(X_1, \ldots, X_{10})$ are connected via AND, OR, and XOR gates to a binary output $Y$. (b) A lossy compression $U_i$ is learned for each $X_i$, and then all $U_i$ are combined as input to a machine learning model trained to predict $Y$. (c) The distributed information plane displays the predictive information about the output (left vertical axis, black) as a function of the total information utilized about the input. For each value of total information into the model there is an allocation of information to the input gates indicating their relevance to the output $Y$ (right vertical axis, colors corresponding to input gates in panel (a)). The subset of inputs identified as containing the most relevant information ($I(U_i;X_i)\ge 0.1$ bits) is indicated at the top of the plot. Dashed lines are used for the information allocations when there is significant overlap. (d) The mutual information between each subset of input channels and $Y$ is displayed on the distributed information plane as a black circle. The optimization of the distributed IB (gray curve) identified subsets of inputs that contain the most predictive information (open circles).
Figure 2: Decomposing structural information about imminent rearrangement in a sheared glass. (a) Inset: Given a local neighborhood in a sheared glass, densities of radial shells for the small (type A) and large (type B) particles were used to predict whether the neighborhood is the locus of an imminent rearrangement event. Main: For a gradually quenched glass, the information that is predictive of rearrangement (black) increased as the most predictive density information was identified and incorporated into the machine learning model. The accuracy (blue) was comparable to a support vector machine (SVM) (dashed line) after around twenty bits. (b) Sharing the horizontal axis with panel (a), the amount of information extracted about each of the radial density measurements of small (top) and large (bottom) particles reveals the radii with the most predictive information at each level of approximation. The system's average density values for each particle type with type A at the center, also known as the radial distribution functions $g_\textnormal{AA}(r)$ and $g_\textnormal{AB}(r)$, are shown on the right. (c, d) The same as panels (a, b) but for glass that was prepared via a rapid quench rather than a gradual quench.
Figure 3: Selected bits of information as distinctions among raw measurement values. (a) Lossy compression is achieved by mapping the raw values of $X$ to probability distributions in latent space. The statistical similarity of the conditional distributions, visualized as a distance matrix for all pairs of feature values, determines how distinguishable the raw feature values are to the predictive model. (b) The single most predictive bit of information about rearrangement in the gradually quenched glass came predominantly from two density measurements. The distinguishability matrices indicate that the compression scheme applied a simple threshold to these measurements: density values less than a cutoff value $\rho^\prime$ were indistinguishable from each other, as were values above the cutoff. The histograms of density values conditioned on rearrangement (right) show that the learned cutoff value separates the probability masses. (c) The twenty most predictive bits of radial density information in the rapidly quenched glass were selected from many radial bands. The two that contribute more than a bit of information each correspond to the density of type A particles near the center; one compression scheme effectively counted the number of particles in the high density shell. The distinguishability matrices of the next five most informative radial bands are shown below.
Figure 4: Information decomposed in terms of per-particle measurement basis. (a) Instead of the density of radial shells, each particle's position and type in a local neighborhood were used as input measurements to relate to rearrangement. (b) The per-particle information transmitted as a function of particle position, for the small type A (left) and large type B (right) particles, for the predictive model utilizing 66 bits of information about the rapidly quenched glass. The scale bar is a distance of one in simulation units, equal to the length scale of interaction between type A and type B particles. (c) Averaged radially, the information (black) resides in particles that are situated in the first troughs of the radial distribution function, $g(r)$ (colored curves). (d) For a particle at position $\vec{r}_0$, the distinguishability of particles of the same type at all other locations has a radial structure and indicates that negligible azimuthal information was transmitted.
Figure S1: Information decomposition of additional Boolean circuits. (a-f) Distributed IB analysis of randomly generated Boolean circuits, from three to six input gates. The circuit diagram (left) displays the circuit that was used to generate the joint distribution for training the distributed IB. Under each circuit we show two alternative routes to probe the importance of $X_i$ with respect to $Y$: model weights of logistic regression and Shapley values regarding information gain about the output $Y$. The distributed information plane (middle) shows the total information $I(\boldsymbol{U};Y)$ (black) and the information allocation by input gate $I(U_i;X_i)$ (color) as a function of total information transmitted to the predictive model, $\sum I(U_i;X_i)$. Information allocation curves are dashed in cases where it aids visual clarity. On the right, we reproduce the distributed IB trajectory in gray and compare it to the information contained in each of the possible discrete subsets. Each discrete subset is colored according to the input gates in the subset. We have shifted points horizontally when there was overlap. In all plots, the horizontal dashed line indicates $H(Y)$, the entropy of $Y$, though it occasionally coincides with the upper horizontal axis. In (f), we suppress visualization of the subsets that all have zero information, and instead indicate the number of such subsets.
...and 3 more figures
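
The distinguishability matrices described in the Figure 3 caption can be illustrated with a small sketch. It assumes each raw feature value is encoded as a one-dimensional Gaussian and uses the Bhattacharyya distance as the measure of statistical similarity. Both the choice of distance and the example encodings are assumptions made for illustration and are not details confirmed by the captions.

```python
import math

def bhattacharyya_gaussian_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term

def distinguishability_matrix(encodings):
    """Pairwise distances between the encodings (mean, variance) of feature values."""
    return [[bhattacharyya_gaussian_1d(*a, *b) for b in encodings] for a in encodings]

# Hypothetical threshold-like encoder: the two feature values below a cutoff
# share one latent distribution and the two above it share another.
encodings = [(-2.0, 1.0), (-2.0, 1.0), (2.0, 1.0), (2.0, 1.0)]
matrix = distinguishability_matrix(encodings)
for row in matrix:
    print([round(d, 3) for d in row])
```

A block structure in such a matrix, with zero distance within each group and a large distance across groups, is the signature of a thresholding compression scheme like the one described for panel (b).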
