Finding emergence in data by maximizing effective information

Mingzhe Yang; Zhipeng Wang; Kaiwei Liu; Yingqi Rong; Bing Yuan; Jiang Zhang

TL;DR

This paper introduces a machine learning framework that learns macro-dynamics in an emergent latent space and quantifies the degree of causal emergence (CE). By maximizing effective information, it yields a macro-dynamics model with enhanced causal effects.

Abstract

Quantifying emergence and modeling emergent dynamics in a data-driven manner for complex dynamical systems is challenging due to the lack of direct observations at the micro-level. Thus, it is crucial to develop a framework to identify emergent phenomena and capture emergent dynamics at the macro-level using available data. Inspired by the theory of causal emergence (CE), this paper introduces a machine learning framework to learn macro-dynamics in an emergent latent space and quantify the degree of CE. The framework maximizes effective information, resulting in a macro-dynamics model with enhanced causal effects. Experimental results on simulated and real data demonstrate the effectiveness of the proposed framework. It quantifies degrees of CE effectively under various conditions and reveals distinct influences of different noise types. It can learn a one-dimensional coarse-grained macro-state from fMRI data that represents complex neural activities during movie clip viewing. Furthermore, improved generalization to different test environments is observed across all simulation data.

Paper Structure (43 sections, 9 theorems, 74 equations, 16 figures, 2 tables)

Introduction
Finding causal emergence in data
Problem definition
Solution
Results
SIR
Boids
Real fMRI time series data for brains
Concluding Remarks
Methods and Data
Machine Learning Frameworks
Neural Information Squeezer (NIS)
Neural Information Squeezer Plus (NIS+)
Extensions for Practical Computations
Training NIS+
...and 28 more sections

Key Result

Theorem 5.1

For a given value of $q$, assume that $\omega^\ast, \theta^\ast$, and $\theta'^\ast$ are the optimal solutions to the unconstrained objective optimization problem defined by equation (new optimization). Then $\phi^\ast\equiv Proj_q(\psi_{\omega^\ast}),f^\ast_q\equiv f_{\theta^\ast}$, and $\phi^{\d

Figures (16)

Figure 1: (a) An illustration of the fundamental concept of Erik Hoel's theory of causal emergence (CE). The effective information (EI) is denoted as $\mathcal{J}$ in this paper. (b) A case demonstrating CE in a discrete Markov chain. The micro-dynamics consist of eight micro-states. During the coarse-graining process, the first seven states are grouped together as one macro-state, while the eighth micro-state corresponds to the second macro-state. As a result, a transition probability matrix is formed at the macro-scale, where the effective information $\mathcal{J}(f_M)=1$ (calculated using Equation \ref{eq:definition_EI}), which is greater than $\mathcal{J}(f_m)=0.55$. This difference, $\Delta\mathcal{J}=0.45$, indicates the occurrence of CE, as $\Delta\mathcal{J}>0$.
Figure 1: Structure diagram of RealNVP.
Figure 2: The workflow and architecture of our proposed framework, Neural Information Squeezer Plus (NIS+), for discovering causal emergence within data. (a) Various forms of input data from our studied simulation systems such as the Boid flocking model (multi-agent system), Conway's Game of Life (two-dimensional cellular automata), and real brain fMRI time series data. (b) The framework of our proposed model, NIS+, which incorporates our previous model, NIS. The boxes represent functions or neural networks, and the arrow pointing to a cross represents the operation of information discarding. $\boldsymbol{x}_t$ and $\boldsymbol{x}_{t+1}$ represent the observational data of micro-states, while $\hat{\boldsymbol{x}}_{t+1}$ represents the predicted micro-state. $\boldsymbol{y}_t=\phi(\boldsymbol{x}_{t})$ and $\boldsymbol{y}_{t+1}=\phi(\boldsymbol{x}_{t+1})$ represent the macro-states obtained by encoding the micro-states using the encoder. $\hat{\boldsymbol{y}}_t=\phi(\hat{\boldsymbol{x}}_{t})$ and $\hat{\boldsymbol{y}}_{t+1}=\phi(\hat{\boldsymbol{x}}_{t+1})$ represent the predicted macro-states obtained by encoding the predictions of micro-states. The mathematical problems that each framework aims to solve are also illustrated in the figure. (c) The various output forms of NIS+, which include the degree of CE, the learned macro-dynamics, captured emergent patterns, and the strategy of coarse-graining.
Figure 2: The causal graph among random variables after intervention on $Y_t$ according to the framework of NIS or NIS+. Because the intervention on $Y_t$ can only affect the variables in the upper part of the NIS+ framework, we ignore the variables in the lower part. In the diagram, $X_t'$ represents the random variable obtained after reversible transformation of $X_t$, $X''_t$ represents the variable directly discarded during the projection process, $\tilde{Y}_{t+1}'$ represents a new variable composed of $\tilde{Y}_{t+1}$ concatenated with a standard normal distribution, and $\xi_{p-q}$ represents a $p-q$ dimensional standard normal distribution. The dashed circular shape in the diagram represents the variable that is directly intervened to a uniform distribution, and the dashed arrow represents the causal relationship that is severed due to the intervention.
Figure 3: The experimental results of NIS+ and compared models on the SIR model with observational noise. (a) The phase space of the SIR model, along with four example trajectories with the same infection and recovery or death rates. The full dataset (blue area) and the partial dataset (dotted area) used for training are also displayed, consisting of 63,000 and 42,000 uniformly distributed data points, respectively. (b) The curves depict the change in dimension-averaged effective information ($\mathcal{J}$) with training epochs for different models. The lines represent the means, while the band widths represent the standard deviations of five repeated experiments. (c) A comparison is made among the vector fields of the SIR dynamics, the learned macro-dynamics of NIS+, and the macro-dynamics transformed by the Jacobian of the learned encoder. Each arrow represents the direction and magnitude of the derivative of the dynamics at that coordinate point. For detailed procedures, please refer to the supporting information section \ref{sec:sir vector field}. (d) A comparison is conducted to evaluate the errors in multi-step predictions for different models trained on either partial datasets (with 42,000 missing data points) or complete datasets (see details in the supporting information section \ref{sec:sir data detail}). These models include NIS+, NIS, a feed-forward neural network (NN), a feed-forward neural network with inverse probability weighting and inverse dynamics learning techniques (NN+), a Variational Autoencoder (VAE), and its reweighted and inverse dynamics version (VAE+). Please refer to the details of the parameters in method section \ref{sec:nn vae}. (e) The variations in the measure of CE ($\Delta\mathcal{J}$) and EIs for micro-dynamics ($\mathcal{J}(f_m)$) and macro-dynamics ($\mathcal{J}(f_M)$) are plotted as the standard deviation $\sigma$ of observation noise changes. All these indicators are averaged across dimensions. Following Rosas' definition and calculation method for CE (see method section \ref{sec:Rosas psi}), the yellow line demonstrates the changes in Rosas' $\Psi$ (Rosas et al., 2020). The vertical line represents the threshold for the normalized MAE equaling 0.3. When $\sigma$ is larger than the threshold, the constraint of error in Equation \ref{old optimization} is violated, and the results are not reliable. (f) A comparison is made among the vector fields of the SIR dynamics, the learned macro-dynamics of NIS, and the macro-dynamics transformed by the encoder Jacobian matrix of NIS, in comparison with (c).
...and 11 more figures
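
The Markov-chain case in the first Figure 1, panel (b), can be checked numerically. The sketch below is illustrative only: the caption does not give the transition matrix, so the 8-state matrix here is an assumption (the first seven states jump uniformly among themselves, the eighth maps to itself), and the helper names `effective_information` and `coarse_grain` are hypothetical. EI is computed as the average KL divergence, in bits, between each row of the transition matrix and the mean row, which corresponds to a uniform intervention on the current state.

```python
import math

def effective_information(tpm):
    """EI in bits of a row-stochastic matrix under a uniform intervention on the state."""
    n = len(tpm)
    # Effect distribution: the plain average of all rows (each state set with prob. 1/n).
    avg = [sum(row[j] for row in tpm) / n for j in range(n)]
    ei = 0.0
    for row in tpm:
        # KL divergence of this row from the average row; zero entries contribute nothing.
        ei += sum(p * math.log2(p / avg[j]) for j, p in enumerate(row) if p > 0) / n
    return ei

def coarse_grain(tpm, groups):
    """Merge micro-states into macro-states, averaging the rows of each group uniformly."""
    k = max(groups) + 1
    counts = [groups.count(g) for g in range(k)]
    macro = [[0.0] * k for _ in range(k)]
    for i, row in enumerate(tpm):
        for j, p in enumerate(row):
            macro[groups[i]][groups[j]] += p / counts[groups[i]]
    return macro

# ASSUMED micro-dynamics (not given in the caption): states 0..6 mix uniformly,
# state 7 is a fixed point.
micro = [[1 / 7] * 7 + [0.0] for _ in range(7)] + [[0.0] * 7 + [1.0]]
# Coarse-graining from the caption: first seven states -> macro-state 0, eighth -> 1.
groups = [0] * 7 + [1]

macro = coarse_grain(micro, groups)
ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
delta = ei_macro - ei_micro  # a positive value indicates causal emergence
print(f"J(f_m)={ei_micro:.3f}  J(f_M)={ei_macro:.3f}  dJ={delta:.3f}")
```

Under this assumed matrix the macro-level EI is 1 bit and the micro-level EI is roughly 0.54 bits, close to the 0.55 reported in the caption; exact agreement depends on the matrix actually used in the paper. Uniform averaging of rows within a group is also a simplifying choice that is harmless here because the grouped rows are identical.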

Theorems & Definitions (16)

Theorem 5.1: Problem Transformation Theorem
Theorem 5.2: Problem Transformation in Extensions of NIS+ Theorem
Theorem 5.3: Universal Approximating Theorem of Stacked Encoder
Lemma 10.1: Bijection mapping does not affect mutual information
Lemma 10.2: Mutual information will not be affected by concatenating independent variables
Lemma 10.3: Variational upper bound of a conditional entropy
proof
proof
Lemma 10.4: Mutual information will not be affected by stacked encoder
proof
...and 6 more
