Reconstructing the evolution history of networked complex systems

Junya Wang, Yi-Jiao Zhang, Cong Xu, Jiaze Li, Jiachen Sun, Jiarong Xie, Ling Feng, Tianshou Zhou, Yanqing Hu

TL;DR

Intriguingly, it is discovered that for large networks, if the performance of the machine learning model is slightly better than a random guess on the pairwise order of links, reliable restoration of the overall network formation process can be achieved, suggesting that evolution history restoration is generally highly feasible on empirical networks.

Abstract

The evolution processes of complex systems carry key information about the systems' functional properties. Applying machine learning algorithms, we demonstrate that the historical formation process of various networked complex systems can be extracted, including protein-protein interaction, ecological, and social network systems. The recovered evolution process is of immense scientific value: it helps interpret the evolution of protein-protein interaction networks, facilitates structure prediction, and in particular reveals key co-evolution features of network structures, such as preferential attachment, community structure, local clustering, and degree-degree correlation, that could not be explained collectively by previous theories. Intriguingly, we discover that for large networks, if the machine learning model performs even slightly better than a random guess on the pairwise order of links, reliable restoration of the overall network formation process can be achieved. This suggests that evolution history restoration is generally highly feasible on empirical networks.

Paper Structure (16 sections, 9 equations, 6 figures, 1 table, 3 algorithms)

Figures (6)

  • Figure 1: The network formation process and its restoration. a Illustration of a network formation process. At each snapshot $T_0$, $T_1$, $T_2$, ..., $T_n$, some new edges are added (darker edges appeared earlier). The goal of this study is to restore the generation order of the edges based on the final network structure at $T_n$. b-c Diagram of the proposed approach to restoring the temporal sequence of edges for a network with partial evolution history or without any historical information.
  • Figure 2: Performance of the ensemble model and the restored edge sequence. a Test accuracy of the ensemble model as a function of the percentage of edge pairs used for training. Each data point with error bars marks the corresponding simulation results (average $\pm$ standard deviation of 100 simulations); the same applies to b. b Overall error $\mathcal{E}$ as a function of the accuracy $x$ of the ensemble model for different numbers of edges $E$. The solid curves represent the theoretical results from Eq. (\ref{eq:theoretical_relation_error_x}) and the colored crosses stand for the simulation results using the $E$ and $x$ of five real-world networks. c Simulated distributions of $D_i/E$ using the $E$ and $x$ of five real-world networks. Specifically, assuming the ground-truth sequence $\boldsymbol\alpha=(1, 2, \ldots, E)$, $100(1-x)\%$ of all edge pairs are randomly selected and artificially assigned the wrong generation order while the remaining edge pairs are assigned the correct one. Then, the restored edge sequence $\widehat{\boldsymbol\alpha}$ is obtained by applying the ranking algorithm to the artificially predicted order of all edge pairs, and the $D_i$'s are calculated accordingly. d-e Comparisons between the real and simulated distributions of $D_i/E$ based on the collaboration network (CN) and the PPI network (Fungi). f Diagram illustrating how the distributions in c-e are obtained. The left and right panels show the calculation of $D_i$ in the real case, when we only know the coarse-grained ground-truth sequence, and in a simulation, when we know the fine-grained ground-truth sequence, respectively. For the real case, $D_i$ cannot be calculated directly as $\alpha_i-\widehat{\alpha}_i$, so the idea is to consider an intermediate sequence $\boldsymbol\alpha^*$ obtained by randomly assigning a fine-grained order to edges added within the same snapshot; $D_i$ is then calculated as $\alpha^*_i-\widehat{\alpha}_i$. The distribution of $D_i/E$ is obtained by averaging over 5000 $\boldsymbol\alpha^*$'s to take the randomness into account. For the simulation, the calculation of $D_i$ follows a similar procedure to match the real case. The results for the real case and the simulation are labeled "Real Data" and "Simulation" in d and e. See Algorithms 2-3 in the Methods section for more details.
  • Figure 3: Application to the PPI network for fungi. a The restored network structure and protein functional clusters at the time when the first 1000 edges had been added. Node size indicates the order in which the nodes appear: nodes added first (i.e., the nodes corresponding to the edges that are added first) are larger. The colors of the nodes represent different functions of the proteins (full protein functions are listed in SI Table S8). It can be seen that in the evolution process of the PPI network, interactions between proteins form protein clusters with specific functions. b The number of proteins by function over time, counted according to the order of the edges. Proteins at both ends of each edge are considered. c The number of proteins with different functions added in each interval of 300 edges. The functions represented by each capital letter can be found in a.
  • Figure 4: Underlying growth mechanism in the restored network evolution processes. Cumulative PA function $\kappa(k)$ for a PPI network (Fungi), b World Trade Web, and c Collaboration network (Interfaces). In each panel, the yellow circles and blue triangles are the results of the ground-truth and restored evolution processes, respectively. If the growth of a network follows the PA rule, the rate at which a node with degree $k$ acquires new edges should be positively correlated with $k$ and the cumulative PA function $\kappa(k)$ is expected to grow superlinearly (see SI Sec. 9 for details). Accordingly, the solid gray line with slope $=1$ represents the case in which PA is absent. d Adjacency matrices of the evolution process for the protein function network generated by the PPI network (Fungi). Proteins with the same function in the network are treated as a single node to form a simplified protein function network where the edges represent the interactions between proteins, with weights being the number of protein interactions. The upper row shows the results based on our restored temporal edge sequence while the lower row shows those based on a simulation study assuming the pure PA rule. The simulation is performed by adding edges according to the PA rule and keeping the average node degree consistent with the real network (details are provided in SI Sec. 9). e-f Visualizations of the protein function network in d when the number of edges is $E=2000$ and $E=5425$. Letters marking the nodes denote the protein functions (with specific meanings listed in SI Table S8), and the self-connected and non-self-connected edges are displayed in blue and red, respectively. g The modularity \cite{newman2006modularity} of the PPI network (Fungi). The yellow triangles represent results computed at the real snapshots of the networks. The blue solid lines and pink dashed lines are results based on the edge generation order given by our reconstruction method and by the pure PA rule, respectively. h-i Adjacency matrix and protein function network of the PPI network (Fungi) obtained at the first real snapshot (i.e., $E=5425$).
  • Figure 5: Assortativity coefficient, local clustering coefficient, and shortest path length for the restored evolution processes. The assortativity coefficient for a PPI network (Bacteria), b World Trade Web (WTW), and c Animal network (Weaver). The average local clustering coefficient for d PPI network (Bacteria), e WTW, and f Animal network (Weaver). The average shortest path length for g PPI network (Bacteria), h World Trade Web (WTW), and i Animal network (Weaver). The yellow triangles represent results computed at the real snapshots of the networks. The blue solid lines and red dashed lines are results based on edge generation order by our restoring method and by random assignment, respectively. The pink dashed lines are results for networks generated assuming the pure PA rule. Note that due to the presence of disconnected components during the evolution process of a network, the computation of the average shortest path length only involves pairs of nodes that can be connected.
  • ...and 1 more figure
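
The simulation described in the Figure 2c caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the ranking algorithm is not specified in this excerpt, so the sketch assumes a simple win-count (Borda-style) ranking, and the names `simulate_displacements`, `num_edges`, and `accuracy` are hypothetical.

```python
import random
from itertools import combinations


def simulate_displacements(num_edges, accuracy, seed=0):
    """Return D_i / E for each edge under noisy pairwise order predictions.

    Ground truth: edge i was generated at position i. A fraction
    (1 - accuracy) of all edge pairs is chosen at random and given the
    wrong order; the remaining pairs get the correct order.
    """
    rng = random.Random(seed)
    # Each pair (i, j) has i < j, so edge i is truly the earlier one.
    pairs = list(combinations(range(num_edges), 2))
    num_wrong = round((1.0 - accuracy) * len(pairs))
    wrong = set(rng.sample(range(len(pairs)), num_wrong))

    # wins[e] = number of pairs in which edge e is predicted to be earlier.
    wins = [0] * num_edges
    for idx, (i, j) in enumerate(pairs):
        predicted_earlier = j if idx in wrong else i
        wins[predicted_earlier] += 1

    # ASSUMED ranking rule (not from the source): more "earlier" votes means
    # an earlier restored position; ties are broken at random.
    order = sorted(range(num_edges), key=lambda e: (-wins[e], rng.random()))
    restored_pos = [0] * num_edges
    for pos, edge in enumerate(order):
        restored_pos[edge] = pos

    # D_i = alpha_i - alpha_hat_i, normalized by the number of edges E.
    return [(i - restored_pos[i]) / num_edges for i in range(num_edges)]
```

Calling `simulate_displacements(200, 0.6)` and `simulate_displacements(200, 0.5)` and comparing the spread of the returned values gives a rough feel for how a pairwise accuracy only modestly above chance already tightens the distribution of $D_i/E$.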
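
The coarse-grained evaluation described in the Figure 2f caption, where only the snapshot of each edge is known, might be sketched as follows. The function and parameter names (`sample_fine_grained_order`, `displacement_samples`, `snapshot_of_edge`, `restored_pos`) are hypothetical, and the sketch simply pools $D_i/E$ over the sampled $\boldsymbol\alpha^*$'s; the paper's Algorithms 2-3 may differ in detail.

```python
import random


def sample_fine_grained_order(snapshot_of_edge, rng):
    """Draw one alpha*: positions 1..E that respect the snapshot order,
    with edges inside the same snapshot ordered uniformly at random."""
    edges = range(len(snapshot_of_edge))
    order = sorted(edges, key=lambda e: (snapshot_of_edge[e], rng.random()))
    alpha_star = [0] * len(snapshot_of_edge)
    for pos, edge in enumerate(order, start=1):
        alpha_star[edge] = pos
    return alpha_star


def displacement_samples(snapshot_of_edge, restored_pos, num_samples=5000, seed=0):
    """Pool D_i / E = (alpha*_i - alpha_hat_i) / E over many random alpha*.

    snapshot_of_edge[e]: index of the snapshot in which edge e first appears.
    restored_pos[e]: restored position (1..E) of edge e.
    """
    rng = random.Random(seed)
    num_edges = len(snapshot_of_edge)
    samples = []
    for _ in range(num_samples):
        alpha_star = sample_fine_grained_order(snapshot_of_edge, rng)
        samples.extend(
            (alpha_star[e] - restored_pos[e]) / num_edges for e in range(num_edges)
        )
    return samples
```

A histogram of the returned list approximates the distribution of $D_i/E$ for the real case; applying the same routine to a simulated restored sequence gives the matching "Simulation" curve.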
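
The note in the Figure 5 caption, that the average shortest path length only involves pairs of nodes that can be connected, can be made concrete with a small breadth-first-search sketch. It is a plain-Python illustration for an undirected, unweighted edge list; the function name `average_shortest_path_connected` is hypothetical and the code is not taken from the paper.

```python
from collections import deque


def average_shortest_path_connected(edges):
    """Average shortest path length over node pairs joined by some path.

    Pairs lying in different connected components are skipped, so the
    result stays finite even when the network is disconnected.
    """
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total, count = 0, 0
    for source in adj:
        # Breadth-first search gives hop distances to all reachable nodes.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        # Each reachable pair is counted once per direction; the mean is unaffected.
        for target, d in dist.items():
            if target != source:
                total += d
                count += 1
    return total / count if count else 0.0
```

To follow the quantity along a restored evolution process, the function can be evaluated on growing prefixes of the restored edge sequence, for example `average_shortest_path_connected(sequence[:m])` for increasing `m`.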