Leveraging advances in machine learning for the robust classification and interpretation of networks

Raima Carol Appaw, Nicholas Fountain-Jones, Michael A. Charleston

TL;DR

This work tackles the problem of identifying the generative model that best explains observed networks by leveraging interpretable machine learning. It combines large-scale synthetic data from ER, SW, Spatial, SF, and SBM models with empirical networks, and uses SHAP and Friedman-Hastie statistics to uncover main effects and feature interactions among a rich set of graph metrics, including spectral properties. The study demonstrates near-perfect classification accuracy and clarifies how spectral measures and centrality-related features drive model discrimination, providing thresholds where interactions become decisive. The authors also deliver a practical toolkit, including an open-source pipeline and an interactive Shiny app, enabling researchers to classify new networks and interpret the driving feature interactions in real-world contexts.

Abstract

The ability to simulate realistic networks based on empirical data is an important task across scientific disciplines, from epidemiology to computer science. Simulation approaches often involve selecting a suitable network generative model, such as Erdös-Rényi or small-world. However, few tools are available to quantify whether a particular generative model is suitable for capturing a given network structure or organization. We utilize advances in interpretable machine learning to classify simulated networks by their generative models based on various network attributes, using both primary features and their interactions. Our study underscores the significance of specific network features and their interactions in distinguishing generative models, understanding complex network structures, and explaining the formation of real-world networks.
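
As a rough illustration of what "network attributes" can mean here, the following minimal Python sketch simulates two small networks and computes three features of the kind the paper works with (mean degree, transitivity, and spectral radius). It is not the authors' pipeline: the function names, graph sizes, and parameters are assumptions chosen for illustration, and a plain ring lattice stands in for a small-world network with no rewiring.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """G(n, p): include each possible edge independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def ring_lattice(n, k):
    """Ring where each node links to its k nearest neighbours on each side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def mean_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def transitivity(adj):
    """Global clustering: closed triples divided by all connected triples."""
    closed = 0   # each triangle is counted once per corner, i.e. three times
    triples = 0
    for nbrs in adj.values():
        d = len(nbrs)
        triples += d * (d - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def spectral_radius(adj, iters=500):
    """Largest adjacency eigenvalue via power iteration on A + I.

    Shifting by the identity avoids oscillation on bipartite graphs.
    """
    n = len(adj)
    x = {v: 1.0 / n ** 0.5 for v in adj}
    shifted = 1.0
    for _ in range(iters):
        y = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        shifted = sum(x[v] * y[v] for v in adj)  # Rayleigh quotient of A + I
        norm = sum(val * val for val in y.values()) ** 0.5
        x = {v: val / norm for v, val in y.items()}
    return shifted - 1.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
graphs = {
    "ER": erdos_renyi(60, 0.1, rng),
    "ring lattice": ring_lattice(60, 3),
}
features = {
    name: {
        "mean_degree": mean_degree(g),
        "transitivity": transitivity(g),
        "spectral_radius": spectral_radius(g),
    }
    for name, g in graphs.items()
}
for name, row in features.items():
    print(name, row)
```

Rows like these, one per simulated network and labelled with the generating model, are the kind of feature table a classifier could be trained on. The ring lattice is regular, so its spectral radius equals its degree, which makes it a convenient sanity check.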

Paper Structure (42 sections, 21 equations, 20 figures, 2 tables)

Figures (20)

  • Figure 1: Overview of the network classification method. 1. Many example networks are simulated with the generative models (light grey), and their features are calculated, with important features retained during data preprocessing (dark grey). 2. Data preprocessing includes feature extraction and engineering. 3. Data is split into training and test sets (blue). 4. The Synthetic Minority Over-sampling Technique (SMOTE), via the tidymodels framework in R (kuhn2022tidy), is applied to correct for class imbalance. 5. Models are trained and, once their hyperparameters have been tuned (light grey), the best classification model is selected (pink). 6. The test data is used for model prediction (yellow) and evaluation (pink). 7. The final selected model is applied to new data sets (green) to predict the generative model class (yellow).
  • Figure 2: Pearson correlation among selected network features. Most features are correlated. The strength and direction of each correlation are indicated by size and color. For example, the Fiedler value and mean degree are highly positively correlated, and the normalized Fiedler value has a high negative correlation with modularity.
  • Figure 3: The SHAP feature importance plot provides insight into the significance of various features in predicting Erdös-Rényi, stochastic-block-model, scale-free, spatial, and small-world networks. Key features are positioned at the top of the plot and are represented by longer orange bars, indicating their stronger importance in influencing the model’s outcome. Notable predictors for the network types include normalized Fiedler, modularity, degree centrality, transitivity, and spectral radius. In the feature effect plot, blue indicates lower feature values, while red denotes higher values, across the population of all network instances.
  • Figure 4: SHAP dependence plots showing, for each feature, a scatter plot of the feature's value (x-axis) against its SHAP value (y-axis), which quantifies the feature's impact on the model's prediction, for the Erdös-Rényi, stochastic-block-model, scale-free, spatial, and small-world classes. Each class is plotted separately, so the plot shows how the prediction changes as the feature's value varies. The direction and magnitude of the SHAP values indicate how important each feature is for each class and whether its effect is consistent across classes or varies between them.
  • Figure 5: Proportion of the joint-effect variability of two features that is explained by their pairwise interaction, for Erdös-Rényi, small-world, scale-free, spatial, and stochastic-block-model predictions. The x-axis and y-axis represent the feature values and the pairwise feature combinations, respectively. The length of each bar indicates the strength of the pairwise interaction, that is, how much the proportion of joint-effect variability it explains influences the model's final prediction across the different generative models.
  • ...and 15 more figures
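
The SHAP values in the captions above build on Shapley values from cooperative game theory. As a hedged illustration of what such an attribution measures, the sketch below computes exact Shapley values for a toy scoring function by enumerating feature coalitions. The function `toy_score`, its coefficients, and the feature values are hypothetical and are not taken from the paper's fitted model. This brute-force enumeration is also not how the SHAP library computes its estimates, and it is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition are held at their baseline value.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {
                    g: (x[g] if g in coalition else baseline[g]) for g in names
                }
                with_f = dict(without, **{f: x[f]})
                total += weight * (predict(with_f) - predict(without))
        phi[f] = total
    return phi


def toy_score(feats):
    """Hypothetical class score with one pairwise interaction term."""
    return (
        1.5 * feats["normalized_fiedler"]
        - 1.0 * feats["modularity"]
        + 2.0 * feats["normalized_fiedler"] * feats["transitivity"]
    )


# Made-up feature values for one network, and an all-zero reference point.
x = {"normalized_fiedler": 0.4, "modularity": 0.3, "transitivity": 0.5}
baseline = {name: 0.0 for name in x}

phi = shapley_values(toy_score, x, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the score at `x` and the score at the baseline. In this toy setup, the product term is split equally between the two interacting features, which is why a separate interaction measure (as in Figure 5) is needed to see pairwise effects directly.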