Machine Learning Informed by Micro and Mesoscopic Statistical Physics Methods for Community Detection

Yijun Ran, Junfan Yi, Wei Si, Michael Small, Ke-ke Shang

TL;DR

This work addresses a limitation of community detection methods that rely only on mesoscopic structure: it embeds micro-level node-pair similarities into mesoscopic structures using ensemble learning. The authors build a framework that samples first- and second-order node pairs, computes microscopic features (degree heterogeneity, clustering heterogeneity, and common neighbors), and trains decision tree (DT), random forest (RF), and XGBoost models to estimate pairwise similarity. The estimated similarity is squared and integrated into a weighted similarity network for final detection. Across artificial and real networks, the approach yields higher modularity $Q$ (and $Q^w$) and better agreement with ground truth as measured by NMI and ARI, with the strongest gains when ground-truth labels are available. Correlations between node-pair similarity and the evaluation metrics support the central premise. The results illustrate a productive synergy between machine learning and statistical-physics methods, offering a scalable, robust path to uncovering real-world community structures and suggesting a teacher-student dynamic in which physics-based detection supervises learning and, in turn, learning refines physics-based detection.
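
The sampling, feature, and weighting steps summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: "first-order" is read as linked pairs and "second-order" as unlinked pairs that share a neighbor, the three feature formulas are illustrative guesses at the named quantities, and the similarity values (0.9 and 0.2) are hypothetical stand-ins for the output of a trained DT, RF, or XGBoost model.

```python
from itertools import combinations

# Toy undirected graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# Hypothetical community labels (from a detection algorithm or ground truth).
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}


def clustering(adj, u):
    """Local clustering coefficient of node u."""
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[u], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def sample_pairs(adj):
    """Assumed reading: first-order pairs are linked; second-order pairs
    are unlinked but share at least one neighbor."""
    first, second = set(), set()
    for u in adj:
        for v in adj[u]:
            first.add((min(u, v), max(u, v)))
    for w in adj:
        for u, v in combinations(sorted(adj[w]), 2):
            if v not in adj[u]:
                second.add((u, v))
    return sorted(first), sorted(second)


def pair_features(adj, u, v):
    """Illustrative feature definitions (assumptions, not the paper's formulas)."""
    ku, kv = len(adj[u]), len(adj[v])
    return {
        "common_neighbors": len(adj[u] & adj[v]),
        "degree_heterogeneity": abs(ku - kv) / max(ku, kv),
        "clustering_heterogeneity": abs(clustering(adj, u) - clustering(adj, v)),
    }


first, second = sample_pairs(adj)

# Label 1 for intra-community pairs and 0 for inter-community pairs.
dataset = [
    (pair, pair_features(adj, *pair), int(community[pair[0]] == community[pair[1]]))
    for pair in first + second
]

# Hypothetical similarities standing in for a trained model's predictions.
first_set = set(first)
predicted = {
    pair: (0.9 if label == 1 else 0.2)
    for pair, _, label in dataset
    if pair in first_set
}

# The similarity is squared before it is used as an edge weight.
edge_weight = {pair: s ** 2 for pair, s in predicted.items()}
```

Squaring widens the gap between high- and low-similarity links (0.9 versus 0.2 becomes roughly 0.81 versus 0.04), so a detection algorithm run on the weighted network sees the bridge between the two triangles as much weaker than the links inside them.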

Abstract

Community detection plays a crucial role in understanding the structural organization of complex networks. Previous methods, particularly those from statistical physics, primarily focus on the analysis of mesoscopic network structures and often struggle to integrate fine-grained node similarities. To address this limitation, we propose a low-complexity framework that integrates machine learning to embed micro-level node-pair similarities into mesoscopic community structures. By leveraging ensemble learning models, our approach enhances both structural coherence and detection accuracy. Experimental evaluations on artificial and real-world networks demonstrate that our framework consistently outperforms conventional methods, achieving higher modularity and improved accuracy in NMI and ARI. Notably, when ground-truth labels are available, our approach yields the most accurate detection results, effectively recovering real-world community structures while minimizing misclassifications. To further explain our framework's performance, we analyze the correlation between node-pair similarity and evaluation metrics. The results reveal a strong and statistically significant correlation, underscoring the critical role of node-pair similarity in enhancing detection accuracy. Overall, our findings highlight the synergy between machine learning and statistical physics, demonstrating how machine learning techniques can enhance network analysis and uncover complex structural patterns.
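
To make the modularity criterion concrete, the sketch below computes the standard Newman-Girvan modularity for a partition, with an optional edge-weight argument. It assumes that $Q^w$ in the summary denotes this weighted variant; the toy graph and the weights (0.81 and 0.04) are hypothetical and are not taken from the paper.

```python
def modularity(edges, community, weights=None):
    """Newman-Girvan modularity of a partition.

    With `weights` given, link counts are replaced by summed edge weights,
    which yields the weighted variant.
    """
    if weights is None:
        weights = {e: 1.0 for e in edges}
    total = sum(weights[e] for e in edges)  # number of links, or total weight
    internal = {}  # weight of links inside each community
    strength = {}  # summed degree (or strength) of each community
    for (u, v) in edges:
        w = weights[(u, v)]
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - (strength[c] / (2.0 * total)) ** 2
        for c in strength
    )


# Toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

q = modularity(edges, community)

# Hypothetical similarity-derived weights: strong inside, weak on the bridge.
weights = {e: (0.04 if e == (2, 3) else 0.81) for e in edges}
q_w = modularity(edges, community, weights)
```

On this toy graph the unweighted value is 5/14, and down-weighting the bridge raises the weighted value, which is the qualitative effect a similarity-weighted network is meant to produce.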

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A real-world network illustrating the limitations of existing community detection algorithms. Nodes in different colors denote distinct communities. (a) The ground-truth community structure of the LOL network (with more details in Table \ref{tab:table1}). (b) The visualization of the LOL network’s communities detected via the Leiden algorithm.
  • Figure 2: Relationship between node-pair similarity and network modularity. Empirically, the network modularity $Q$ correlates strongly with node-pair similarity. $S_i^{in}$ is normalized by the sum of similarities between all node pairs and the total number of internal links within community $i$. Similarly, $S_i^{out}$ is normalized by the sum of similarities between all node pairs and the total number of external links connecting community $i$ to others. $Q_i$ denotes the modularity of community $i$, and $C_i$ represents the number of nodes within community $i$. To validate this assumption, we use the DBLP network (described in Table \ref{tab:table1}), with communities detected using the Infomap algorithm. Here $r$ is the Pearson correlation coefficient, and the $P$-value is from Student's $t$-test.
  • Figure 3: Overview of the proposed detection framework. In the absence of ground-truth communities, we implement the community detection framework using a statistical-physics approach. (b) We apply four commonly used community detection algorithms (Section \ref{section22}) to identify communities within the network, treating these detected communities as the ground truth. (c) We extract mesoscopic structural information from detected communities. Specifically, we sample first-order and second-order node pairs, classifying intra-community pairs as one category and inter-community pairs as another. These samples form the training and testing datasets for machine learning. (d) We introduce three microscopic structural features tailored to different real-world networks. (e) Classical machine learning methods are employed to predict node-pair similarity, while ensemble learning is used to enhance the performance of individual predictors. (f) A five-fold cross-validation strategy is applied to estimate similarity values for each node pair. (g) The predicted similarity values are integrated into the original network to facilitate community detection. (h) Finally, we evaluate the effectiveness of the proposed framework using three widely adopted performance metrics. Note that when ground-truth community information is available, step (b) can be omitted. We refer to this approach as the ground-truth method.
  • Figure 4: Performance improvement of the proposed framework relative to the original method. Here, $\Delta$ represents the percentage improvement in detection performance under the statistical-physics or ground-truth approaches. It is calculated as the difference between the highest detection performance achieved by these approaches and that of the original method, normalized by the latter.
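
The $\Delta$ described in the Figure 4 caption can be written as a one-line calculation. The sketch below follows the caption's wording (best score among the proposed variants minus the original score, divided by the original score, expressed as a percentage); the function name and the example scores are hypothetical, and it assumes a metric where higher is better.

```python
def percent_improvement(original, candidates):
    """Delta = (best candidate score - original score) / original score * 100."""
    if original == 0:
        raise ValueError("original performance must be non-zero")
    return (max(candidates) - original) / original * 100.0


# Hypothetical scores: original method 0.50; proposed variants 0.55 and 0.60.
delta = percent_improvement(0.50, [0.55, 0.60])
```

With these made-up scores the improvement is about 20 percent, since the best variant (0.60) exceeds the original (0.50) by 0.10.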