Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection

Cencheng Shen, Youngser Park, Carey E. Priebe

TL;DR

This work addresses three tasks at once: vertex embedding, community detection, and estimation of the unknown number of communities in a graph. It introduces a graph encoder ensemble built on a normalized one-hot encoder ($GEE1$) and a rank-based cluster-size measure (MRI). The resulting linear-time algorithm combines $Z = \mathbf{A}\mathbf{W}$ embeddings, $L2$ normalization, MRI-guided model selection, and ensemble $k$-means clustering to determine $K$ and the vertex labels. On SBM and DC-SBM simulations, the normalization and ensemble components substantially improve clustering accuracy (ARI) and cluster-size recovery compared to normalization-free and spectral baselines. The approach offers a scalable, unified framework for graph analytics on large networks where the true number of communities is unknown.
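
The embedding and normalization steps named above can be illustrated with a minimal sketch. It assumes that $\mathbf{W}$ is the one-hot label matrix with each column scaled by the inverse community count, which is one common form of the one-hot graph encoder and is not spelled out in this summary. The function name `encoder_embedding` and the toy graph are hypothetical, and the MRI and ensemble $k$-means steps are deliberately left out.

```python
import numpy as np

def encoder_embedding(A, labels, K, normalize=True):
    """Sketch of Z = A W followed by optional row-wise L2 normalization.

    Assumption (not stated in the summary): W[i, k] = 1 / n_k when
    labels[i] == k and 0 otherwise, where n_k is the size of community k.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = A.shape[0]
    counts = np.bincount(labels, minlength=K).astype(float)
    W = np.zeros((n, K))
    W[np.arange(n), labels] = 1.0 / counts[labels]
    Z = A @ W
    if normalize:
        # Put every nonzero row on the unit sphere; isolated vertices stay at 0.
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = np.divide(Z, norms, out=np.zeros_like(Z), where=norms > 0)
    return Z

# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])

Z = encoder_embedding(A, labels, K=2)
print(Z)
```

In the full algorithm the labels are unknown, so a routine like this would be called repeatedly inside the clustering loop with the current label estimates rather than with ground-truth labels.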

Abstract

In this paper, we introduce a novel and computationally efficient method for vertex embedding, community detection, and community size determination. Our approach leverages a normalized one-hot graph encoder and a rank-based cluster size measure. Through extensive simulations, we demonstrate the excellent numerical performance of our proposed graph encoder ensemble algorithm.

Paper Structure (14 sections, 11 equations, 2 figures, 2 tables, 1 algorithm)

Figures (2)

  • Figure 1: This figure visually demonstrates the effect of normalization. The left panel displays the adjacency heatmap of a simulated sparse graph using simulation 1 in Section \ref{sim1}. The center panel shows the resulting embedding without the normalization step, while the right panel displays the resulting embedding with normalization. The blue and red dots represent the true community labels of each vertex.
  • Figure 2: This figure presents the results of cluster size estimation using the graph encoder ensemble, evaluated across simulations and graph sizes. For each simulation and each graph size, we independently generate $100$ graphs and run the ensemble algorithm to estimate the community size. The left panel shows the estimation accuracy, defined as the proportion of cases in which the algorithm chooses the correct community size, as the graph size increases; the accuracy gradually improves and reaches $1$ for all simulations. The center panel focuses on simulation 3 at $n=5000$, where the MRI yields $\hat{K}=5$, matching the ground-truth size. The right panel shows the average Silhouette Score as an alternative size measure; it is biased towards smaller community sizes and chooses $\hat{K}_{SS}=2$, which differs from the ground-truth size.
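
The estimation accuracy described for Figure 2 is the fraction of replicates in which the estimated community size equals the true one. A minimal sketch of that calculation follows; the function name `estimation_accuracy` and the list of estimates are hypothetical illustrations, not values from the paper.

```python
def estimation_accuracy(estimated_sizes, true_size):
    """Proportion of replicates whose estimated community size is correct."""
    if len(estimated_sizes) == 0:
        raise ValueError("need at least one replicate")
    hits = sum(1 for k_hat in estimated_sizes if k_hat == true_size)
    return hits / len(estimated_sizes)

# Made-up estimates from 10 replicates with true community size 5.
k_hats = [5, 5, 4, 5, 5, 6, 5, 5, 5, 5]
acc = estimation_accuracy(k_hats, true_size=5)
print(acc)
```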