Determining the Number of Communities in Sparse and Imbalanced Settings
Zhixuan Shao, Can M. Le
TL;DR
This work tackles the challenge of determining the number of communities $K$ in sparse and imbalanced networks. It introduces a centered non-backtracking-type operator, $\hat{\underline{H}}$, whose leading eigenvalue serves as a robust goodness-of-fit statistic for testing $K$, with strong theoretical results on null behavior and asymptotic power in both dense and ultra-sparse regimes. Empirically, centering improves signal under imbalance, enabling more reliable detection of multiple blocks and more accurate label estimation, as demonstrated through extensive simulations and a political blog data example. The approach offers a tuning-free alternative to Bethe-Hessian methods, connects to non-backtracking walk theory, and provides practical recursive testing strategies for determining $K$ in real networks.
Abstract
Community structures represent a crucial aspect of network analysis, and various methods have been developed to identify these communities. However, a common hurdle lies in determining the number of communities $K$, a parameter that often requires estimation in practice. Existing approaches for estimating $K$ face two notable challenges: the weak community signal present in sparse networks, and the imbalance in community sizes or edge densities that results in unequal per-community expected degrees. We propose a spectral method based on a novel network operator whose spectral properties effectively overcome both challenges. This operator is a refined version of the non-backtracking operator, adapted from a "centered" adjacency matrix. Its leading eigenvalues are more concentrated than those of the adjacency matrix for sparse networks, and they also carry a stronger signal under imbalance, a benefit attributed to the centering step. These properties are justified, either theoretically or numerically, under the null model $K = 1$, in both dense and ultra-sparse settings. A goodness-of-fit test based on the leading eigenvalue can be applied to determine the number of communities $K$.
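
For readers unfamiliar with the operator family the abstract builds on, the following is a minimal sketch of the standard (uncentered) non-backtracking matrix of an undirected graph and of its leading eigenvalue moduli. It is not the centered operator proposed in this work, whose definition is not given in this excerpt. The function names `non_backtracking_matrix` and `leading_eigenvalue_moduli` are hypothetical, and the dense construction is only suitable for small graphs.

```python
import numpy as np


def non_backtracking_matrix(A):
    """Standard non-backtracking matrix of an undirected simple graph.

    Assumption: this is the classical uncentered operator, shown for
    illustration only; it is not the centered operator of the paper.

    Rows and columns are indexed by directed edges (u -> v). The entry for
    the pair ((u -> v), (x -> y)) is 1 when x == v and y != u, i.e. the walk
    continues from v without immediately returning to u.
    """
    A = np.asarray(A)
    n = A.shape[0]
    # Each undirected edge {u, v} yields two directed edges; self-loops are skipped.
    edges = [(u, v) for u in range(n) for v in range(n)
             if u != v and A[u, v] != 0]
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), i in index.items():
        for y in range(n):
            if y != u and y != v and A[v, y] != 0:
                B[i, index[(v, y)]] = 1.0
    return B, edges


def leading_eigenvalue_moduli(B, k=1):
    """Return the k largest eigenvalue moduli of B, in decreasing order."""
    moduli = np.abs(np.linalg.eigvals(B))
    return np.sort(moduli)[::-1][:k]


# Small example: the complete graph on 4 nodes, which is 3-regular.
A_complete = np.ones((4, 4)) - np.eye(4)
B_complete, directed_edges = non_backtracking_matrix(A_complete)
top = leading_eigenvalue_moduli(B_complete, k=1)
```

For a $d$-regular graph the leading eigenvalue of this matrix is $d - 1$, which gives a simple sanity check for the construction. A test for $K$ in the spirit of the abstract would compare a leading eigenvalue of the centered operator against its null behavior; that step depends on definitions outside this excerpt and is not sketched here.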
