Determining the Number of Communities in Sparse and Imbalanced Settings

Zhixuan Shao, Can M. Le

TL;DR

This work tackles the challenge of determining the number of communities $K$ in sparse and imbalanced networks. It introduces a centered non-backtracking-type operator, $\widehat{\underline{H}}$, whose leading eigenvalue serves as a goodness-of-fit statistic for testing $K$, supported by theoretical and numerical results on its null behavior and asymptotic power in both dense and ultra-sparse regimes. Empirically, centering improves the signal under imbalance, enabling more reliable detection of multiple blocks and more accurate label estimation, as demonstrated through simulations and a political blog data example. The approach offers a tuning-free alternative to Bethe-Hessian methods, connects to non-backtracking walk theory, and provides practical recursive testing strategies for determining $K$ in real networks.

Abstract

Community structures represent a crucial aspect of network analysis, and various methods have been developed to identify these communities. However, a common hurdle lies in determining the number of communities K, a parameter that often requires estimation in practice. Existing approaches for estimating K face two notable challenges: the weak community signal present in sparse networks and the imbalance in community sizes or edge densities that result in unequal per-community expected degree. We propose a spectral method based on a novel network operator whose spectral properties effectively overcome both challenges. This operator is a refined version of the non-backtracking operator, adapted from a "centered" adjacency matrix. Its leading eigenvalues are more concentrated than those of the adjacency matrix for sparse networks, while they also demonstrate enhanced signal under imbalance scenarios, a benefit attributed to the centering step. This is justified, either theoretically or numerically, under the null model K = 1, in both dense and ultra-sparse settings. A goodness-of-fit test based on the leading eigenvalue can be applied to determine the number of communities K.
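
To make the idea of a leading-eigenvalue goodness-of-fit test concrete, the following is a minimal sketch of the classical centered-and-rescaled adjacency statistic for the null $K = 1$, whose limit under an Erdős-Rényi model is Tracy-Widom with index 1. It is not the paper's centered non-backtracking operator, whose definition is not reproduced here. The helper names (`centered_scaled_adjacency`, `leading_eigenvalue_statistic`, `sample_sbm`) and the block probabilities used for the two-block example are illustrative assumptions.

```python
import numpy as np

def centered_scaled_adjacency(A):
    """Center A by the estimated Erdős-Rényi edge probability and rescale.

    Assumption: this is the classical adjacency-based normalization,
    not the paper's non-backtracking-type operator.
    """
    n = A.shape[0]
    p_hat = A.sum() / (n * (n - 1))  # edge density over off-diagonal pairs
    centered = A - p_hat * (np.ones((n, n)) - np.eye(n))
    return centered / np.sqrt((n - 1) * p_hat * (1 - p_hat))

def leading_eigenvalue_statistic(A):
    """n^{2/3} * (lambda_1 - 2); large values are evidence against K = 1."""
    n = A.shape[0]
    lam1 = np.linalg.eigvalsh(centered_scaled_adjacency(A))[-1]
    return n ** (2 / 3) * (lam1 - 2)

def sample_sbm(sizes, Q, rng):
    """Sample a symmetric, loop-free adjacency matrix from a block model."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    P = Q[np.ix_(labels, labels)]
    upper = np.triu(rng.random(P.shape) < P, k=1).astype(float)
    return upper + upper.T

rng = np.random.default_rng(0)

# Null model: a single block with edge probability 0.08.
null_graph = sample_sbm([500], np.array([[0.08]]), rng)

# Alternative: two imbalanced blocks (400 vs. 100 nodes); the
# probabilities below are hypothetical values chosen for illustration.
Q = np.array([[0.12, 0.04],
              [0.04, 0.12]])
two_block_graph = sample_sbm([400, 100], Q, rng)

stat_null = leading_eigenvalue_statistic(null_graph)
stat_two = leading_eigenvalue_statistic(two_block_graph)
print(f"K = 1 graph: {stat_null:.2f}")
print(f"K = 2 graph: {stat_two:.2f}")
```

In a test of $H_0: K = 1$, the statistic would be compared against an upper quantile of the reference null distribution; the two-block graph should produce a much larger value than the single-block one.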

Paper Structure (34 sections, 8 theorems, 106 equations, 12 figures, 2 tables)

Key Result

Proposition 1

Suppose a network $A$ is generated from an Erdős-Rényi model $G(n, p)$. Assume $\alpha / \log n \to \infty$. We have […] and […]. Moreover, if we assume $\alpha = \Omega (n^{2/3 + \epsilon})$ for some $\epsilon>0$, Conjecture 1 implies that […] and […].

Figures (12)

  • Figure 1: Growth rate of $\underline{v}_1^\top \underline{D} \, \underline{v}_1$ under different $p(n)$. Here we take $p\asymp 1$, $p\asymp n^{-1/3}$, $p\asymp n^{-1/2}$ and $p\asymp n^{-1}$, respectively. All of them appear to align with the bound in Conjecture 1.
  • Figure 2: Asymptotic order of the pairwise differences between $\mu_1(\widetilde{H})$, $y_1^{\top} \widetilde{H} x_1$, and $\lambda_1(\frac{\underline{A}}{\sqrt{\alpha}})$. We take $p\asymp n^{-1/3}$ in the left panel representing the denser regime, and $p\asymp n^{-1}$ in the right panel representing the ultra-sparse regime. In both regimes, $y_1^{\top} \widetilde{H} x_1$ closely approximates $\mu_1(\widetilde{H})$. The main contributor to the difference between $\mu_1(\widetilde{H})$ and $\lambda_1(\frac{\underline{A}}{\sqrt{\alpha}})$ comes from $y_1^{\top} \widetilde{H} x_1 - \lambda_1(\frac{\underline{A}}{\sqrt{\alpha}})$.
  • Figure 3: Null distributions of $\lambda_1(\widetilde{A})$ and $\mu_1(\underline{H})$ with $n^{2/3}$ scaling. We fix the average degree at 3 in the left panel, representing the ultra-sparse regime, and fix $p_0 = 0.08$ in the right panel. We let $n$ take various values, indicated by different colors. The blue dashed line represents the Tracy-Widom distribution with index 1.
  • Figure 4: Testing $H_1: K>1$ versus $H_0: K=1$ under sparsity. The two communities have equal sizes $n_1=n_2=250$. We set $p_0=0.01$ and let $Q_{11}$ and $Q_{22}$ grow with $\delta$ simultaneously as in \ref{eq:P11_P22_balanced_P}. The left panel shows the spectrum of each operator, where we fix $\delta = 0.6$. The right panel shows how the distribution of each test statistic changes with $\delta$. Also shown is the power curve, where the rejection rule is based on the $(1-\alpha)$-quantile of the null distribution. The values of the test statistic correspond to the left $y$-axis, while the power values correspond to the right $y$-axis.
  • Figure 5: Testing $H_1: K>1$ versus $H_0: K=1$ under community size imbalance $n_1 \ne n_2$. The larger community has $n_1=400$, and the smaller community has $n_2=100$. We set $p_0=0.08$ and let $Q_{11}$ and $Q_{22}$ grow with $\delta$ simultaneously as in \ref{eq:P11_P22_balanced_P}. We fix $\delta = 0.4$ for the spectra on the left.
  • ...and 7 more figures

Theorems & Definitions (16)

  • Conjecture 1: Concentration of $\underline{v}_1^\top \underline{D} \ \underline{v}_1$
  • Proposition 1: Tracy-Widom limit of $\mu_1(\underline{H})$
  • Proposition 2: Partial cancellation
  • Theorem 1: Growth rate of $\lambda_1(\widehat{\underline{A}})$
  • Proposition 3: Asymptotic difference between $\mu_1(\underline{H})$ and $\lambda_1(\underline{A})$
  • Proposition 4: Centering enhances signal under imbalance
  • Proposition 5: Spectral equivalence of $\underline{B}$ and $\underline{H}$
  • Proof: Proof of Proposition 1
  • Theorem 2: Theorem 4.4 in Demmel (1997), Applied Numerical Linear Algebra
  • Proof: Proof of Proposition \ref{proposition:order_of_vDv_constant_degree}
  • ...and 6 more
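
Proposition 5 concerns the centered operators $\underline{B}$ and $\underline{H}$, whose definitions are not reproduced in this summary. As a hedged illustration of what a spectral equivalence of this kind looks like, the sketch below checks the classical uncentered analogue, the Ihara-Bass identity $\det(I - uB) = (1-u^2)^{m-n}\det(I - uA + u^2(D - I))$, which implies that the $2m \times 2m$ non-backtracking matrix $B$ and the $2n \times 2n$ block matrix built from $A$ and $D$ share their spectrum apart from eigenvalues $\pm 1$. The function names and the random test graph are assumptions made for this example, and the paper's centered versions may differ in detail.

```python
import numpy as np

def non_backtracking_matrix(A):
    """Classical non-backtracking matrix on directed edges.

    B[(u -> v), (v -> w)] = 1 whenever w != u, and 0 otherwise.
    """
    n = A.shape[0]
    edges = [(u, v) for u in range(n) for v in range(n) if A[u, v]]
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), i in index.items():
        for w in range(n):
            if A[v, w] and w != u:
                B[i, index[(v, w)]] = 1.0
    return B

def ihara_bass_companion(A):
    """2n x 2n block matrix [[0, D - I], [-I, A]] from the Ihara-Bass identity."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    I = np.eye(n)
    return np.block([[np.zeros((n, n)), D - I], [-I, A]])

# Small random graph (hypothetical parameters, for illustration only).
rng = np.random.default_rng(1)
n = 30
upper = np.triu(rng.random((n, n)) < 0.2, k=1).astype(float)
A = upper + upper.T

B = non_backtracking_matrix(A)
H = ihara_bass_companion(A)

# The leading (Perron) eigenvalue of B should coincide with that of H.
rho_B = np.max(np.linalg.eigvals(B).real)
rho_H = np.max(np.linalg.eigvals(H).real)
print(f"leading eigenvalue of B ({B.shape[0]} x {B.shape[0]}): {rho_B:.6f}")
print(f"leading eigenvalue of H ({H.shape[0]} x {H.shape[0]}): {rho_H:.6f}")
```

The practical appeal of the smaller matrix is its size: it scales with the number of nodes rather than the number of directed edges, which matters when only the leading eigenvalue is needed.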