Recovering Small Communities in the Planted Partition Model

Martijn Gösgens, Maximilien Dreveton

TL;DR

This work addresses the problem of recovering planted partitions in the Planted Partition Model (PPM) when the number of communities is arbitrarily large and community sizes are heterogeneous. It introduces Diamond Percolation, a parameter-free, triangle-based method that retains only the edges of the observed graph whose endpoints share at least two common neighbors and derives the detected communities from the retained edges. Recovery is analyzed using a correlation-based criterion $\rho(C_n,T_n)$. The paper proves exact, almost exact, and weak recovery guarantees under mild assumptions, including for power-law partitions, without requiring prior knowledge of the number of communities $k_n$ or the size distribution, thereby extending classic results for balanced SBM-like models to unbalanced and growing partitions. It also treats power-law partitions in detail, establishing recovery guarantees across regimes for the number of communities and the intra-community density, and discusses practical extensions and future research directions for more complex network models and heterogeneity. Overall, the results offer a scalable approach with provable guarantees for community detection in networks with heavy-tailed community sizes and unknown community counts, grounded in a simple triangle-based rule connected to classic common-neighbor ideas.

Abstract

We analyze community recovery in the planted partition model (PPM) in regimes where the number of communities is arbitrarily large. We examine the three standard recovery regimes: exact recovery, almost exact recovery, and weak recovery. When communities vary in size, traditional accuracy- or alignment-based metrics become unsuitable for assessing the correctness of a predicted partition. To address this, we redefine these recovery regimes using the correlation coefficient, a more versatile metric for comparing partitions. We then demonstrate that $\textit{Diamond Percolation}$, an algorithm based on common-neighbors, successfully recovers communities under mild assumptions on edge probabilities, with minimal restrictions on the number and sizes of communities. As a key application, we consider the case where community sizes follow a power-law distribution, a characteristic frequently found in real-world networks. To the best of our knowledge, we provide the first recovery results for such unbalanced partitions.
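The abstract does not spell out the correlation coefficient. A minimal sketch follows, assuming it is the Pearson correlation between the two binary vectors that record, for every vertex pair, whether the pair lies in the same community. The function name `pair_correlation` and the label-list input format are illustrative choices, not taken from the paper.

```python
from collections import Counter
from math import comb, sqrt


def pair_correlation(labels_a, labels_b):
    """Pearson correlation between the pair-indicator vectors of two partitions.

    Assumption: each partition is given as a list of community labels, one
    per vertex, and rho is the Pearson correlation over all vertex pairs of
    the indicators "the pair shares a community".
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same vertices")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Number of vertex pairs placed together by each partition.
    m_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    m_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Number of vertex pairs placed together by both partitions.
    m_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    denom = sqrt(m_a * (total_pairs - m_a) * m_b * (total_pairs - m_b))
    if denom == 0:
        # Happens when a partition puts all pairs together or none together.
        raise ValueError("correlation is undefined for a constant indicator")
    return (total_pairs * m_ab - m_a * m_b) / denom
```

Under this assumption the value depends only on which pairs are grouped together, so it is unaffected by relabeling the communities and requires no alignment between the communities of the two partitions.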

Paper Structure

This paper contains 47 sections, 16 theorems, 159 equations, 2 figures, 1 algorithm.

Key Result

Lemma 1

Algorithm 1 has $\mathcal{O}(n+|E|)$ space complexity and $\mathcal{O}(n+\sum_{i\in[n]}d_i^2)$ time complexity, where $d_i$ denotes the degree of vertex $i$ in $G$.
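The following is a minimal sketch of a common-neighbors partitioning step of this kind, not the paper's own pseudocode. It assumes that an edge is retained when its endpoints share at least two common neighbors, and that the detected communities are the connected components of the retained graph $G^*$ (an assumption consistent with the Figure 1 caption). The function name `diamond_percolation` and the edge-list input are hypothetical.

```python
from collections import defaultdict


def diamond_percolation(n, edges):
    """Partition vertices 0..n-1 using edges with >= 2 common neighbors.

    Assumption: detected communities are the connected components of the
    graph that keeps only edges whose endpoints share at least two neighbors.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Union-find over vertices to track components of the retained graph.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u in range(n):
        for v in adj[u]:
            # Visit each undirected edge once; the set intersection costs
            # O(min(d_u, d_v)) on average.
            if u < v and len(adj[u] & adj[v]) >= 2:
                parent[find(u)] = find(v)

    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)
    return sorted(groups.values())


# A 4-clique on {0,1,2,3}, a triangle on {4,5,6}, and a bridge 3-4.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                 (3, 4), (4, 5), (4, 6), (5, 6)]
communities = diamond_percolation(7, example_edges)
```

In this example every edge of the 4-clique has two common neighbors and is retained, while the triangle edges have only one common neighbor each, so vertices 4, 5, and 6 end up isolated. The adjacency sets use $\mathcal{O}(n+|E|)$ memory, and the total intersection work is bounded by a sum over edges of the smaller endpoint degree, which is at most $\sum_{i\in[n]}d_i^2$.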

Figures (2)

  • Figure 1: Algorithm 1 is illustrated on a PPM consisting of two equally-sized communities of size $10$ each, with $p=\tfrac{1}{2}$ and $q=\tfrac{1}{20}$. The true communities correspond to the red circles and blue squares. The solid lines are the edges of $G^*$, while the dashed lines are the edges of $G$ that are not retained in $G^*$. The orange shaded regions represent the detected communities. We see that the two communities are correctly separated, but that two vertices are incorrectly isolated.
  • Figure 2: Panel fig:scores-p: Estimation of the quantity $\Delta$ defined in Equation (eq:weak-balanced). For each estimate, we sample $5000$ random graphs from $ER(s,p)$ and apply Algorithm 1 to each of them. Panel fig:scores: Comparison of the performance of Algorithm 1 to the estimated asymptotic performance established in Equation (eq:weak-balanced), when $T_{n}\sim\mathrm{Balanced}(k\cdot s,k,s)$ with $p=0.5$, $s=11$, and $q=5/(k\cdot s-s)$ (so that, in expectation, every vertex has five neighbors inside its community and five outside).

Theorems & Definitions (40)

  • Lemma 1
  • Theorem 2
  • Lemma 3
  • Theorem 4
  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Theorem 5
  • Example 5
  • ...and 30 more