Semi-Supervised Clustering of Sparse Graphs: Crossing the Information-Theoretic Threshold

Junda Sheng; Thomas Strohmer

TL;DR

The paper studies semi-supervised clustering on sparse stochastic block models (SBMs) and shows that revealing any nonzero fraction of the labels removes the classic Kesten-Stigum (KS) and information-theoretic barriers to recovery. It introduces two efficient approaches that integrate label information with graph structure: the census method, a combinatorial majority vote over local neighborhoods, and a constrained SDP (CSDP) framework. Both achieve detection across all parameter regimes. The authors provide non-asymptotic guarantees, including overlap bounds and a testing scheme that distinguishes the SBM from the Erdős-Rényi model (ERM) under semi-supervision. Numerical experiments support the theory and show the phase transition disappearing at modest reveal rates. Together, these results offer new insight into the fundamental limits and robustness of SDP-based methods in semi-supervised graph clustering, with potential practical impact on real-world networks where partial labels are available.

Abstract

The stochastic block model is a canonical random graph model for clustering and community detection on network-structured data. Decades of extensive study on the problem have established many profound results, among which the phase transition at the Kesten-Stigum threshold is particularly interesting both from a mathematical and an applied standpoint. It states that no estimator based on the network topology can perform substantially better than chance on sparse graphs if the model parameter is below a certain threshold. Nevertheless, if we slightly extend the horizon to the ubiquitous semi-supervised setting, such a fundamental limitation will disappear completely. We prove that with an arbitrary fraction of the labels revealed, the detection problem is feasible throughout the parameter domain. Moreover, we introduce two efficient algorithms, one combinatorial and one based on optimization, to integrate label information with graph structures. Our work brings a new perspective to the stochastic model of networks and semidefinite program research.

Paper Structure (19 sections, 21 theorems, 112 equations, 8 figures)

Introduction
Clustering on Graphs
Sparse Regime and Kesten-Stigum Threshold
Basic Algorithms
Semi-Supervised Learning
Our Results
Proof Techniques
Outline
Notation
Census Method
Majority of t-Neighbors
Locally Tree-Like Structure
Majority of 1-Neighbors
Semi-Supervised SDP
SDP for Community Detection
...and 4 more sections

Key Result

Theorem 3

[Kesten-Stigum threshold] Let $\mathcal{G}(n, a/n, b/n)$ be a symmetric SBM with two balanced clusters and $a, b = O(1)$. The weak recovery problem is solvable and efficiently so, if and only if $(a-b)^2 > 2(a + b)$.
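
The condition in Theorem 3 can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the ratio $(a-b)^2 / (2(a+b))$ is assumed to be the SNR quoted in the Figure 3 caption, since it reproduces the value $\approx 0.64$ given there for $a = 5$, $b = 2$.

```python
def snr(a: float, b: float) -> float:
    """Ratio (a - b)^2 / (2 (a + b)); assumed to match the SNR in Figure 3."""
    if a + b <= 0:
        raise ValueError("a + b must be positive")
    return (a - b) ** 2 / (2.0 * (a + b))


def above_ks_threshold(a: float, b: float) -> bool:
    """Condition of Theorem 3: (a - b)^2 > 2 (a + b)."""
    return (a - b) ** 2 > 2.0 * (a + b)


# Parameters from the Figure 3 caption: a = 5, b = 2.
print(round(snr(5, 2), 2))       # 0.64
print(above_ks_threshold(5, 2))  # False: below the threshold
```

With $a = 5$ and $b = 2$ the ratio is $9/14$, so this parameter choice lies below the threshold, in the regime where weak recovery from the graph topology alone is not possible.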

Figures (8)

Figure 1: The left image represents the adjacency matrix of one realization of $\mathcal{G}(100, 0.12, 0.05)$, where the detection is theoretically possible. Yet the data is given non-colored (middle) and also non-ordered (right).
Figure 2: Neighborhood of node $v$ with a tree structure. The ground truth of clusters is coded in black and white. The shaded area indicates those nodes randomly guessed to be in the same community or the opposite community as $v$. The annulus represents the collection of its $t$-neighbors.
Figure 3: The simulation result of $\mathcal{G}(3000, 5/3000, 2/3000)$, $\text{SNR}\approx0.64$. Solid curves stand for the average overlaps of the $t$-neighbors census method ($t$ = 1, 2, and 3) on 60 independent realizations of the random graph. The shaded area represents the standard error band of the 1-neighbors census. The dashed curve stands for the asymptotic lower bound we conclude from our calculation.
Figure 4: Disappearance of the phase transition.
Figure 5: Overlap heatmaps of the unsupervised (left) and the semi-supervised (right) SDPs. The coordinates correspond to the model parameters $a$ and $b$. The solid line represents the KS and information-theoretic threshold. The dashed line corresponds to $a=b$.
...and 3 more figures
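
The kind of experiment described in the Figure 3 caption can be sketched as a majority vote over revealed 1-neighbors. This is a rough illustration under stated assumptions, not the authors' implementation: labels are assumed to be revealed independently with a fixed probability, ties are broken at random, the overlap is taken as the mean agreement on hidden nodes, and every function name is hypothetical. The paper's exact definitions may differ.

```python
import random


def sample_sbm(n, a, b, rng):
    """Two balanced clusters; edge probability a/n within, b/n across."""
    labels = [1 if i < n // 2 else -1 for i in range(n)]
    rng.shuffle(labels)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = a / n if labels[i] == labels[j] else b / n
            if rng.random() < p:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return labels, neighbors


def census_1_neighbors(labels, neighbors, reveal_fraction, rng):
    """Majority vote over revealed neighbors (assumes a > b)."""
    n = len(labels)
    # Assumption: each label is revealed independently.
    revealed = [rng.random() < reveal_fraction for _ in range(n)]
    estimate = []
    for v in range(n):
        if revealed[v]:
            estimate.append(labels[v])
            continue
        vote = sum(labels[u] for u in neighbors[v] if revealed[u])
        if vote == 0:
            vote = rng.choice((-1, 1))  # random tie-break (assumption)
        estimate.append(1 if vote > 0 else -1)
    return estimate, revealed


def overlap_on_hidden(labels, estimate, revealed):
    """Mean agreement in [-1, 1] over nodes whose label was hidden."""
    hidden = [v for v in range(len(labels)) if not revealed[v]]
    if not hidden:
        return 0.0
    return sum(labels[v] * estimate[v] for v in hidden) / len(hidden)


rng = random.Random(0)
# Smaller n than in Figure 3 to keep the pure-Python loop fast.
labels, neighbors = sample_sbm(600, 5, 2, rng)
estimate, revealed = census_1_neighbors(labels, neighbors, 0.3, rng)
print(overlap_on_hidden(labels, estimate, revealed))
```

A single run on a small graph is noisy. Figure 3 averages 60 independent realizations with $n = 3000$, so this sketch should be read only as a description of the voting rule.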

Theorems & Definitions (34)

Definition 1: Planted bisection model
Definition 2
Theorem 3
Theorem 4
Theorem 5
Definition 6: Semi-supervised planted bisection model
Remark 7
Remark 8
Remark 9
Definition 10
...and 24 more
