Community Detection for Contextual-LSBM: Theoretical Limitations of Misclassification Rate and Efficient Algorithms

Dian Jin; Yuqian Zhang; Qiaosheng Zhang

TL;DR

This work studies community detection in the Contextual-Labeled Stochastic Block Model (CLSBM), where network structure (LSBM) and Gaussian node attributes jointly inform community labels. It derives an information-theoretic lower bound on the misclassification rate, showing that any algorithm incurs at least $n \exp(-n D(\boldsymbol \alpha, \mathbf P, \boldsymbol{\mu}))$ misclassifications, where $D$ blends topological and attribute divergences. An efficient spectral method is proposed via aggregation into a latent-factor model: $\mathbf S = \sum_l w_l \mathbf A_l + \frac{1}{n}\mathbf X^\top \mathbf X$, with population mean $\mathbf M = \mathbf Z (\mathbf P_s + \frac{1}{n} \boldsymbol \mu^\top \boldsymbol \mu) \mathbf Z^\top - \text{diag}(\mathbf Z \mathbf P_s \mathbf Z^\top)$. Theoretical guarantees give a polynomial misclassification rate bound $s(\bar{\sigma}) \lesssim \frac{K}{n \cdot \mathrm{SNR}}$, where $\mathrm{SNR}$ captures the separation in the aggregated signal; this makes the method a useful initialization for refinement procedures that may achieve the optimal exponential rate. Overall, the paper establishes a benchmark for CLSBM performance, linking it to CH-divergence and Gaussian-mixture limits, and motivates refinement steps to bridge the polynomial and exponential recovery regimes.
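The aggregation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's algorithm: it assumes $\mathbf X$ is stored as a `(d, n)` array so that $\mathbf X^\top \mathbf X$ is `n × n`, and it clusters the leading eigenvectors with a plain Lloyd-style k-means using farthest-point initialization, which is a stand-in for whatever clustering step the paper actually specifies. The function names `aggregate` and `spectral_labels`, the weights, and the synthetic parameters (dense edge probabilities, well-separated means) are all hypothetical choices made so that the toy example is easy to recover.

```python
import numpy as np

def aggregate(adjs, weights, X):
    """Form S = sum_l w_l A_l + (1/n) X^T X, with X of shape (d, n)."""
    n = X.shape[1]
    S = sum(w * A for w, A in zip(weights, adjs))
    return S + (X.T @ X) / n

def spectral_labels(S, K, n_iter=50):
    """Embed nodes with the K leading eigenvectors of S, then run k-means.

    The k-means step (farthest-point init + Lloyd iterations) is an
    assumption of this sketch, not taken from the source.
    """
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(np.abs(vals))[-K:]          # K largest |eigenvalues|
    U = vecs[:, idx] * vals[idx]                 # eigenvalue-scaled embedding
    centers = [U[0]]                             # farthest-point initialization
    for _ in range(K - 1):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):                      # Lloyd iterations
        dist = ((U[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(0)
    return labels

# Toy instance (all parameters hypothetical): one edge label, two communities.
rng = np.random.default_rng(0)
n, K = 60, 2
z = np.repeat([0, 1], n // 2)                    # ground-truth communities
P = np.array([[0.8, 0.1], [0.1, 0.8]])           # edge probabilities
R = rng.random((n, n)) < P[z][:, z]
A = np.triu(R, 1).astype(float)
A = A + A.T                                      # symmetric, no self-loops
mu = np.array([[3.0, -3.0], [0.0, 0.0]])         # community means, shape (d, K)
X = mu[:, z] + rng.standard_normal((2, n))       # Gaussian attributes, (d, n)

S = aggregate([A], [1.0], X)
labels = spectral_labels(S, K)
```

Because community labels are only identifiable up to a permutation, `labels` should be compared with `z` after relabeling rather than entry by entry.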

Abstract

The integration of network information and node attribute information has recently gained significant attention in the community detection literature. In this work, we consider community detection in the Contextual Labeled Stochastic Block Model (CLSBM), where the network follows an LSBM and node attributes follow a Gaussian Mixture Model (GMM). Our primary focus is the misclassification rate, which measures the expected number of nodes misclassified by community detection algorithms. We first establish a lower bound on the optimal misclassification rate that holds for any algorithm. When we specialize our setting to the LSBM (which preserves only network information) or the GMM (which preserves only node attribute information), our lower bound recovers prior results. Moreover, we present an efficient spectral-based algorithm tailored for the CLSBM and derive an upper bound on its misclassification rate. Although the algorithm does not attain the lower bound, it serves as a reliable starting point for designing more accurate community detection algorithms (many algorithms use a spectral method as an initial step, followed by refinement procedures to enhance accuracy).
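To make the notion of "number of misclassified nodes" concrete, the sketch below counts label mismatches after minimizing over all relabelings of the estimated communities. This minimum-over-permutations convention is the usual one in the community detection literature and is assumed here; the paper's own definition is not reproduced in this summary. The function name `misclassified` is hypothetical, and the brute-force search is only practical for small $K$.

```python
from itertools import permutations

import numpy as np

def misclassified(true_labels, est_labels, K):
    """Number of mismatched nodes, minimized over permutations of the K labels."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    best = len(true_labels)
    for perm in permutations(range(K)):
        relabeled = np.asarray(perm)[est_labels]   # apply the relabeling
        best = min(best, int((relabeled != true_labels).sum()))
    return best

# A pure relabeling counts as zero errors.
errors_swapped = misclassified([0, 0, 1, 1], [1, 1, 0, 0], K=2)
# One node cannot be matched under any relabeling.
errors_one_off = misclassified([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 0], K=3)
```

The misclassification rate discussed in the abstract is the expectation of this count over the randomness of the model and the algorithm.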

Paper Structure (20 sections, 5 theorems, 51 equations, 1 algorithm)

Introduction
Organization
Notations
Related work and main contributions
Model description and main result
Model definition
Lower bound on the number of misclassification nodes
Spectral community detection for latent factor model
Aggregated latent factor model
Theoretical guarantee
Conclusion
Proof of the main theorem (thm:main)
Definition of the perturbed model $\Psi$
Part 1
Part 2
...and 5 more sections

Key Result

Theorem 1

Denote $\bar{p}:=\max_{i,j,l\geq 1}\mathbf P(i,j,l)$. Grant assumption ass:1 and assume $\bar{p}=\omega(1/n)$, $\bar{p}=o(1)$, and $\eta_2=o(n)$. Let $s=o(n)$. If there exists an algorithm that asymptotically has fewer misclassified nodes than $s$ in expectation, i.e., $\limsup_{n\rightarrow \infty} \ldots$ [...], where $\bar{s}$ is defined in def:misrate.

Theorems & Definitions (15)

Definition 1: LSBM
Definition 2: CLSBM
Theorem 1: Lower bound
proof
Remark 1
Lemma 1
proof
proof
proof
Lemma 2
...and 5 more
