Exact Recovery in the Data Block Model

Amir R. Asadi, Akbar Davoodi, Ramin Javadi, Farzad Parvaresh

TL;DR

This work studies exact recovery in the Data Block Model (DBM), a data-augmented stochastic block model, and introduces the Chernoff--TV divergence to sharply characterize the recovery threshold in the logarithmic-degree regime. It proves achievability via a polynomial-time two-stage algorithm that jointly leverages graph structure and vertex attributes, and provides a matching information-theoretic converse, establishing when exact recovery is impossible. The framework recovers the classical SBM threshold in the uninformative-data limit and reduces to data-only limits when the graph carries little information, thereby unifying several prior models of side information. Simulations in the two-community setting validate the phase transition and demonstrate substantial gains from vertex data, highlighting practical implications for incorporating side information in network clustering.

Abstract

Community detection in networks is a fundamental problem in machine learning and statistical inference, with applications in social networks, biological systems, and communication networks. The stochastic block model (SBM) serves as a canonical framework for studying community structure, and exact recovery, identifying the true communities with high probability, is a central theoretical question. While classical results characterize the phase transition for exact recovery based solely on graph connectivity, many real-world networks contain additional data, such as node attributes or labels. In this work, we study exact recovery in the Data Block Model (DBM), an SBM augmented with node-associated data, as formalized by Asadi, Abbe, and Verdú (2017). We introduce the Chernoff--TV divergence and use it to characterize a sharp exact recovery threshold for the DBM. We further provide an efficient algorithm that achieves this threshold, along with a matching converse result showing impossibility below the threshold. Finally, simulations validate our findings and demonstrate the benefits of incorporating vertex data as side information in community detection.

Paper Structure

This paper contains 18 sections, 10 theorems, 104 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.2

Let $P$ be a prior distribution on $[k]$, and let $\mathbf{Q}\in \mathbb{R}_+^{k\times k}$ be a symmetric matrix. For each $n\ge 1$, let [...]. Exact recovery for the sequence $\{(G_n,X^n)\}_{n\ge1}$ is achievable (and efficiently so) if [...] for all $i,j\in[k]$ with $i\neq j$. Conversely, exact recovery is impossible if there exist $i\neq j$ such that [...], where $(\operatorname{diag}(P)\mathbf{Q})_{\dots}$ [...]

Figures (7)

  • Figure 1: Example of a DBM
  • Figure 2: $1-\mathrm{ERP}$ versus $a$ at fixed $\alpha$ ($n=1000$, $b=10$, $M=1000$). Vertical guidelines mark $a^\star_{\rm DBM}(b,\alpha)$ and $a^\star_{\rm SBM}(b)$.
  • Figure 3: Mean misclassification error (log scale) versus $a$ at fixed $\alpha$ ($n=1000$, $b=10$, $M=1000$).
  • Figure 4: Heatmaps of mean accuracy across the $(a,\alpha)$ grid. The DBM threshold curve $a^\star_{\rm DBM}(b,\alpha)=(\sqrt b+\sqrt{2(1-\alpha)})^2$ and the SBM threshold at $a^\star_{\rm SBM}(b)=(\sqrt b+\sqrt2)^2$ are overlaid.
  • Figure 5: Heatmaps of ERP across the $(a,\alpha)$ grid, with the same threshold overlays.
  • ...and 2 more figures
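
The two threshold expressions quoted in the Figure 4 caption can be evaluated directly. The following is a minimal sketch, not the authors' simulation code: the function names `a_star_sbm` and `a_star_dbm` are hypothetical, the formulas are copied from the caption, and $\alpha$ is treated only as the parameter appearing there (its precise definition is not reproduced in this summary). The check $\alpha \le 1$ is an assumption added so that the square root stays real.

```python
import math


def a_star_sbm(b: float) -> float:
    """SBM threshold from the Figure 4 caption: (sqrt(b) + sqrt(2))^2."""
    return (math.sqrt(b) + math.sqrt(2.0)) ** 2


def a_star_dbm(b: float, alpha: float) -> float:
    """DBM threshold from the Figure 4 caption: (sqrt(b) + sqrt(2(1 - alpha)))^2.

    Assumption (not stated in this summary): alpha <= 1, so that
    2 * (1 - alpha) is non-negative and the square root is real.
    """
    if alpha > 1.0:
        raise ValueError("alpha must be at most 1 for a real-valued threshold")
    return (math.sqrt(b) + math.sqrt(2.0 * (1.0 - alpha))) ** 2


if __name__ == "__main__":
    b = 10.0  # value of b used in the figure captions
    print(f"a*_SBM(b={b}) = {a_star_sbm(b):.4f}")
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):  # illustrative grid, not the paper's
        print(f"alpha={alpha:.2f}  a*_DBM = {a_star_dbm(b, alpha):.4f}")
```

Under these formulas the DBM curve coincides with the SBM threshold at $\alpha=0$, decreases as $\alpha$ grows, and reduces to $a^\star_{\rm DBM}=b$ at $\alpha=1$, which matches the gap between the two vertical guidelines described for Figure 2.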

Theorems & Definitions (33)

  • Definition 2.1
  • Definition 2.2
  • Example 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 3.1: Chernoff--Hellinger divergence
  • Theorem 3.2 [abbe015]
  • Definition 3.3: Relative entropy
  • Definition 3.4: Chernoff information [chernoff1952measure]
  • Proposition 3.5: CH-divergence as Chernoff information [asadi2017compressing]
  • ...and 23 more