
On the Robustness of Spectral Algorithms for Semirandom Stochastic Block Models

Aditya Bhaskara, Agastya Vibhuti Jha, Michael Kapralov, Naren Sarayu Manoj, Davide Mazzali, Weronika Wrzos-Kaminska

TL;DR

This work investigates the robustness of spectral clustering for graph bisection under semirandom adversaries. Analyzing both the nonhomogeneous symmetric stochastic block model (NSSBM) and a deterministic-clusters model, it shows that unnormalized spectral bisection achieves exact recovery under suitable gap conditions, even when an adversary adds edges inside clusters or raises within-cluster edge probabilities. In contrast, normalized spectral bisection is provably inconsistent on certain NSSBM instances, where it misclassifies a constant fraction of the vertices. Numerical experiments complement the theory.

Abstract

In a graph bisection problem, we are given a graph $G$ with two equally-sized unlabeled communities, and the goal is to recover the vertices in these communities. A popular heuristic, known as spectral clustering, is to output an estimated community assignment based on the eigenvector corresponding to the second smallest eigenvalue of the Laplacian of $G$. Spectral algorithms can be shown to provably recover the cluster structure for graphs generated from certain probabilistic models, such as the Stochastic Block Model (SBM). However, spectral clustering is known to be non-robust to model mis-specification. Techniques based on semidefinite programming have been shown to be more robust, but they incur significant computational overheads. In this work, we study the robustness of spectral algorithms against semirandom adversaries. Informally, a semirandom adversary is allowed to "helpfully" change the specification of the model in a way that is consistent with the ground-truth solution. Our semirandom adversaries in particular are allowed to add edges inside clusters or increase the probability that an edge appears inside a cluster. Semirandom adversaries are a useful tool to determine the extent to which an algorithm has overfit to statistical assumptions on the input. On the positive side, we identify classes of semirandom adversaries under which spectral bisection using the _unnormalized_ Laplacian is strongly consistent, i.e., it exactly recovers the planted partitioning. On the negative side, we show that in these classes spectral bisection with the _normalized_ Laplacian outputs a partitioning that makes a classification mistake on a constant fraction of the vertices. Finally, we demonstrate numerical experiments that complement our theoretical findings.
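
The spectral heuristic described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a dense NumPy adjacency matrix, uses the unnormalized Laplacian $L = D - A$, and splits vertices by the sign of the eigenvector of the second smallest eigenvalue. The function name and the toy graph are hypothetical.

```python
import numpy as np

def unnormalized_spectral_bisection(adj):
    """Split vertices by the sign of the second eigenvector of L = D - A."""
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)
    laplacian = np.diag(degrees) - adj
    # For symmetric input, eigh returns eigenvalues in ascending order,
    # so column 1 belongs to the second smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    u2 = eigvecs[:, 1]
    # The global sign of an eigenvector is arbitrary, so the labels are
    # only meaningful up to swapping the two sides.
    return u2 >= 0

# Toy input (illustrative only): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = np.zeros((6, 6))
for u, v in edges:
    adj[u, v] = adj[v, u] = 1.0

labels = unnormalized_spectral_bisection(adj)
print(labels)
```

On this toy graph the two triangles end up on opposite sides of the cut. Dense eigendecomposition is used only for brevity; a sparse eigensolver would likely be preferable for large graphs.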

Paper Structure

This paper contains 27 sections, 35 theorems, 147 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $p,\overline{p},q$ be probabilities such that $q < p \le \overline{p}$ and such that $\alpha \coloneqq \overline{p}/(p-q)$ is an arbitrary constant. Let $\mathcal{D} \in \mathsf{NSSBM}(n,p,\overline{p},q)$, and let $n \ge N(\alpha)$, where the function $N(\alpha)$ depends only on $\alpha$. There exists [...] then unnormalized spectral bisection is strongly consistent on $\mathcal{D}$.
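
To make the setting of Theorem 1 concrete, the following Python sketch samples a graph and runs unnormalized spectral bisection on it. It rests on an assumption about the model, inferred from the abstract and the parameter names rather than quoted from the paper's definition: each within-cluster pair is joined independently with some probability in $[p,\overline{p}]$, and each cross-cluster pair with probability $q$. The uniform random choice of within-cluster probabilities, the function names, and the parameter values are all hypothetical.

```python
import numpy as np

def sample_nssbm_like(n, p, p_bar, q, rng):
    """Sample a graph with two planted clusters of size n/2 (assumed model)."""
    truth = np.arange(n) < n // 2
    same_cluster = truth[:, None] == truth[None, :]
    # Hypothetical adversary: each within-cluster pair gets its own
    # probability drawn from [p, p_bar]; cross-cluster pairs keep q.
    boosted = rng.uniform(p, p_bar, size=(n, n))
    probs = np.where(same_cluster, boosted, q)
    # Only the strict upper triangle is sampled, then mirrored, so the
    # adjacency matrix is symmetric with a zero diagonal.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(float)
    return adj, truth

def second_eigenvector(adj):
    """Eigenvector of the second smallest eigenvalue of L = D - A."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigh(laplacian)[1][:, 1]

def agreement(pred, truth):
    """Fraction of correctly labeled vertices, up to swapping the sides."""
    match = np.mean(pred == truth)
    return max(match, 1.0 - match)

rng = np.random.default_rng(0)
adj, truth = sample_nssbm_like(n=200, p=0.6, p_bar=0.9, q=0.1, rng=rng)
pred = second_eigenvector(adj) >= 0
score = agreement(pred, truth)
print(f"agreement with planted bisection: {score:.3f}")
```

An agreement of 1 corresponds to exact recovery, which is what strong consistency asks for with high probability. A single sample at one parameter setting illustrates the setup only and says nothing about the theorem's precise conditions.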

Figures (5)

  • Figure 1: Top left, bottom left: Agreement with the planted bisection of the bipartition obtained from several matrices associated with an input graph generated from a distribution in $\mathsf{NSSBM}(n,p,\overline{p},q)$ for fixed values of $n,p,q$ and varying values of $\overline{p}$. In the top left plot, the bipartition is the $0$-cut of the second eigenvector, as in \ref{alg:spectral_general}. In the bottom left plot, the bipartition is the sweep cut of the first $n/2$ vertices in the second eigenvector. The dashed vertical line corresponds to $\overline{p}_{\textsf{max}}=\overline{p}_{\textsf{max}}(n,p,q)$ (see \ref{eq:defpbarmax}), and the solid vertical line corresponds to $\overline{p}_{\textsf{thr}}=\overline{p}_{\textsf{thr}}(n,p,q)$ (see \ref{eq:defpbarthr}). Top middle, top right, bottom middle: Embedding of the vertices given by the second eigenvector $\bm{u}_2$ of several matrices associated with a graph sampled from $\mathcal{D}_{p,\overline{p},q}$ with $\overline{p} = \overline{p}_{\textsf{thr}}$. Horizontal dashed lines, from top to bottom, correspond to $1/\sqrt{n}, 0, -1/\sqrt{n}$ respectively. Bottom right: Variance of the embedding in the second eigenvector $\bm{u}_2$ of the unnormalized Laplacian with respect to the ideal eigenvector $\bm{u}_2^{\star}$ (see \ref{eq:variance}), for input graphs generated from a distribution in $\mathsf{NSSBM}(n,p,\overline{p},q)$ with fixed values of $n,p,q$ and varying values of $\overline{p}$.
  • Figure 2: Agreement with the planted bisection of the bipartition obtained from unnormalized spectral bisection, for graphs generated from a distribution in $\mathsf{NSSBM}(n,p,\overline{p},q)$ for fixed values of $n,\overline{p}$ and varying values of $p>q$. The left plot uses $\overline{p}=1/2$, the right plot uses $\overline{p}=1$. The solid red curves plot the function $p_{\mathsf{thr}}(q)$ (see \ref{eq:thr}), and the dashed red curves plot the function $p_{\mathsf{info}}(q)$ (see \ref{eq:info}).
  • Figure 3: Agreement with the planted bisection of the bipartition obtained from several matrices associated with an input graph generated from a distribution $\mathcal{D}_{q}^{G_1,G_2} \in \mathsf{DCM}(n,d_{\mathsf{in}},q)$ for fixed values of $n,q$ and varying the size of the planted clique $S$. In the left plot, the bipartition is the $0$-cut of the second eigenvector, as in \ref{alg:spectral_general}. In the right plot, the bipartition is the sweep cut of the first $n/2$ vertices in the second eigenvector.
  • Figure 4: The minimum in-cluster degree $d_{\mathsf{in}}$ and the spectral gap $\lambda_3(\widehat{\mathbf{L}})-\lambda_2(\widehat{\mathbf{L}})$ of distributions $\mathcal{D}_{q}^{G_1,G_2} \in \mathsf{DCM}(n,d_{\mathsf{in}},q)$ with fixed values of $n,q$ and varying the size of the planted clique $S$. The red horizontal line on the left corresponds to the value $nq+\sqrt{n}$, the red horizontal line on the right corresponds to the value $\sqrt{n} + nq+\sqrt{nq \log n} +\log n$.
  • Figure 5: Embedding of the vertices given by the second eigenvector $\bm{u}_2$ of several matrices associated with a graph sampled from a distribution $\mathcal{D}_{q}^{G_1,G_2} \in \mathsf{DCM}(n,d_{\mathsf{in}},q)$, with the size of the planted clique set to $|S|=2/5 \cdot n$. Horizontal dashed lines, from top to bottom, correspond to $1/\sqrt{n}, 0, -1/\sqrt{n}$ respectively.
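
The captions above compare two ways of rounding the second eigenvector into a bipartition: the $0$-cut and the sweep cut of the first $n/2$ vertices. The sketch below shows one plausible reading of each, assuming that the sweep cut sorts the vertices by their eigenvector entry and places the first $n/2$ on one side. The function names and the example vector are made up for illustration and are not taken from the paper.

```python
import numpy as np

def zero_cut(u):
    """Split by sign: entries >= 0 on one side, negative entries on the other."""
    return np.asarray(u) >= 0

def half_sweep_cut(u):
    """Sort vertices by entry; the n/2 smallest go to one side, the rest to the other."""
    u = np.asarray(u)
    order = np.argsort(u, kind="stable")
    side = np.zeros(len(u), dtype=bool)
    side[order[len(u) // 2:]] = True
    return side

# Hypothetical embedding in which the sign pattern is unbalanced.
u = np.array([-0.5, -0.4, 0.1, 0.2, 0.3, 0.3])

print(zero_cut(u))        # two vertices on the negative side, four on the other
print(half_sweep_cut(u))  # always balanced: three vertices per side
```

The design difference is that the $0$-cut can return unequal sides, while the half sweep cut always returns a balanced bisection, so the two roundings can disagree whenever the embedding is shifted away from zero.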

Theorems & Definitions (69)

  • Definition 2.1: Adjacency matrix
  • Definition 2.2: Degree matrix
  • Definition 2.3: Unnormalized Laplacian
  • Definition 2.4: Normalized Laplacians
  • Definition 2.5: Unnormalized and normalized spectral bisection
  • Definition 2.6
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma A.1
  • ...and 59 more
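
The matrices named in Definitions 2.1 to 2.4 can be built as follows. This sketch uses the standard textbook forms: $L = D - A$ for the unnormalized Laplacian, with $I - D^{-1/2} A D^{-1/2}$ and $I - D^{-1} A$ as two common normalized variants. Whether the paper uses exactly these normalized forms is an assumption, and the function name and dictionary keys are hypothetical.

```python
import numpy as np

def graph_matrices(adj):
    """Build standard graph matrices from a symmetric 0/1 adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    if np.any(deg == 0):
        # The normalized forms divide by degrees, so isolated vertices
        # are rejected rather than silently producing infinities.
        raise ValueError("normalized Laplacians need every degree to be positive")
    D = np.diag(deg)
    L = D - A                                    # unnormalized Laplacian
    inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    L_rw = np.eye(n) - A / deg[:, None]          # row i divided by deg[i]
    return {"A": A, "D": D, "L": L, "L_sym": L_sym, "L_rw": L_rw}

# Illustrative input: a path on three vertices.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
mats = graph_matrices(path)
print(mats["L"])
```

The two normalized variants are similar matrices and therefore share eigenvalues, but their eigenvectors differ by a degree-dependent rescaling, which can change the bipartition that a sign-based rounding produces.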