Spectral Clustering with Side Information

Hendrik Fichtenberger, Michael Kapralov, Ekaterina Kochetkova, Silvio Lattanzi, Davide Mazzali, Weronika Wrzos-Kaminska

TL;DR

The paper addresses spectral clustering on graphs with planted $k$-cluster structure augmented by noisy vertex labels, and asks whether combining graph structure with side information can yield near-optimal recovery rates. It develops two core advances: a sublinear-time classifier that achieves a misclassification rate of $\widetilde{O}(\varepsilon\delta)$ by exploiting spectral structure and a robust reachability test in cross graphs, and a polynomial-time edge-reweighting technique (via SDP) that transforms the input into a graph with improved multi-way conductance, enabling a clustering that is $\widetilde{O}(\varepsilon\delta)$-close to the target while preserving expansion. The methods hinge on a careful analysis of spectral clusters, impostors, cross graphs, and label clusters, and rely on approximate spectral inner-product oracles to enable scalable computation. Together, these results demonstrate that side information can substantially improve clustering accuracy and can be used to refine community structure, in sublinear time for classification and polynomial time for reweighting, with strong theoretical guarantees under a worst-case clusterable model. The work has practical relevance for large-scale graph analytics where both structure and noisy annotations are available, offering sublinear-time data structures and SDP-based reweighting tools to achieve near-optimal recovery.

Abstract

In the graph clustering problem with a planted solution, the input is a graph on $n$ vertices partitioned into $k$ clusters, and the task is to infer the clusters from graph structure. A standard assumption is that clusters induce well-connected subgraphs (i.e., $Ω(1)$-expanders), and form $ε$-sparse cuts. Such a graph defines the clustering uniquely up to $\approx ε$ misclassification rate, and efficient algorithms for achieving this rate are known. While this vanilla version of graph clustering is well studied, in practice, vertices of the graph are typically equipped with labels that provide additional information on cluster ids of the vertices. For example, each vertex could have a cluster label that is corrupted independently with probability $δ$. Using only one of the two sources of information leads to misclassification rate $\min\{ε, δ\}$, but can they be combined to achieve a rate of $\approx εδ$? In this paper, we give an affirmative answer to this question and present a sublinear-time algorithm in the number of vertices $n$. Our key algorithmic insight is a new observation on "spectrally ambiguous" vertices in a well-clusterable graph. While our sublinear-time classifier achieves the nearly optimal $\approx \widetilde O(εδ)$ misclassification rate, the approximate clusters that it outputs do not necessarily induce expanders in the graph $G$. In our second result, we give a polynomial-time algorithm that reweights edges of the original $(k, ε, Ω(1))$-clusterable graph to transform it into a $(k, \widetilde O(εδ), Ω(1))$-clusterable one (for constant $k$), improving sparsity of cuts nearly optimally and preserving expansion properties of the communities: an algorithm for refining community structure of the input graph.
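To make the setting in the abstract concrete, the following is a minimal Python sketch of the two information sources: a graph with planted clusters and labels corrupted independently with probability $δ$. It is not the paper's sublinear-time classifier or its SDP reweighting. All names (`planted_graph`, `corrupt_labels`, `conductance`, `neighbor_vote`, `error_rate`) and the parameters `p_in` and `p_out` are hypothetical, the random graph generator is only a convenient stand-in for a clusterable graph, the corruption model (replace by a uniformly random different cluster id) is an assumption, and `conductance` uses the common cut-over-volume form, which may be normalized differently from the paper's Definition 1.1. The neighbor vote is a naive baseline that merely combines labels with graph structure; no error guarantee is claimed for it.

```python
import random


def planted_graph(n, k, p_in, p_out, rng):
    """Hypothetical generator: vertex v belongs to cluster v % k and each
    pair is joined independently (p_in inside a cluster, p_out across)."""
    truth = [v % k for v in range(n)]
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if truth[u] == truth[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return truth, adj


def corrupt_labels(truth, k, delta, rng):
    """Assumed noise model: with probability delta, independently per vertex,
    replace the true label by a uniformly random *different* cluster id."""
    noisy = []
    for c in truth:
        if rng.random() < delta:
            noisy.append(rng.choice([x for x in range(k) if x != c]))
        else:
            noisy.append(c)
    return noisy


def conductance(adj, subset):
    """Edges leaving `subset` divided by the total degree inside `subset`."""
    subset = set(subset)
    vol = sum(len(adj[v]) for v in subset)
    cut = sum(1 for v in subset for w in adj[v] if w not in subset)
    return cut / vol if vol else 0.0


def neighbor_vote(adj, labels, k):
    """Naive baseline: relabel each vertex by the plurality of its own noisy
    label and its neighbours' noisy labels (ties go to the smallest id)."""
    out = []
    for v, nbrs in enumerate(adj):
        counts = [0] * k
        counts[labels[v]] += 1
        for w in nbrs:
            counts[labels[w]] += 1
        out.append(max(range(k), key=counts.__getitem__))
    return out


def error_rate(pred, truth):
    """Fraction of vertices whose predicted cluster id differs from the truth."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


if __name__ == "__main__":
    rng = random.Random(0)
    truth, adj = planted_graph(n=300, k=3, p_in=0.2, p_out=0.01, rng=rng)
    noisy = corrupt_labels(truth, k=3, delta=0.2, rng=rng)
    cluster0 = [v for v, c in enumerate(truth) if c == 0]
    print("conductance of planted cluster 0:", conductance(adj, cluster0))
    print("error of noisy labels alone:", error_rate(noisy, truth))
    print("error after neighbour vote:", error_rate(neighbor_vote(adj, noisy, 3), truth))
```

Running the script prints the conductance of one planted cluster and the error rates of the raw noisy labels and of the neighbor vote, which gives a quick way to experiment with how the cut sparsity and the label noise interact.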

Paper Structure

This paper contains 46 sections, 66 theorems, 357 equations, 4 figures, 3 tables, 10 algorithms.

Key Result

Theorem 1.4

There is a $d\cdot n^{1/2+O(\epsilon)}\mathrm{poly}(\log(n/\delta))$-time algorithm that, given $G$ and $\sigma$ as per im and respec

(Footnote on the running time: For the stated preprocessing time, we additionally require $\epsilon \geq 1/\mathrm{poly}(\log(n))$. For more details, see thm:sublinear and rem:additional_bounds_eps.)

Figures (4)

  • Figure 1: In the clustering $C_1,C_2$ (dashed circles), the vertices in $M$ belong to $C_1$; in the clustering $\widetilde{C}_1,\widetilde{C}_2$ (dotted circles), the vertices in $M$ belong to $\widetilde{C}_2$. Since both are valid $(2,\epsilon,\phi)$-clusterings of this graph, the graph is uninformative as to which cluster the vertices in $M$ belong to.
  • Figure 2: Illustration of the cross graph $G_{i, j}$ (see \ref{def:crossgraph}). The axes represent three cluster means $\mu_i, \mu_j$ and $\mu_l$ for distinct $i,j,l \in [k]$. The vertex set of $G_{i, j}$ can be partitioned into $\mathrm{SpecCluster}(i)\cap \mathrm{LabelCluster}(j)$ (vertices inside the shaded circle labeled $\mathrm{SpecCluster}(i)$) and $\mathcal{X}$ (vertices inside the dashed circle labeled "cross vertices $\mathcal{X}$"). The vertices marked by stars illustrate vertices that truly belong to $C_j$. The black lines illustrate the edges with at least one endpoint in $\mathrm{SpecCluster}(i)\cap \mathrm{LabelCluster}(j)$, while edges with both endpoints in $\mathcal{X}$ are omitted. Note that the vertices in $\mathrm{SpecCluster}(i) \cap \mathrm{LabelCluster}(j)$ that truly belong to $C_j$ are typically connected to $\mathcal{X}$, while most of the vertices that do not belong to $C_j$ are not.
  • Figure 3: An illustration of a vertex $u$ in $\mathrm{SpecCluster}(i)$. The solid arrows point to the neighbors of $u$ with high projection on $\mu_i$, and dashed arrows point to the neighbors with small projection on $\mu_i$. The solid and dashed arrows balance each other.
  • Figure 4: The cluster $C_i'$ can be partitioned into the "core" $C_i \setminus X_i$ and the "reassigned" vertices $R_i$. For $S \subseteq C_i'$, we let $S_{core} = S \setminus R_i$ and $S_{out} = S \cap R_i$.

Theorems & Definitions (215)

  • Definition 1.1: Conductance
  • Definition 1.2: $(k,\epsilon,\phi)$-clustering
  • Remark 1.3
  • Theorem 1.4: Classifier -- see \ref{thm:sublinear}
  • Remark 1.5
  • Theorem 1.6: Refining communities -- see \ref{thm:round_random_sigma}
  • Theorem 1.7: Refining communities, general -- see \ref{thm:round_to_clustering}
  • Lemma 2.1: Variance bound -- see \ref{lemma:variancebound}, \ref{lemma:clustermeans}
  • Definition 2.2: Spectral clusters and cross vertices -- see \ref{def:spec}
  • Remark 2.3
  • ...and 205 more