Spectral Clustering with Side Information
Hendrik Fichtenberger, Michael Kapralov, Ekaterina Kochetkova, Silvio Lattanzi, Davide Mazzali, Weronika Wrzos-Kaminska
TL;DR
The paper studies spectral clustering on graphs with a planted $k$-cluster structure whose vertices also carry noisy cluster labels, and asks whether graph structure and side information can be combined to reach a near-optimal recovery rate. It makes two contributions. First, a sublinear-time classifier achieves a misclassification rate of $\widetilde{O}(\varepsilon\delta)$ by exploiting spectral structure together with a robust reachability test in cross graphs. Second, a polynomial-time edge-reweighting technique (via SDP) transforms the input into a graph with improved multi-way conductance, yielding a clustering that is $\widetilde{O}(\varepsilon\delta)$-close to the target while preserving expansion. The methods hinge on a careful analysis of spectral clusters, impostors, cross graphs, and label clusters, and rely on approximate spectral inner-product oracles for scalable computation. Together, the results show that side information can substantially improve clustering accuracy: classification runs in sublinear time and community-structure refinement in polynomial time, both with strong theoretical guarantees under a worst-case clusterable model. The work is relevant to large-scale graph analytics where both structure and noisy annotations are available, offering sublinear-time data structures and SDP-based reweighting tools for near-optimal recovery.
Abstract
In the graph clustering problem with a planted solution, the input is a graph on $n$ vertices partitioned into $k$ clusters, and the task is to infer the clusters from graph structure. A standard assumption is that clusters induce well-connected subgraphs (i.e., $\Omega(1)$-expanders), and form $\varepsilon$-sparse cuts. Such a graph defines the clustering uniquely up to $\approx \varepsilon$ misclassification rate, and efficient algorithms for achieving this rate are known. While this vanilla version of graph clustering is well studied, in practice, vertices of the graph are typically equipped with labels that provide additional information on cluster ids of the vertices. For example, each vertex could have a cluster label that is corrupted independently with probability $\delta$. Using only one of the two sources of information leads to misclassification rate $\min\{\varepsilon, \delta\}$, but can they be combined to achieve a rate of $\approx \varepsilon\delta$? In this paper, we give an affirmative answer to this question and present a sublinear-time algorithm in the number of vertices $n$. Our key algorithmic insight is a new observation on "spectrally ambiguous" vertices in a well-clusterable graph. While our sublinear-time classifier achieves the nearly optimal $\approx \widetilde{O}(\varepsilon\delta)$ misclassification rate, the approximate clusters that it outputs do not necessarily induce expanders in the graph $G$. In our second result, we give a polynomial-time algorithm that reweights edges of the original $(k, \varepsilon, \Omega(1))$-clusterable graph to transform it into a $(k, \widetilde{O}(\varepsilon\delta), \Omega(1))$-clusterable one (for constant $k$), improving sparsity of cuts nearly optimally and preserving expansion properties of the communities, i.e., an algorithm for refining the community structure of the input graph.
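To make the rates in the abstract concrete, the following minimal Python sketch simulates only the noisy-label side of the model and compares the label-only error with the single-source rate $\min\{\varepsilon, \delta\}$ and the combined target $\varepsilon\delta$. It is an illustration, not the paper's algorithm: the function names, the parameter values, and the choice that a corrupted label is uniform over the other $k - 1$ clusters are all assumptions made here, and polylogarithmic factors hidden by $\widetilde{O}(\cdot)$ are ignored.

```python
import random


def corrupt_labels(true_labels, k, delta, rng):
    """Independently replace each label, with probability delta, by a wrong cluster id."""
    noisy = []
    for c in true_labels:
        if rng.random() < delta:
            # Assumption: a corrupted label is uniform over the other k - 1 clusters.
            wrong = rng.randrange(k - 1)
            noisy.append(wrong if wrong < c else wrong + 1)
        else:
            noisy.append(c)
    return noisy


def misclassification_rate(true_labels, predicted):
    """Fraction of vertices whose predicted cluster differs from the planted one."""
    wrong = sum(1 for t, p in zip(true_labels, predicted) if t != p)
    return wrong / len(true_labels)


rng = random.Random(0)             # fixed seed for reproducibility
n, k = 100_000, 4                  # hypothetical instance size and cluster count
eps, delta = 0.05, 0.10            # hypothetical cut sparsity and label noise

true_labels = [i % k for i in range(n)]        # planted, balanced clustering
noisy_labels = corrupt_labels(true_labels, k, delta, rng)

label_only_rate = misclassification_rate(true_labels, noisy_labels)  # close to delta
single_source_rate = min(eps, delta)           # best of structure-only or labels-only
combined_target = eps * delta                  # rate targeted by combining both sources

print(f"labels only:   {label_only_rate:.4f}")
print(f"min(eps, delta): {single_source_rate:.4f}")
print(f"eps * delta:     {combined_target:.4f}")
```

With these hypothetical values, the target $\varepsilon\delta$ is ten times smaller than $\min\{\varepsilon, \delta\}$, which is the gap the paper's question is about.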
