Spectral Clustering with Side Information

Hendrik Fichtenberger; Michael Kapralov; Ekaterina Kochetkova; Silvio Lattanzi; Davide Mazzali; Weronika Wrzos-Kaminska

TL;DR

The paper studies spectral clustering on graphs with a planted $k$-cluster structure whose vertices also carry noisy cluster labels, and asks whether combining graph structure with this side information yields a near-optimal misclassification rate. It makes two contributions. The first is a sublinear-time classifier that achieves a misclassification rate of $\widetilde{O}(\varepsilon\delta)$ by exploiting spectral structure together with a robust reachability test in cross graphs. The second is a polynomial-time, SDP-based edge-reweighting technique that transforms the input into a graph with improved multi-way conductance, so that the cuts between clusters become $\widetilde{O}(\varepsilon\delta)$-sparse while the expansion of the clusters is preserved. The analysis rests on spectral clusters, impostors, cross graphs, and label clusters, and the algorithms rely on approximate spectral inner-product oracles for scalability. Together, the results show that side information can substantially improve clustering accuracy (in sublinear time) and can be used to refine community structure (in polynomial time), with theoretical guarantees under a worst-case clusterable model. The work is relevant to large-scale graph analytics where both structure and noisy annotations are available.

Abstract

In the graph clustering problem with a planted solution, the input is a graph on $n$ vertices partitioned into $k$ clusters, and the task is to infer the clusters from graph structure. A standard assumption is that clusters induce well-connected subgraphs (i.e., $\Omega(1)$-expanders) and form $\varepsilon$-sparse cuts. Such a graph defines the clustering uniquely up to $\approx \varepsilon$ misclassification rate, and efficient algorithms for achieving this rate are known. While this vanilla version of graph clustering is well studied, in practice, vertices of the graph are typically equipped with labels that provide additional information on cluster ids of the vertices. For example, each vertex could have a cluster label that is corrupted independently with probability $\delta$. Using only one of the two sources of information leads to misclassification rate $\min\{\varepsilon, \delta\}$, but can they be combined to achieve a rate of $\approx \varepsilon\delta$? In this paper, we give an affirmative answer to this question and present an algorithm whose running time is sublinear in the number of vertices $n$. Our key algorithmic insight is a new observation on "spectrally ambiguous" vertices in a well-clusterable graph. While our sublinear-time classifier achieves the nearly optimal $\widetilde{O}(\varepsilon\delta)$ misclassification rate, the approximate clusters that it outputs do not necessarily induce expanders in the graph $G$. In our second result, we give a polynomial-time algorithm that reweights edges of the original $(k, \varepsilon, \Omega(1))$-clusterable graph to transform it into a $(k, \widetilde{O}(\varepsilon\delta), \Omega(1))$-clusterable one (for constant $k$), improving sparsity of cuts nearly optimally and preserving expansion properties of the communities. This yields an algorithm for refining the community structure of the input graph.
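To make the setting concrete, here is a minimal sketch of the two sources of information described above. It is not the paper's algorithm. It assumes two clusters, a random planted graph (the paper works in a worst-case clusterable model), and label corruption that flips a binary label. A plain neighbor-majority vote stands in for the combination step. All function names and parameter values are hypothetical.

```python
import random

def planted_graph(n_per_cluster, p_in, p_out, rng):
    """Two planted clusters (assumed random model); returns (adjacency lists, true ids)."""
    n = 2 * n_per_cluster
    truth = [0] * n_per_cluster + [1] * n_per_cluster
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if truth[u] == truth[v] else p_out
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj, truth

def conductance(adj, side):
    """Number of edges leaving the set divided by the set's volume (sum of degrees)."""
    inside = set(side)
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol = sum(len(adj[u]) for u in inside)
    return cut / vol if vol else 0.0

def noisy_labels(truth, delta, rng):
    """Flip each binary label independently with probability delta."""
    return [1 - t if rng.random() < delta else t for t in truth]

def neighbor_vote(adj, labels):
    """Relabel each vertex by the majority of its neighbors' noisy labels.
    Ties (including isolated vertices) keep the vertex's own label."""
    out = []
    for u, nbrs in enumerate(adj):
        ones = sum(labels[v] for v in nbrs)
        zeros = len(nbrs) - ones
        out.append(labels[u] if ones == zeros else int(ones > zeros))
    return out

def error_rate(pred, truth):
    """Fraction of vertices whose predicted id differs from the true id."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

rng = random.Random(0)
adj, truth = planted_graph(200, 0.3, 0.01, rng)   # dense clusters, sparse cut
labels = noisy_labels(truth, 0.2, rng)            # delta = 0.2
eps_hat = conductance(adj, range(200))            # empirical sparsity of the planted cut
label_err = error_rate(labels, truth)             # error using labels alone
vote_err = error_rate(neighbor_vote(adj, labels), truth)
print(f"cut conductance ~ {eps_hat:.3f}, label error ~ {label_err:.3f}, "
      f"after neighbor vote ~ {vote_err:.3f}")
```

In this easy random instance the vote removes almost all label noise, because every vertex has many in-cluster neighbors. The sketch only shows how the cut sparsity $\varepsilon$ and the label noise $\delta$ enter as separate quantities; it says nothing about the worst-case guarantees or the sublinear running time claimed in the abstract.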
