A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing Time
Ranran Shen, Pan Peng
TL;DR
This work develops a sublinear-time spectral clustering oracle for $(k,\varphi,\varepsilon)$-clusterable graphs: after a sublinear-time preprocessing phase, it answers WhichCluster queries in sublinear time. The main advance is to abandon explicit guessing of cluster centers and instead exploit the dot-product structure of spectral embeddings. The oracle builds a sampling-based similarity graph on a small set of vertices so that, with high probability, each cluster corresponds to a connected component. Preprocessing and query times are both of the form $O(n^{1/2+O(\varepsilon/\varphi^2)}\cdot \mathrm{poly}(k\log n/(\gamma\varphi)))$, the space is $O(n^{1/2+O(\varepsilon/\varphi^2)}\cdot \mathrm{poly}(k\log n/\gamma))$, and at most $O(\mathrm{poly}(k)\cdot \varepsilon^{1/3})\cdot|C_i|$ vertices are misclassified with respect to each cluster $C_i$. The oracle is also robust to a small number of random edge deletions. The approach is implementable and is validated on synthetic stochastic block model (SBM) graphs, where it clusters accurately while accessing only part of the graph.
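To make the similarity-graph idea concrete, here is a minimal sketch under stated assumptions; it is not the authors' algorithm. It computes spectral embeddings exactly with a full eigendecomposition (so it is not sublinear, whereas the oracle would have to estimate dot products from limited graph access), and the toy graph, the threshold $1/(2m)$, the arg-max query rule, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def two_cliques(m):
    """Adjacency matrix of two m-cliques joined by a single bridge edge (toy input)."""
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0
    return A

def spectral_embedding(A, k):
    """Row x is f_x, read off the top-k eigenvectors of D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    N = A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(N)  # eigenvalues come back in ascending order
    return vecs[:, -k:]

def similarity_components(F, sample, threshold):
    """Link sampled vertices whose dot product <f_x, f_y> is at least threshold,
    then label each sampled vertex with its connected component."""
    G = F[sample] @ F[sample].T
    s = len(sample)
    labels = [-1] * s
    comp = 0
    for start in range(s):
        if labels[start] != -1:
            continue
        labels[start] = comp
        stack = [start]
        while stack:  # depth-first search over the similarity graph
            u = stack.pop()
            for v in range(s):
                if labels[v] == -1 and G[u, v] >= threshold:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    return labels

def which_cluster(F, sample, labels, x):
    """Hypothetical query rule: return the component of the sampled vertex
    whose embedding has the largest dot product with f_x."""
    scores = F[sample] @ F[x]
    return labels[int(np.argmax(scores))]

m = 6
A = two_cliques(m)
F = spectral_embedding(A, k=2)
rng = np.random.default_rng(0)
sample = rng.choice(2 * m, size=8, replace=False)  # 8 of 12 vertices hits both cliques
labels = similarity_components(F, sample, threshold=1.0 / (2 * m))
predicted = [which_cluster(F, sample, labels, x) for x in range(2 * m)]
print(predicted)
```

On this toy graph, dot products within a clique are close to $1/m$ and those across cliques are close to $0$, so the thresholded similarity graph splits into two components and every vertex of a clique receives the same label.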
Abstract
We address the problem of designing a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability. Such graphs contain $k$ latent clusters, each characterized by a large inner conductance (at least $\varphi$) and a small outer conductance (at most $\varepsilon$). Our aim is to preprocess the graph to enable clustering membership queries, with the key requirement that both preprocessing and query answering should be performed in sublinear time, and the resulting partition should be consistent with a $k$-partition that is close to the ground-truth clustering. Previous oracles have relied on either a $\textrm{poly}(k)\log n$ gap between inner and outer conductances or exponential (in $k/\varepsilon$) preprocessing time. Our algorithm relaxes these assumptions, albeit at the cost of a slightly higher misclassification ratio. We also show that our clustering oracle is robust against a few random edge deletions. To validate our theoretical bounds, we conducted experiments on synthetic networks.
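The two conductance notions in the abstract can be illustrated on a tiny graph. The sketch below assumes the standard volume-based definitions (cut edges divided by the total degree of the set, with inner conductance taken as the minimum over subsets of the induced subgraph holding at most half its volume); the paper's exact normalization may differ, and the function names and example graph are hypothetical. The inner-conductance routine is brute force and only suitable for very small clusters.

```python
from itertools import combinations

def outer_conductance(edges, cluster):
    """Edges leaving `cluster` divided by the total degree of its vertices."""
    cluster = set(cluster)
    cut = 0
    volume = 0
    for u, v in edges:
        in_u, in_v = u in cluster, v in cluster
        volume += in_u + in_v  # each endpoint inside adds one to the volume
        if in_u != in_v:
            cut += 1
    return cut / volume

def inner_conductance(edges, cluster):
    """Brute-force minimum conductance over subsets of the induced subgraph
    that hold at most half of its volume (exponential time, toy sizes only)."""
    members = sorted(set(cluster))
    member_set = set(members)
    inside = [(u, v) for u, v in edges if u in member_set and v in member_set]
    total = 2 * len(inside)
    best = float("inf")
    for r in range(1, len(members)):
        for subset in combinations(members, r):
            S = set(subset)
            vol = sum((u in S) + (v in S) for u, v in inside)
            if 0 < vol <= total / 2:
                cut = sum((u in S) != (v in S) for u, v in inside)
                best = min(best, cut / vol)
    return best

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cluster = [0, 1, 2]
eps = outer_conductance(edges, cluster)  # one cut edge over volume 7
phi = inner_conductance(edges, cluster)  # a single triangle vertex: cut 2, volume 2
print(eps, phi)
```

Here the triangle has outer conductance $1/7$ and inner conductance $1$, a small instance of the "large inner, small outer" pattern that the clusterability assumption formalizes.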
