A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing Time

Ranran Shen, Pan Peng

TL;DR

This work develops a sublinear-time spectral clustering oracle for $(k,\varphi,\varepsilon)$-clusterable graphs: after a sublinear-time preprocessing phase, it answers WhichCluster queries in sublinear time. The main advance is to abandon explicit guessing of cluster centers and instead exploit the dot-product structure of spectral embeddings. The oracle builds a sampling-based similarity graph on a small set of vertices in which, with high probability, the clusters form connected components. It achieves preprocessing and query times of $O(n^{1/2+O(\varepsilon/\varphi^2)}\cdot \mathrm{poly}(\frac{k\log n}{\gamma\varphi}))$, space $O(n^{1/2+O(\varepsilon/\varphi^2)}\cdot \mathrm{poly}(\frac{k\log n}{\gamma}))$, and misclassification error $O(\mathrm{poly}(k)\cdot \varepsilon^{1/3})\cdot|C_i|$ for each cluster $C_i$, and it remains robust to a small number of random edge deletions. The approach is implementable and comes with theoretical guarantees; experiments on synthetic stochastic block model (SBM) graphs show feasible performance with reduced data access. Overall, the paper reduces the preprocessing cost and space of sublinear clustering oracles while keeping the clustering accuracy meaningful and tolerating small perturbations.
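The similarity-graph idea can be sketched as follows. This is only a rough illustration under stated assumptions: the dot products come from hand-written toy vectors rather than from the paper's sublinear dot-product estimates, and the names `similarity_components`, `theta`, and `toy` are hypothetical. Two sampled vertices are joined when their dot product is at least `theta`, and clusters are read off as connected components.

```python
from itertools import combinations


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_components(embeddings, theta):
    """Group sample indices by thresholded dot-product similarity.

    An edge joins samples i and j when their dot product is >= theta;
    the result lists the connected components of that graph.
    """
    parent = list(range(len(embeddings)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(embeddings)), 2):
        if dot(embeddings[i], embeddings[j]) >= theta:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy embeddings (assumed, not from the paper): same-cluster vectors are
# nearly parallel, different-cluster vectors are nearly orthogonal.
toy = [
    (1.0, 0.05, 0.0), (0.9, 0.0, 0.1),    # cluster A
    (0.0, 1.0, 0.05), (0.1, 0.95, 0.0),   # cluster B
    (0.05, 0.0, 1.0), (0.0, 0.1, 0.9),    # cluster C
]
components = similarity_components(toy, theta=0.5)
```

With a threshold that separates the large same-cluster dot products from the small cross-cluster ones, each toy cluster ends up as its own component. A threshold that is too low merges everything into one component, which is why the choice of threshold matters.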

Abstract

We address the problem of designing a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability. Such graphs contain $k$ latent clusters, each characterized by a large inner conductance (at least $\varphi$) and a small outer conductance (at most $\varepsilon$). Our aim is to preprocess the graph to enable clustering membership queries, with the key requirement that both preprocessing and query answering should be performed in sublinear time, and the resulting partition should be consistent with a $k$-partition that is close to the ground-truth clustering. Previous oracles have relied on either a $\textrm{poly}(k)\log n$ gap between inner and outer conductances or exponential (in $k/\varepsilon$) preprocessing time. Our algorithm relaxes these assumptions, albeit at the cost of a slightly higher misclassification ratio. We also show that our clustering oracle is robust against a few random edge deletions. To validate our theoretical bounds, we conducted experiments on synthetic networks.
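To make the two conductance notions concrete, here is a minimal brute-force sketch for a tiny $d$-regular graph. It assumes the usual definitions, which this summary does not reproduce: the outer conductance of $S$ within $C$ is $|E(S, C\setminus S)|/(d|S|)$, and the inner conductance of $C$ is the minimum of that quantity over nonempty $S\subseteq C$ with $|S|\le |C|/2$. The function names and the 8-cycle example are hypothetical, and the subset enumeration is exponential, so this is for intuition only.

```python
from itertools import combinations


def outer_conductance(adj, S, C, d):
    """Edges leaving S but staying inside C, divided by d * |S|."""
    S, C = set(S), set(C)
    cut = sum(1 for u in S for v in adj[u] if v in C and v not in S)
    return cut / (d * len(S))


def inner_conductance(adj, C, d):
    """Brute-force minimum of outer_conductance(S, C) over nonempty
    subsets S of C with |S| <= |C| / 2 (assumed definition)."""
    C = sorted(C)
    if len(C) <= 1:
        return 1.0
    return min(
        outer_conductance(adj, S, C, d)
        for size in range(1, len(C) // 2 + 1)
        for S in combinations(C, size)
    )


# Toy example: an 8-cycle, which is 2-regular.
n, d = 8, 2
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
cluster = [0, 1, 2, 3]

# Outer conductance of the cluster with respect to the whole vertex set.
phi_out = outer_conductance(adj, cluster, range(n), d)
# Inner conductance of the cluster (how well connected it is internally).
phi_in = inner_conductance(adj, cluster, d)
```

On this cycle both values come out to 0.25, so a path-like "cluster" is not well separated in the sense the abstract requires. A clusterable graph needs the inner conductance to be much larger than the outer one.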

Paper Structure (20 sections, 13 theorems, 25 equations, 2 figures, 4 tables, 8 algorithms)

This paper contains 20 sections, 13 theorems, 25 equations, 2 figures, 4 tables, 8 algorithms.

Key Result

Theorem 1

Let $k\ge 2$ be an integer, $\varphi\in (0,1)$. Let $G=(V,E)$ be a $d$-regular $n$-vertex graph that admits a $(k,\varphi,\varepsilon)$-clustering $C_1,\dots,C_k$, $\frac{\varepsilon}{\varphi^2}\ll \frac{\gamma^3}{k^{\frac{9}{2}}\cdot \log^3k}$ and for all $i\in[k]$, $\gamma\frac{n}{k}\le |C_i|\le \

Figures (2)

  • Figure 1: The angle between embeddings of vertices in the same cluster is small and the angle between embeddings of vertices in different clusters is close to orthogonal $(k=3)$.
  • Figure 2: For a random graph $G$ generated by SBM with $n=3000,k=3,p=0.03,q=0.002$, we build the dot product oracle for several different parameters for $t,s,R_{init}, R_{query}$ and plot the density graph. The setting with the most prominent gap in the density graph, i.e., the one on the right, is selected. We can further set $\theta=0.0005$ for $G$ according to the right graph.
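For readers who want to reproduce a graph of the kind described in Figure 2, the following is a minimal sketch of a stochastic block model sampler. It assumes $k$ equal-sized blocks, intra-block edge probability $p$, and inter-block edge probability $q$. The function name `sample_sbm` and the `seed` parameter are hypothetical, and the example uses $n=300$ rather than the figure's $n=3000$ to keep the quadratic loop quick.

```python
import random


def sample_sbm(n, k, p, q, seed=0):
    """Sample an undirected SBM graph with k (near-)equal blocks.

    Vertex v is placed in block v * k // n. Each pair inside a block is
    joined with probability p, each pair across blocks with probability q.
    Returns (adjacency sets, block labels).
    """
    rng = random.Random(seed)
    block = [v * k // n for v in range(n)]
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if block[u] == block[v] else q
            if rng.random() < prob:
                adj[u].add(v)
                adj[v].add(u)
    return adj, block


# Same k, p, q as in the Figure 2 caption, with a smaller n for speed.
adj, block = sample_sbm(n=300, k=3, p=0.03, q=0.002, seed=1)
```

The sampler only produces the input graph. The parameters $t, s, R_{init}, R_{query}$ and the threshold $\theta$ mentioned in the caption belong to the oracle itself and are not modelled here.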

Theorems & Definitions (26)

  • Definition 1.1: Inner and outer conductance
  • Definition 1.2: $k$-partition
  • Definition 1.3: $(k,\varphi,\varepsilon)$-clustering
  • Theorem 1
  • Theorem 2: Informal; Robust against random edge deletions
  • Definition 2.1: Spectral embedding
  • Definition 2.2: Cluster centers
  • Lemma 2.1: Theorem 2, gluch2021spectral
  • Lemma 4.1
  • proof
  • ...and 16 more