Coreset Spectral Clustering

Ben Jourdan, Gregory Schwartzman, Peter Macgregor, He Sun

TL;DR

This work addresses scalable graph clustering by uniting spectral clustering and kernel $k$-means through the equivalence between the normalised cut and kernel $k$-means objectives, enabling efficient processing of large sparse graphs. It introduces Coreset Spectral Clustering (CSC), which builds an $\varepsilon$-coreset and solves the normalised cut problem on the coreset graph, with a provable transfer of quality to the full graph. Theoretically, an $\alpha$-approximation on the coreset graph yields an $O(\alpha)$-approximation on the original graph, and a faster coreset construction runs in $\tilde{O}(n\cdot \min\{k, d_{avg}\})$ time on sparse kernels. Empirically, CSC delivers significant speedups on large graphs and maintains or improves clustering quality, even as the number of clusters grows large, outperforming coreset kernel $k$-means and standard spectral clustering in several settings.

Abstract

Coresets have become an invaluable tool for solving $k$-means and kernel $k$-means clustering problems on large datasets with small numbers of clusters. On the other hand, spectral clustering works well on sparse graphs and has recently been extended to scale efficiently to large numbers of clusters. We exploit the connection between kernel $k$-means and the normalised cut problem to combine the benefits of both. Our main result is a coreset spectral clustering algorithm for graphs that clusters a coreset graph to infer a good labelling of the original graph. We prove that an $\alpha$-approximation for the normalised cut problem on the coreset graph is an $O(\alpha)$-approximation on the original. We also improve the running time of the state-of-the-art coreset algorithm for kernel $k$-means on sparse kernels, from $\tilde{O}(nk)$ to $\tilde{O}(n\cdot \min \{k, d_{avg}\})$, where $d_{avg}$ is the average number of non-zero entries in each row of the $n\times n$ kernel matrix. Our experiments confirm our coreset algorithm is asymptotically faster on large real-world graphs with many clusters, and show that our clustering algorithm overcomes the main challenge faced by coreset kernel $k$-means on sparse kernels, which is getting stuck in local optima.
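
The connection between kernel $k$-means and the normalised cut mentioned above can be checked numerically. The sketch below uses one standard form of the equivalence (due to Dhillon, Guan and Kulis): weights equal to the vertex degrees and kernel $K = D^{-1} A D^{-1}$. The paper's exact construction may differ, and the function names `normalised_cut` and `weighted_kernel_kmeans_cost` are illustrative assumptions, not taken from the paper. For a graph without self-loops, the two objectives should differ only by the additive constant $k$.

```python
import numpy as np

def normalised_cut(A, labels, k):
    """Sum over clusters of cut(C, V \\ C) / vol(C) for adjacency matrix A."""
    d = A.sum(axis=1)                      # weighted degrees
    total = 0.0
    for c in range(k):
        mask = labels == c
        vol = d[mask].sum()                # vol(C) = sum of degrees in C
        cut = A[mask][:, ~mask].sum()      # edge weight leaving C
        total += cut / vol
    return total

def weighted_kernel_kmeans_cost(K, w, labels, k):
    """Weighted kernel k-means cost, computed from the kernel matrix only.

    cost = sum_i w_i K_ii - sum_C (w_C^T K_CC w_C) / sum(w_C)
    """
    cost = float(np.sum(w * np.diag(K)))
    for c in range(k):
        mask = labels == c
        wc = w[mask]
        cost -= wc @ K[np.ix_(mask, mask)] @ wc / wc.sum()
    return cost

# Small random weighted graph with no self-loops (assumed toy data).
rng = np.random.default_rng(0)
n, k = 12, 3
A = rng.random((n, n))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)
labels = np.arange(n) % k                  # arbitrary labelling, no empty cluster

d = A.sum(axis=1)
K = A / np.outer(d, d)                     # K = D^{-1} A D^{-1}

nc = normalised_cut(A, labels, k)
kkm = weighted_kernel_kmeans_cost(K, d, labels, k)
print(nc, k + kkm)                         # the two values should agree
```

Because the offset does not depend on the labelling, any labelling that minimises one objective also minimises the other, which is what lets kernel $k$-means coresets be applied to graph clustering. This kernel is not positive semidefinite in general; the usual remedy is a diagonal shift $\sigma D^{-1}$, which only adds another labelling-independent constant.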

Paper Structure

This paper contains 23 sections, 7 theorems, 23 equations, 7 figures, 8 algorithms.

Key Result

Lemma 4.1

Given a kernel matrix $K$ corresponding to dataset $X$, Algorithm alg:fastdz returns an $(O(1),O(\log k))$-approximation for kernel $k$-means with high probability and running time $\tilde{O}(\min(d_{avg},k)\cdot n)$.
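
For orientation, here is a minimal sketch of plain $D^2$-sampling in kernel space on a dense kernel matrix. It is not the paper's fast algorithm: it does $O(n)$ work per sampled centre, so $O(nk)$ overall, and it does not exploit sparsity. The function name `kernel_d2_sampling` and the toy linear kernel are assumptions made for illustration. Squared feature-space distances are obtained from the kernel via $\|\phi(x_i)-\phi(x_c)\|^2 = K_{ii} + K_{cc} - 2K_{ic}$.

```python
import numpy as np

def kernel_d2_sampling(K, k, rng):
    """Pick k centre indices by D^2-sampling using only kernel entries.

    Returns the chosen indices and each point's squared distance to its
    nearest chosen centre in feature space.
    """
    n = K.shape[0]
    diag = np.diag(K).copy()
    first = int(rng.integers(n))           # first centre: uniform at random
    centres = [first]
    # Squared distance to the first centre; clip tiny negatives from rounding.
    dist = np.maximum(diag + diag[first] - 2.0 * K[:, first], 0.0)
    for _ in range(k - 1):
        probs = dist / dist.sum()          # sample proportional to D^2
        nxt = int(rng.choice(n, p=probs))
        centres.append(nxt)
        new = np.maximum(diag + diag[nxt] - 2.0 * K[:, nxt], 0.0)
        dist = np.minimum(dist, new)       # keep distance to nearest centre
    return centres, dist

# Toy usage with a linear kernel on random points (assumed data).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
K = X @ X.T
centres, dist = kernel_d2_sampling(K, 5, rng)
print(centres, dist.sum())
```

Each round reads one full column of $K$, which is where the $nk$ term of the baseline comes from. When a row of $K$ has only about $d_{avg}$ non-zero entries, most of that column is zero, which is the structure the lemma's $\min(d_{avg}, k)$ factor reflects.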

Figures (7)

  • Figure 1: Sketch of the Coreset Spectral Clustering Algorithm.
  • Figure 2: Running time comparison of coreset construction using either Algorithm alg:Dz_samplingold (jiang2022coresets) or Algorithm alg:fastdz for $D^2$-sampling. Shaded regions denote 1 standard deviation over 10 runs.
  • Figure 3: Running time, ARI, and Normalised cut of each algorithm on a 200-nearest neighbour graph of the HAR dataset. Shaded regions denote 1 standard deviation over 20 runs.
  • Figure 4: Running time and ARI of each algorithm on the stochastic block model with $k$ clusters of size $1000$, $p=1/2$, $q=0.001/k$ with a coreset size of $1\%$. Shaded regions denote 1 standard deviation over 20 runs.
  • Figure 5: Running time, ARI, and Normalised cut of each algorithm on a 250-nearest neighbour graph of the PenDigits dataset as coreset size varies.
  • ...and 2 more figures

Theorems & Definitions (17)

  • Definition 1: centroids
  • Definition 2: kernel $k$-means objective
  • Definition 3: Normalised cut objective
  • Definition 4: $\varepsilon$-coresets
  • Definition 5: $(\alpha,\beta)$-approximation for weighted kernel $k$-means
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 5.1: Adapted from kanungo2002local
  • Theorem 1
  • Proof: Proof of Theorem theorem:csc
  • ...and 7 more