Coreset Spectral Clustering

Ben Jourdan; Gregory Schwartzman; Peter Macgregor; He Sun

TL;DR

This work addresses scalable graph clustering by uniting spectral clustering and kernel $k$-means through the equivalence between the normalised cut and kernel $k$-means objectives, enabling efficient processing of large sparse graphs. It introduces Coreset Spectral Clustering (CSC), which builds an $\varepsilon$-coreset and solves the normalised cut problem on the coreset, with a provable transfer of quality to the full graph. Theoretical results show that an $\alpha$-approximation on the coreset yields an $O(\alpha)$-approximation on the original graph, and a faster coreset construction runs in $\tilde{O}(n \cdot \min\{k, d_{avg}\})$ time for sparse kernels. Empirically, CSC delivers significant speedups on large graphs and maintains or improves clustering quality, even as the number of clusters grows large, outperforming coreset kernel $k$-means and standard spectral clustering in several settings.

Abstract

Coresets have become an invaluable tool for solving $k$-means and kernel $k$-means clustering problems on large datasets with small numbers of clusters. On the other hand, spectral clustering works well on sparse graphs and has recently been extended to scale efficiently to large numbers of clusters. We exploit the connection between kernel $k$-means and the normalised cut problem to combine the benefits of both. Our main result is a coreset spectral clustering algorithm for graphs that clusters a coreset graph to infer a good labelling of the original graph. We prove that an $\alpha$-approximation for the normalised cut problem on the coreset graph is an $O(\alpha)$-approximation on the original graph. We also improve the running time of the state-of-the-art coreset algorithm for kernel $k$-means on sparse kernels, from $\tilde{O}(nk)$ to $\tilde{O}(n\cdot \min \{k, d_{avg}\})$, where $d_{avg}$ is the average number of non-zero entries in each row of the $n\times n$ kernel matrix. Our experiments confirm that our coreset algorithm is asymptotically faster on large real-world graphs with many clusters, and show that our clustering algorithm overcomes the main challenge faced by coreset kernel $k$-means on sparse kernels, namely getting stuck in local optima.
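
The abstract relies on the connection between kernel $k$-means and the normalised cut without stating it. As a minimal sketch, the snippet below checks the standard form of that connection numerically on a toy graph: with point weights equal to the vertex degrees and kernel $K = D^{-1} A D^{-1}$, the weighted kernel $k$-means cost of any labelling equals its normalised cut minus the constant $k$ (for a graph without self-loops). This is an assumption about which formulation is meant; the function names, the toy graph, and the choice of kernel are illustrative and are not taken from the paper.

```python
import numpy as np


def normalised_cut(A, labels, k):
    """Sum over clusters of (edge weight leaving the cluster) / (total degree of the cluster)."""
    deg = A.sum(axis=1)
    total = 0.0
    for c in range(k):
        inside = labels == c
        cut_weight = A[inside][:, ~inside].sum()
        total += cut_weight / deg[inside].sum()
    return total


def weighted_kernel_kmeans_cost(K, w, labels, k):
    """Weighted kernel k-means cost expressed purely through the kernel matrix K.

    For one cluster with weights w_c and kernel block K_c the cost is
    sum_i w_i K_ii - (w_c^T K_c w_c) / sum_i w_i.
    """
    total = 0.0
    for c in range(k):
        inside = labels == c
        w_c = w[inside]
        K_c = K[inside][:, inside]
        total += w_c @ np.diag(K_c) - (w_c @ K_c @ w_c) / w_c.sum()
    return total


# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

deg = A.sum(axis=1)
K = A / np.outer(deg, deg)  # entries A_ij / (d_i d_j), i.e. D^{-1} A D^{-1}

k = 2
labels = np.array([0, 0, 0, 1, 1, 1])
ncut = normalised_cut(A, labels, k)
cost = weighted_kernel_kmeans_cost(K, deg, labels, k)

# The two objectives differ only by the constant k, so they share minimisers.
print(ncut, cost + k)
```

Because this $K$ is not positive semidefinite, the computed cost can be negative. Adding a multiple $\sigma D^{-1}$ to $K$ changes the cost by the constant $\sigma(n-k)$ for fixed $k$, so it leaves the ranking of labellings unchanged.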

