Coresets for Kernel Clustering

Shaofeng H.-C. Jiang, Robert Krauthgamer, Jianing Lou, Yubo Zhang

TL;DR

This work tackles the computational bottlenecks of kernel $k$-Means by introducing $\epsilon$-coresets that work for general kernels and preserve the clustering cost for every set of $k$ centers. By embedding the potentially infinite-dimensional feature space into a finite-dimensional Euclidean space and leveraging Euclidean coreset theory, the authors obtain a coreset of size $\mathrm{poly}(k/\epsilon)$ with near-linear construction time, enabling a $(1+\epsilon)$-approximation algorithm and a streaming variant. The approach generalizes to kernel $(k,z)$-Clustering, providing an FPT-PTAS and composable coresets suitable for merge-and-reduce in streaming settings. Empirically, the coresets achieve small error with far fewer points and accelerate kernel $k$-Means++ and spectral clustering across diverse kernels and datasets. In practice, this means faster kernel-based clustering and spectral methods with provable guarantees that scale to large datasets and are compatible with streaming and distributed settings.
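
To make the guarantee concrete, the standard coreset definition can be written out as follows. The notation here is an assumption for illustration and may differ from the paper's: $\varphi$ is a feature map into a Hilbert space $\mathcal{H}$ with $K(x,y) = \langle \varphi(x), \varphi(y) \rangle$, $w_X$ and $w_S$ are point weights, and $C \subset \mathcal{H}$ is a set of $k$ centers.

$$\mathrm{cost}(X, C) = \sum_{x \in X} w_X(x) \, \min_{c \in C} \lVert \varphi(x) - c \rVert^2$$

A weighted subset $S \subseteq X$ is an $\epsilon$-coreset if, for every $C \subset \mathcal{H}$ with $|C| = k$,

$$(1-\epsilon) \, \mathrm{cost}(X, C) \le \mathrm{cost}(S, C) \le (1+\epsilon) \, \mathrm{cost}(X, C).$$

Because the bound holds for all center sets simultaneously, a clustering algorithm run on $S$ in place of $X$ loses at most a factor depending on $\epsilon$ in solution quality.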

Abstract

We devise coresets for kernel $k$-Means with a general kernel, and use them to obtain new, more efficient, algorithms. Kernel $k$-Means has superior clustering capability compared to classical $k$-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel $k$-Means that works for a general kernel and has size $\mathrm{poly}(k\epsilon^{-1})$. Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in $n$. This result immediately implies new algorithms for kernel $k$-Means, such as a $(1+\epsilon)$-approximation in time near-linear in $n$, and a streaming algorithm using space and update time $\mathrm{poly}(k \epsilon^{-1} \log n)$. We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel $k$-Means++ (the kernelized version of the widely used $k$-Means++ algorithm), and we further use this faster kernel $k$-Means++ for spectral clustering. In both applications, we achieve significant speedup and a better asymptotic growth while the error is comparable to baselines that do not use coresets.
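
The computational challenge mentioned above comes from the fact that kernel $k$-Means costs are evaluated through kernel entries rather than explicit coordinates. The following is a minimal sketch, not the authors' code, of how the cost of a weighted clustering can be computed from the kernel matrix alone. The function names `rbf_kernel` and `weighted_kernel_cost` are hypothetical, and each cluster is assumed to use its optimal center (the weighted mean in feature space).

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def weighted_kernel_cost(K, w, labels):
    """Kernel k-Means cost of a weighted clustering, using only kernel entries.

    For a cluster with index set I and total weight W, the cost with the
    optimal feature-space center is
        sum_{i in I} w_i K_ii - (1 / W) * sum_{i, l in I} w_i w_l K_il.
    """
    cost = 0.0
    for j in np.unique(labels):
        idx = np.flatnonzero(labels == j)
        wj = w[idx]
        Kj = K[np.ix_(idx, idx)]  # kernel sub-matrix of this cluster
        cost += wj @ np.diag(Kj) - (wj @ Kj @ wj) / wj.sum()
    return float(cost)
```

On the full dataset this needs an $n \times n$ kernel matrix; on a weighted coreset $S$ the same function only needs an $|S| \times |S|$ matrix, which is where the savings come from.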

Paper Structure

This paper contains 23 sections, 8 theorems, 10 equations, 3 figures, 1 table, 4 algorithms.

Key Result

Theorem 1.1

Given an $n$-point weighted dataset $X$, oracle access to a kernel $K : X \times X \to \mathbb{R}$, an integer $k \geq 1$, and $0 < \epsilon < 1$, one can construct, in time $\tilde{O}(nk)$, a reweighted subset $S \subseteq X$ of size $|S| = \mathrm{poly}(k\epsilon^{-1})$, that wit

Figures (3)

  • Figure 1: Tradeoffs between coreset size and empirical error.
  • Figure 2: Speedup of kernelized $k$-Means++ using our coreset. This experiment is conducted on the Twitter dataset with RBF and polynomial kernels. We run each algorithm 10 times, and report the average running time and the minimum objective value (in relative-error evaluation).
  • Figure 3: Speedup of spectral clustering using coreset-based kernelized $k$-Means++, with coreset size $N=2000$. Similar to Figure 2, we run each algorithm 10 times, and report the average running time and the minimum objective value.

Theorems & Definitions (11)

  • Theorem 1.1: Informal version of the main theorem
  • Corollary 1.2: FPT-PTAS
  • Corollary 1.3: Streaming kernel $k$-Means
  • Corollary 2.2
  • Theorem 3.1
  • Lemma 3.2
  • proof
  • Corollary 3.3
  • proof
  • Theorem 3.4: DBLP:conf/soda/BravermanJKW21
  • ...and 1 more
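
The streaming result listed as Corollary 1.3 is described in the TL;DR as following from merge-and-reduce over composable coresets. The sketch below illustrates the generic merge-and-reduce framework only. It is an assumption-laden illustration, not the paper's algorithm: `reduce_uniform` is a hypothetical placeholder that subsamples uniformly and rescales weights, and it carries no coreset guarantee. A real implementation would plug an actual coreset construction into `reduce_fn`.

```python
import numpy as np


def reduce_uniform(points, weights, size, rng):
    """Placeholder reduce step (NOT a coreset): uniform subsample, then
    rescale the kept weights so the total weight is preserved."""
    n = len(points)
    if n <= size:
        return points, weights
    idx = rng.choice(n, size=size, replace=False)
    scale = weights.sum() / weights[idx].sum()
    return points[idx], weights[idx] * scale


def merge_and_reduce(stream_blocks, size, reduce_fn, rng):
    """Maintain at most one summary per level, like a binary counter.

    Each incoming block is reduced to a level-0 summary. Whenever two
    summaries share a level, they are merged (union) and reduced again,
    producing one summary at the next level.
    """
    levels = {}  # level -> (points, weights)
    for block in stream_blocks:
        cur = reduce_fn(block, np.ones(len(block)), size, rng)
        lvl = 0
        while lvl in levels:
            pts, wts = levels.pop(lvl)
            merged_pts = np.vstack([pts, cur[0]])
            merged_wts = np.concatenate([wts, cur[1]])
            cur = reduce_fn(merged_pts, merged_wts, size, rng)
            lvl += 1
        levels[lvl] = cur
    # The final summary is the union of the per-level summaries.
    all_pts = np.vstack([p for p, _ in levels.values()])
    all_wts = np.concatenate([w for _, w in levels.values()])
    return all_pts, all_wts
```

With $m$ blocks, only $O(\log m)$ summaries are stored at any time. Since each reduce step on a true coreset introduces a $(1+\epsilon)$ factor, the usual approach is to run each level with a smaller error parameter so the accumulated error stays bounded.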