Coresets for Kernel Clustering

Shaofeng H.-C. Jiang, Robert Krauthgamer, Jianing Lou, Yubo Zhang

TL;DR

This work addresses the computational bottleneck of kernel $k$-Means by constructing $\epsilon$-coresets that work for general kernels and preserve the clustering cost for every choice of $k$ centers. By embedding the potentially infinite-dimensional feature space into a finite-dimensional Euclidean space and applying Euclidean coreset theory, the authors obtain a coreset of size $\mathrm{poly}(k/\epsilon)$ that can be constructed in time near-linear in $n$, which yields a $(1+\epsilon)$-approximation algorithm and a streaming variant. The approach generalizes to kernel $(k,z)$-Clustering, giving an FPT-PTAS, and the coresets are composable, so they fit the merge-and-reduce framework used in streaming settings. Empirically, the coresets achieve small error with very few points and speed up kernel $k$-Means++ and spectral clustering across several kernels and datasets. In practice this means faster kernel-based and spectral clustering with provable guarantees, scalable to large datasets and compatible with streaming and distributed settings.
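
For reference, the guarantee that "preserving the clustering cost" refers to can be written out as follows. This is the standard formulation of an $\epsilon$-coreset for kernel $k$-Means; the symbols $X$ (dataset), $\varphi$ (feature map of the kernel), $\mathcal{H}$ (feature space), $S$ (coreset) and $w_S$ (coreset weights) are notation assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: X dataset, \varphi feature map, \mathcal{H} feature space,
% S \subseteq X the coreset with weights w_S.
\operatorname{cost}(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert \varphi(x) - c \rVert^{2},
\qquad C \subseteq \mathcal{H},\ |C| = k.
\\[4pt]
% S is an \epsilon-coreset if, for every such C,
\sum_{s \in S} w_S(s)\, \min_{c \in C} \lVert \varphi(s) - c \rVert^{2}
\;\in\; (1 \pm \epsilon)\cdot \operatorname{cost}(X, C).
```

The requirement holds simultaneously for all center sets $C$, which is why any algorithm run on the weighted set $S$ inherits an approximation guarantee on $X$.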

Abstract

We devise coresets for kernel $k$-Means with a general kernel, and use them to obtain new, more efficient algorithms. Kernel $k$-Means has superior clustering capability compared to classical $k$-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel $k$-Means that works for a general kernel and has size $\mathrm{poly}(k\epsilon^{-1})$. Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in $n$. This result immediately implies new algorithms for kernel $k$-Means, such as a $(1+\epsilon)$-approximation in time near-linear in $n$, and a streaming algorithm using space and update time $\mathrm{poly}(k \epsilon^{-1} \log n)$. We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel $k$-Means++ (the kernelized version of the widely used $k$-Means++ algorithm), and we further use this faster kernel $k$-Means++ for spectral clustering. In both applications, we achieve significant speedup and a better asymptotic growth while the error is comparable to baselines that do not use coresets.
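
To make "preserves the clustering costs" concrete, here is a minimal Python sketch of how a (weighted) kernel $k$-Means cost can be evaluated using only kernel values. It is not the authors' coreset construction. It assumes, for illustration only, that each center is a linear combination of feature-mapped reference points, $c_i = \sum_j \mathrm{coef}_{ij}\,\varphi(z_j)$; the names `gaussian_kernel`, `kernel_kmeans_cost`, `Z` and `coef` are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_kmeans_cost(kernel, X, Z, coef, weights=None):
    """Weighted kernel k-Means cost of X against k feature-space centers.

    Assumption: center i is sum_j coef[i, j] * phi(Z[j]), so every distance
    can be expanded with the kernel trick:
        ||phi(x) - c_i||^2 = K(x, x) - 2 <phi(x), c_i> + ||c_i||^2.
    """
    w = np.ones(len(X)) if weights is None else np.asarray(weights, float)
    # K(x, x) for every data point.
    k_xx = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])
    # <phi(x), c_i> for every point/center pair, shape (n, k).
    cross = kernel(X, Z) @ coef.T
    # ||c_i||^2 = coef[i] K(Z, Z) coef[i]^T for every center, shape (k,).
    norms = np.einsum("ij,jl,il->i", coef, kernel(Z, Z), coef)
    d2 = k_xx[:, None] - 2.0 * cross + norms[None, :]
    # Clamp tiny negative values caused by floating-point rounding.
    return float(w @ np.maximum(d2, 0.0).min(axis=1))
```

Calling `kernel_kmeans_cost` once on the full dataset and once on a weighted subset, with the same centers, gives the two quantities whose ratio an $\epsilon$-coreset bounds within $1 \pm \epsilon$.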
