Mini-Batch Kernel $k$-means

Ben Jourdan; Gregory Schwartzman

TL;DR

This work presents the first mini-batch kernel $k$-means algorithm. It offers an order-of-magnitude improvement in running time over the full-batch algorithm and, when initialized with $k$-means++, achieves an approximation ratio of $O(\log k)$ in expectation.

Abstract

We present the first mini-batch kernel $k$-means algorithm, offering an order-of-magnitude improvement in running time compared to the full-batch algorithm. A single iteration of our algorithm takes $\widetilde{O}(kb^2)$ time, significantly faster than the $O(n^2)$ time required by full-batch kernel $k$-means, where $n$ is the dataset size and $b$ is the batch size. Extensive experiments demonstrate that our algorithm consistently achieves a 10-100x speedup with minimal loss in quality, addressing the slow runtime that has limited the adoption of kernel $k$-means in practice. We complement these results with a theoretical analysis under an early stopping condition, proving that with a batch size of $\widetilde{\Omega}(\max\{\gamma^{4}, \gamma^{2}\} \cdot \epsilon^{-2})$, the algorithm terminates in $O(\gamma^2/\epsilon)$ iterations with high probability, where $\gamma$ bounds the norm of points in feature space and $\epsilon$ is a termination threshold. Our analysis holds for any reasonable center initialization, and when using $k$-means++ initialization, the algorithm achieves an approximation ratio of $O(\log k)$ in expectation. For normalized kernels, such as the Gaussian or Laplacian kernel, it holds that $\gamma = 1$. Taking $\epsilon = O(1)$ and $b = \Theta(\log n)$, the algorithm terminates in $O(1)$ iterations, with each iteration running in $\widetilde{O}(k)$ time.
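To make the role of the batch size $b$ concrete, the following is a minimal sketch of the kernel trick that any kernel $k$-means variant relies on: squared feature-space distances to cluster means are computed from a Gram matrix alone, without ever forming the feature map. It is not the paper's algorithm, which additionally uses a recursive distance update rule and truncated centers. The function names `gaussian_kernel` and `batch_distances`, and the choice to define each center as the mean of the batch points assigned to it, are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)).

    The kernel is normalized: K(x, x) = 1, so every point has norm 1
    in feature space (gamma = 1 in the abstract's notation).
    """
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def batch_distances(K_bb, assign, k):
    """Squared feature-space distances from each batch point to each center.

    Illustrative assumption: center j is the mean of phi(p) over the batch
    points p currently assigned to cluster j. Then
        ||phi(x) - c_j||^2 = K(x, x) - 2 * mean_p K(x, p) + mean_{p, q} K(p, q),
    which needs only the b x b Gram matrix K_bb of the batch.
    """
    b = K_bb.shape[0]
    D = np.full((b, k), np.inf)          # empty clusters stay at +inf
    diag = np.diag(K_bb)
    for j in range(k):
        members = np.flatnonzero(assign == j)
        if members.size == 0:
            continue
        cross = K_bb[:, members].mean(axis=1)            # mean_p K(x, p)
        self_term = K_bb[np.ix_(members, members)].mean()  # ||c_j||^2
        D[:, j] = diag - 2.0 * cross + self_term
    return D

# Sanity check with the linear kernel, where the feature map is the identity
# and the result can be compared against explicit Euclidean distances.
rng = np.random.default_rng(0)
B = rng.normal(size=(8, 3))
assign = np.array([0, 0, 1, 1, 2, 2, 0, 1])
D_linear = batch_distances(B @ B.T, assign, k=3)
centers = np.stack([B[assign == j].mean(axis=0) for j in range(3)])
D_explicit = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
new_assign = D_linear.argmin(axis=1)     # one reassignment step on the batch
```

Because the sketch touches only the $b \times b$ Gram matrix of the batch, its cost depends on $b$ rather than on the dataset size $n$, which is the intuition behind replacing the $O(n^2)$ full-batch iteration with one whose cost scales with $kb^2$.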

Paper Structure (28 sections, 13 theorems, 22 equations, 13 figures, 1 table, 2 algorithms)

Introduction
Problem statement
Lloyd's algorithm
Mini-batch $k$-means
Lloyd's algorithm in feature space
Mini-batch kernel $k$-means
Related work
Preliminaries
Kernel $k$-means
Our Algorithm
Recursive distance update rule
Truncating the centers
Algorithm implementation and runtime
Termination guarantee
Section preliminaries
...and 13 more sections

Key Result

Theorem 1

The following holds for our algorithm: (1) Each iteration takes $O(kb^2\log^2(\gamma/\epsilon))$ time, (2) If $b=\Omega(\max\{\gamma^{4}, \gamma^{2}\}\,\epsilon^{-2} \log^2(\gamma n/\epsilon))$ then it terminates in $O(\gamma^2/\epsilon)$ iterations w.h.p., (3) When initialized

Figures (13)

Figure 1: Our results for a batch size of 1024 and $\tau = 200$ using the Gaussian kernel. The $\beta$ prefix denotes that the algorithm uses the learning rate of Schwartzman23. Black denotes the time required to compute the kernel.
Figure 2: Experimental results on the MNIST dataset where the kernel algorithms use the Gaussian kernel.
Figure 3: Experimental results on the MNIST dataset where the kernel algorithms use the k-nn kernel.
Figure 4: Experimental results on the MNIST dataset where the kernel algorithms use the Heat kernel.
Figure 5: Experimental results on the Har dataset where the kernel algorithms use the Gaussian kernel.
...and 8 more figures

Theorems & Definitions (24)

Theorem 1
Definition 2
Lemma 3
proof
Lemma 4
proof
Lemma 5
proof
Theorem 6: naor2012banach
Theorem 7: hoeffding1994probability
...and 14 more
