
Mini-Batch Kernel $k$-means

Ben Jourdan, Gregory Schwartzman

TL;DR

This work presents the first mini-batch kernel $k$-means algorithm, offering an order of magnitude improvement in running time compared to the full batch algorithm, and achieves an approximation ratio of $O(\log k)$ in expectation.

Abstract

We present the first mini-batch kernel $k$-means algorithm, offering an order of magnitude improvement in running time compared to the full batch algorithm. A single iteration of our algorithm takes $\widetilde{O}(kb^2)$ time, significantly faster than the $O(n^2)$ time required by the full batch kernel $k$-means, where $n$ is the dataset size and $b$ is the batch size. Extensive experiments demonstrate that our algorithm consistently achieves a 10-100x speedup with minimal loss in quality, addressing the slow runtime that has limited kernel $k$-means adoption in practice. We further complement these results with a theoretical analysis under an early stopping condition, proving that with a batch size of $\widetilde{\Omega}(\max \{\gamma^{4}, \gamma^{2}\} \cdot \epsilon^{-2})$, the algorithm terminates in $O(\gamma^2/\epsilon)$ iterations with high probability, where $\gamma$ bounds the norm of points in feature space and $\epsilon$ is a termination threshold. Our analysis holds for any reasonable center initialization, and when using $k$-means++ initialization, the algorithm achieves an approximation ratio of $O(\log k)$ in expectation. For normalized kernels, such as Gaussian or Laplacian, it holds that $\gamma=1$. Taking $\epsilon = O(1)$ and $b=\Theta(\log n)$, the algorithm terminates in $O(1)$ iterations, with each iteration running in $\widetilde{O}(k)$ time.
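
To make the quantities in the abstract concrete, the following is a minimal sketch of the standard kernel-trick identity for the squared feature-space distance between a point and a cluster mean. It is not the paper's algorithm. The function names, the toy data, and the choice of `cluster` are assumptions made for illustration. The sketch also checks that the Gaussian kernel satisfies $K(x,x)=1$, which is why $\gamma=1$ for normalized kernels.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); note that K(x, x) = 1."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sq_dist_to_center(x, S, kernel):
    """||phi(x) - (1/|S|) sum_{y in S} phi(y)||^2 using kernel evaluations only."""
    m = len(S)
    # Inner product <phi(x), center>
    cross = sum(kernel(x, y) for y in S) / m
    # Squared norm of the center: |S|^2 kernel evaluations
    center_norm = sum(kernel(y, z) for y in S for z in S) / m ** 2
    return kernel(x, x) - 2.0 * cross + center_norm

random.seed(0)
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
cluster = points[:10]  # hypothetical cluster, for illustration only

# gamma bounds ||phi(x)|| = sqrt(K(x, x)) over the dataset
gamma = max(math.sqrt(gaussian_kernel(p, p)) for p in points)
dists = [sq_dist_to_center(p, cluster, gaussian_kernel) for p in points]
```

The double sum in `center_norm` grows quadratically with the cluster size. When the clusters together cover all $n$ points, this is one way to see where a quadratic per-iteration cost can arise in the full batch setting, and why restricting the computation to a batch of size $b$ is attractive.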

Paper Structure

This paper contains 28 sections, 13 theorems, 22 equations, 13 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

The following holds for Algorithm 2: (1) Each iteration takes $O(kb^2\log^2(\gamma/\epsilon))$ time, (2) If $b=\Omega(\max \{ \gamma^{4}, \gamma^{2} \} \epsilon^{-2} \log^2 (\gamma n/\epsilon))$ then it terminates in $O(\gamma^2/\epsilon)$ iterations w.h.p., (3) When initialized
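
As a rough aid for reading items (1) and (2), the sketch below evaluates the expressions inside the asymptotic bounds for given $\gamma$, $\epsilon$, and $n$. The constant `c` is an assumption standing in for the unspecified constants hidden by $\Omega(\cdot)$ and $O(\cdot)$, so the outputs indicate scaling only and are not actual batch sizes or iteration counts. The function names are hypothetical.

```python
import math

def batch_size_expr(gamma, eps, n, c=1.0):
    """c * max(gamma^4, gamma^2) * eps^-2 * log^2(gamma * n / eps).

    c is an assumed placeholder for the constant hidden by Omega(.).
    """
    return c * max(gamma ** 4, gamma ** 2) * eps ** -2 * math.log(gamma * n / eps) ** 2

def iteration_expr(gamma, eps, c=1.0):
    """c * gamma^2 / eps, with c again an assumed placeholder constant."""
    return c * gamma ** 2 / eps

# Normalized kernel (gamma = 1) with a constant threshold: the batch-size
# expression grows only like log^2(n), and the iteration expression is constant.
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, batch_size_expr(1.0, 1.0, n), iteration_expr(1.0, 1.0))
```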

Figures (13)

  • Figure 1: Our results for a batch size of 1024 and $\tau = 200$ using the Gaussian kernel. We use the $\beta$ prefix to denote that the algorithm uses the learning rate of Schwartzman23. Black denotes the time required to compute the kernel.
  • Figure 2: Experimental results on the MNIST dataset where the kernel algorithms use the Gaussian kernel.
  • Figure 3: Experimental results on the MNIST dataset where the kernel algorithms use the k-nn kernel.
  • Figure 4: Experimental results on the MNIST dataset where the kernel algorithms use the Heat kernel.
  • Figure 5: Experimental results on the HAR dataset where the kernel algorithms use the Gaussian kernel.
  • ...and 8 more figures

Theorems & Definitions (24)

  • Theorem 1
  • Definition 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Theorem 6: naor2012banach
  • Theorem 7: hoeffding1994probability
  • ...and 14 more