A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

Poojan Shah, Shashwat Agrawal, Ragesh Jaiswal

TL;DR

This work addresses the computational bottleneck of $k$-means++ seeding by introducing a rejection-sampling framework (RS-$k$-means++) that speeds up $D^2$-sampling. It provides two variants: a fast version with preprocessing time $\tilde{O}(\mathrm{nnz}(\mathcal{X}))$ and per-clustering time $O(mk^2d\log k)$ that achieves $\mathbb{E}[\Delta(\mathcal{X},S)] \le 8(\ln k+2)\Delta_k(\mathcal{X}) + \frac{6k}{k^{\frac{cm}{2\beta(\mathcal{X})}}-1}\Delta_1(\mathcal{X})$, and a second variant that adds a scale-invariant additive term $k^{-\Omega(m/\beta(\mathcal{X}))}\mathrm{Var}(\mathcal{X})$ to the baseline $O(\log k)$ guarantee, improving on the prior additive term of $O(m^{-1})\mathrm{Var}(\mathcal{X})$. The approach relies on a rejection sampler, backed by a sampling data structure, that converts easy-to-draw samples into $D^2(\mathcal{X},S)$ samples, allowing fast data updates and parallelization. The theoretical bounds are complemented by extensive experiments showing comparable or better clustering quality at reduced runtimes compared to existing speedups, and the parameter $m$ can be varied to trade solution quality for speed.
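
The conversion of easy samples into $D^2$ samples can be illustrated with a minimal Python sketch. It is not the authors' implementation: the proposal distribution (proportional to the squared distance to the first center $c_1$), the function names, and the `max_trials` cap are assumptions made for illustration. Because $D^2(x, S) \le \|x - c_1\|^2$ whenever $c_1 \in S$, accepting a proposed point with probability $D^2(x,S)/\|x - c_1\|^2$ yields an exact $D^2$ sample. The proposal is drawn here with a linear-time `random.choices` call; a sampling data structure would replace that step.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_rejection_sample(points, centers, rng=None, max_trials=10_000):
    """Return an index i drawn with probability D^2(x_i, S) / sum_j D^2(x_j, S).

    Assumed proposal (illustrative, not from the paper): q(i) is proportional
    to ||x_i - c_1||^2, where c_1 = centers[0].
    """
    rng = rng or random.Random()
    c1 = centers[0]
    # Proposal weights depend only on c_1, so they could be computed once
    # and reused across all later rounds of seeding.
    proposal = [sq_dist(p, c1) for p in points]
    if sum(proposal) == 0.0:
        raise ValueError("all points coincide with the first center")
    for _ in range(max_trials):
        i = rng.choices(range(len(points)), weights=proposal, k=1)[0]
        # D^2(x_i, S): squared distance to the nearest current center.
        target = min(sq_dist(points[i], c) for c in centers)
        # Accept with probability target / proposal[i], which is at most 1.
        if rng.random() * proposal[i] < target:
            return i
    raise RuntimeError("no proposal accepted within max_trials")
```

The expected number of trials in this sketch is $\sum_x \|x - c_1\|^2 / \sum_x D^2(x, S)$, so the loop gets slower as the current centers explain more of the data.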

Abstract

The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem, where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R}^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \beta k^2d)$ while still being $O(\log k)$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality. It incurs an additional scale-invariant factor of $k^{-\Omega(m/\beta)} \operatorname{Var}(\mathcal{X})$ in addition to the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++, improving upon a result of (Bachem et al., 2016a), who get an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$, while still running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.
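
For reference, the $O(|\mathcal{X}|kd)$ cost of the baseline comes from rescanning every point after each new center. The following is a minimal pure-Python sketch of standard $D^2$ seeding in the style of Arthur & Vassilvitskii (2007); the function names and the handling of the degenerate all-zero case are illustrative assumptions, and this is the baseline rather than the paper's accelerated method.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Plain D^2 seeding: every round touches all n points, so O(n k d) overall."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]                   # first center: uniform
    d2 = [sq_dist(p, centers[0]) for p in points]    # D^2(x, S) for current S
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Degenerate case (assumption of this sketch): every point already
            # coincides with a center, so fall back to a uniform choice.
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        newest = centers[-1]
        # O(n d) update per round: this is the step the paper aims to avoid.
        d2 = [min(old, sq_dist(p, newest)) for old, p in zip(d2, points)]
    return centers
```

A call such as `kmeanspp_seed(points, 3, random.Random(0))` returns three of the input points as initial centers.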

Paper Structure

This paper contains 15 sections, 21 theorems, 51 equations, 2 figures, 5 tables, 7 algorithms.

Key Result

Theorem 2.1

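(Main Theorem) Let $m \in \mathbb{N}$ be a parameter and $k \in \mathbb{N}$ be the number of clusters. Let $\mathcal{X} \subset \mathbb{R}^d$ be any dataset of $n$ points and $S$ be the output of $\mathtt{RS\text{-}k\text{-}means}$++ $(\mathcal{X},k,m')$ where $m' = cm\ln k$ for some constant $c > 1$. Then $\mathbb{E}[\Delta(\mathcal{X},S)] \le 8(\ln k+2)\Delta_k(\mathcal{X}) + \frac{6k}{k^{\frac{cm}{2\beta(\mathcal{X})}}-1}\Delta_1(\mathcal{X})$. Here $\beta(\mathcal{X})$ is the parameter described in the abstract, the ratio of the variance of the dataset to the optimal $k$-means cost. As can be seen from the description, the value of $\beta(\mathcal{X})$ is ...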
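
To get a feel for how quickly the additive term in the bound quoted in the TL;DR decays with $m$, consider the hypothetical values $k = 10$ and $c = 2$ (any constant $c > 1$ is permitted; these numbers are illustrative and not taken from the paper). Writing $\beta = \beta(\mathcal{X})$, the coefficient of $\Delta_1(\mathcal{X})$ becomes

$$
\frac{6k}{k^{\frac{cm}{2\beta}} - 1} = \frac{60}{10^{m/\beta} - 1},
\qquad
\begin{aligned}
m = \beta &: \ \tfrac{60}{9} \approx 6.67, \\
m = 2\beta &: \ \tfrac{60}{99} \approx 0.61, \\
m = 4\beta &: \ \tfrac{60}{9999} \approx 0.006.
\end{aligned}
$$

Under these assumed values, each doubling of $m/\beta$ squares the dominant factor $10^{m/\beta}$ in the denominator, in contrast to the $m^{-1}$ decay of the earlier result.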

Figures (2)

  • Figure 1: Data structure for sampling from a vector $v \in \mathbb{R}^4$
  • Figure 2: Trade-off plots
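
The caption of Figure 1 does not describe the data structure itself. One standard choice for this kind of task is a binary sum tree, sketched below under that assumption: the class name `SumTree`, its methods, and the use of squared entries $v_i^2$ as weights are all hypothetical and may differ from what the paper uses. Leaves store nonnegative weights, internal nodes store subtree sums, and both sampling and single-entry updates take $O(\log n)$ time.

```python
import random


class SumTree:
    """Complete binary tree over nonnegative weights (illustrative sketch)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:          # pad leaf count to a power of two
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        """Sum of all weights, stored at the root."""
        return self.tree[1]

    def update(self, i, w):
        """Set weight i to w and repair the sums on the path to the root."""
        node = self.size + i
        self.tree[node] = float(w)
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self, rng):
        """Return index i with probability weight_i / total()."""
        u = rng.random() * self.tree[1]
        node = 1
        while node < self.size:            # descend until a leaf is reached
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.size


# Usage: sample an index of v in R^4 proportionally to its squared entries.
v = [1.0, -2.0, 0.5, 3.0]
tree = SumTree([x * x for x in v])
index = tree.sample(random.Random(0))
```

Because an update only touches one root-to-leaf path, changing a single entry of the vector does not require rebuilding the structure.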

Theorems & Definitions (36)

  • Theorem 2.1
  • Theorem 2.2
  • Definition 4.1
  • Lemma 4.2
  • proof
  • Remark 4.3
  • Lemma 4.4
  • proof
  • Lemma 4.5
  • proof
  • ...and 26 more