Breathing K-Means: Superior K-Means Solutions through Dynamic K-Values

Bernd Fritzke

TL;DR

Breathing K-Means addresses the locality of standard Lloyd-based clustering by dynamically varying the codebook size through breathing cycles: each cycle inserts $m$ centroids near high-error regions and deletes $m$ low-utility centroids, with a freezing mechanism to prevent detrimental removals. Every insertion and deletion step is followed by a Lloyd refinement, and the method terminates when improvements plateau, achieving non-local improvements over the baseline greedy k-means++ initialization. Across 51 diverse problems, Breathing K-Means consistently outperforms greedy k-means++ and nearly all competitors in solution quality, while maintaining favorable CPU-time characteristics: a single run often matches or beats the best of ten baseline runs. The work positions breathing cycles as a robust, scalable enhancement to seeding procedures in k-means, offering a practical alternative to existing methods in scikit-learn pipelines. Minimizing $\phi(\mathcal{C},\mathcal{X})$ with a dynamic codebook yields strong empirical gains at limited extra computational cost.

Abstract

We introduce the breathing k-means algorithm, which on average significantly improves solutions obtained by the widely known greedy k-means++ algorithm, the default method for k-means clustering in the scikit-learn package. The improvements are achieved through a novel "breathing" technique that cyclically increases and decreases the number of centroids based on local error and utility measures. We conducted experiments using greedy k-means++ as a baseline, comparing it with breathing k-means and five other k-means algorithms. Among the methods investigated, only breathing k-means and better k-means++ consistently outperformed the baseline, with breathing k-means demonstrating a substantial lead. This superior performance was maintained even when comparing the best result of ten runs for all other algorithms to a single run of breathing k-means, highlighting its effectiveness and speed. Our findings indicate that the breathing k-means algorithm outperforms the other k-means techniques, especially greedy k-means++ with ten repetitions, which it dominates in both solution quality and speed. This positions breathing k-means (with the built-in initialization by a single run of greedy k-means++) as a superior alternative to running greedy k-means++ on its own.
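
The breathing cycle described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: the initialization is a random sample followed by Lloyd's algorithm (a stand-in for greedy k-means++), utility is taken to be the error increase caused by removing a centroid, freezing is simplified to protecting the nearest surviving neighbor of each removed centroid, and the schedule (start at `m=5`, decrement `m` after a cycle without improvement, stop at `m=0`) as well as all function names and the `tol` and `jitter` parameters are hypothetical choices for this sketch.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_two(points, centroids):
    """Per point: (index of nearest centroid, squared distance to it,
    squared distance to the second-nearest centroid)."""
    rows = []
    for p in points:
        d = sorted((sq_dist(p, c), j) for j, c in enumerate(centroids))
        rows.append((d[0][1], d[0][0], d[1][0] if len(d) > 1 else float("inf")))
    return rows

def phi(points, centroids):
    """Summed squared error of the codebook."""
    return sum(d1 for _, d1, _ in nearest_two(points, centroids))

def lloyd(points, centroids, max_iter=100):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p, (j, _, _) in zip(points, nearest_two(points, centroids)):
            groups[j].append(p)
        # Empty clusters keep their old position.
        updated = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def breathe_in(points, centroids, m, rng, jitter=1e-3):
    """Add a slightly perturbed copy next to the m highest-error centroids."""
    err = [0.0] * len(centroids)
    for j, d1, _ in nearest_two(points, centroids):
        err[j] += d1
    worst = sorted(range(len(centroids)), key=lambda j: err[j], reverse=True)[:m]
    extra = [tuple(x + jitter * rng.uniform(-1, 1) for x in centroids[j])
             for j in worst]
    return lloyd(points, centroids + extra)

def breathe_out(points, centroids, k):
    """Remove low-utility centroids until k remain (simplified freezing)."""
    util = [0.0] * len(centroids)
    for j, d1, d2 in nearest_two(points, centroids):
        util[j] += d2 - d1  # assumed utility: error increase if j were removed
    order = sorted(range(len(centroids)), key=lambda j: util[j])
    n_remove = len(centroids) - k
    removed, frozen = set(), set()
    for j in order:
        if len(removed) == n_remove:
            break
        if j in frozen:
            continue
        removed.add(j)
        # Simplification: freeze only the nearest surviving centroid.
        rest = [i for i in range(len(centroids)) if i not in removed]
        frozen.add(min(rest, key=lambda i: sq_dist(centroids[i], centroids[j])))
    for j in order:  # fallback of this sketch if freezing blocked too many
        if len(removed) == n_remove:
            break
        removed.add(j)
    return lloyd(points, [c for i, c in enumerate(centroids) if i not in removed])

def breathing_kmeans(points, k, m=5, tol=1e-4, seed=0):
    rng = random.Random(seed)
    best = lloyd(points, rng.sample(points, k))  # stand-in initialization
    best_phi = phi(points, best)
    while m > 0:
        cand = breathe_out(points, breathe_in(points, best, m, rng), k)
        cand_phi = phi(points, cand)
        if cand_phi < best_phi * (1 - tol):
            best, best_phi = cand, cand_phi
        else:
            m -= 1  # no improvement: breathe less deeply
    return best, best_phi
```

Calling `breathing_kmeans(points, k)` with `points` given as a list of coordinate tuples returns a codebook of exactly `k` centroids whose error is never worse than that of the initial Lloyd solution, because a cycle's result is kept only when it lowers the error.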

Paper Structure

This paper contains 51 sections, 1 theorem, 18 equations, 10 figures, 19 tables, 8 algorithms.

Key Result

Theorem 1

Let $P \subseteq \mathbb{R}^{d}$ be a set of points and let $C$ be the output of Algorithm 1 with $Z \ge 100000\,k \log \log k$. Then $E[\mathrm{cost}(P, C)] \in O(\mathrm{cost}(P, C^{*}))$, where $C^{*}$ is the set of optimal centers. The algorithm's running time is $O(dnk^{2} \log \log k)$.
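
For orientation, the cost in the theorem is not defined in this excerpt. Assuming it is the usual k-means objective, the sum of squared distances from each point to its nearest center, it can be written as follows and would then coincide with $\phi(\mathcal{C},\mathcal{X})$ from the TL;DR up to notation:

```latex
\mathrm{cost}(P, C) \;=\; \sum_{p \in P} \, \min_{c \in C} \lVert p - c \rVert^{2}
```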

Figures (10)

  • Figure 1: Error and utility values are shown for a problem with data from six equal Gaussian kernels and $k=6$, each centroid placed at a cluster center. While error values are similar, the utilities of centroids differ based on the distance between the nearest and second-nearest centroids. The most useful centroid is in cluster A, while the least useful is in cluster D, followed by those in B and C.
  • Figure 2: The problem of misleading utility values of close neighbors. a) The two centroids in the small cluster A exhibit low utilities (red). b) Removing one of them increases the error $\phi$ only marginally, while the utility of the remaining one rises sharply. c) Removing the second centroid from A at the same time leads to an enormous total error (226.1), and the closest centroid to A becomes highly useful (84.4).
  • Figure 3: Problems with known optimum. The data points are shown in green, and the optimal centroids are in red.
  • Figure 4: K-means problems based on two-dimensional data sets from the literature (see Table \ref{tab:datalit}).
  • Figure 5: Modified literature problems. The data sets were constructed from the problems shown in Figure \ref{fig:prob-lit} by taking a random subset of size 200 and adding a high-density cluster consisting of 4000 data points below the centroid of the subset. The $k$-values remain the same as in Table \ref{tab:datalit}.
  • ...and 5 more figures
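
The effect described in the captions of Figures 1 and 2 can be reproduced on a toy example. The sketch below assumes that the utility of a centroid is the increase in total error when that centroid alone is removed (without a subsequent Lloyd step), which is how the captions read; the paper's exact definition is not part of this excerpt, and the data and names are made up for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def phi(points, centroids):
    """Summed squared distance of every point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def utility(points, centroids, j):
    """Assumed definition: error increase caused by deleting centroid j alone."""
    rest = centroids[:j] + centroids[j + 1:]
    return phi(points, rest) - phi(points, centroids)

# Small cluster A near the origin, served by two close centroids,
# and a second cluster far away, served by one centroid.
cluster_a = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
cluster_b = [(10.0, 10.0), (10.2, 10.0), (10.0, 10.2), (10.2, 10.2)]
points = cluster_a + cluster_b
centroids = [(0.05, 0.1), (0.15, 0.1), (10.1, 10.1)]

u0 = utility(points, centroids, 0)  # small: the neighbor takes over
u1 = utility(points, centroids, 1)  # small for the same reason
# Removing both at once leaves cluster A without any nearby centroid.
both = phi(points, [centroids[2]]) - phi(points, centroids)

print(f"utility of centroid 0: {u0:.3f}")
print(f"utility of centroid 1: {u1:.3f}")
print(f"error increase when both are removed: {both:.3f}")
```

Each of the two close centroids looks nearly useless on its own, yet removing both together is far more costly than the sum of their utilities suggests. This is the situation that a freezing mechanism is meant to rule out.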

Theorems & Definitions (1)

  • Theorem 1