
Multi-Swap $k$-Means++

Lorenzo Beretta, Vincent Cohen-Addad, Silvio Lattanzi, Nikos Parotsidis

TL;DR

This work generalizes the local search algorithm of Lattanzi and Sohler by considering larger and more sophisticated local search neighborhoods, which allow multiple centers to be swapped at the same time. The resulting algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search.

Abstract

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' algorithm of choice for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, thereby allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: we observe significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
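
To make the baseline described in the abstract concrete, the following is a minimal sketch of $k$-means++ seeding followed by single-swap local search steps in the style of Lattanzi and Sohler. It is not the authors' implementation: all function names (`d2_sample`, `kmeanspp_seed`, `single_swap_step`, `local_search`) are hypothetical, the cost is recomputed from scratch for clarity rather than speed, and the multi-swap variant studied in the paper (which swaps several centers at once) is not implemented here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def point_costs(points, centers):
    """Squared distance from each point to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def d2_sample(points, centers, rng):
    """Sample one point with probability proportional to its current cost."""
    costs = point_costs(points, centers)
    if sum(costs) == 0:
        # Every point already coincides with a center; fall back to uniform.
        return rng.choice(points)
    return rng.choices(points, weights=costs, k=1)[0]


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding: first center uniform, the rest by D^2 sampling."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(d2_sample(points, centers, rng))
    return centers


def single_swap_step(points, centers, rng):
    """One local search step (assumed single-swap form).

    Sample a candidate by D^2 sampling, try exchanging it with each current
    center, and keep the best exchange only if it lowers the total cost.
    """
    candidate = d2_sample(points, centers, rng)
    best, best_cost = centers, sum(point_costs(points, centers))
    for i in range(len(centers)):
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = sum(point_costs(points, trial))
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best


def local_search(points, k, steps, seed=0):
    """Seed with k-means++ and then run a fixed number of swap steps."""
    rng = random.Random(seed)
    centers = kmeanspp_seed(points, k, rng)
    for _ in range(steps):
        centers = single_swap_step(points, centers, rng)
    return centers
```

Because a swap is accepted only when it strictly lowers the cost, the total cost never increases from one step to the next. A multi-swap step would, under the same assumptions, sample several candidates and consider exchanging them for several current centers at once, which enlarges the neighborhood searched at each step.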

Paper Structure

This paper contains 27 sections, 22 theorems, 19 equations, and 6 figures.

Key Result

Lemma 1

Given a point set $Q \subseteq P$ and a point $p\in P$ we have

Figures (6)

  • Figure 1: Comparison between MSLS and MSLS-G for $p = 3$ and $k=25$ on the datasets KDD-BIO and RNA. The $y$ axis shows the solution cost divided by the mean solution cost of KM++.
  • Figure 2: The first row compares the cost of MSLS-G, for $p\in\{1,4, 7, 10\}$, divided by the mean cost of KM++ at each LS step, for $k=25$. The legend also reports the running time of MSLS-G per LS step (in seconds). The second row compares the cost after each of the $10$ iterations of Lloyd with seeding from MSLS-G (for $p\in\{1,4, 7, 10\}$ and $15$ local search steps) and from KM++, for $k=25$.
  • Figure 3: Comparison of the cost produced by MSLS-G, for $p\in\{1,4, 7, 10\}$ and $k=25$ on the datasets KDD-BIO and KDD-PHY, divided by the mean cost of KM++, after running for a fixed amount of time expressed as a multiple of the average time of one iteration of Lloyd's algorithm (i.e., for deadlines that are $1\times, \dots, 20\times$ the average time of an iteration of Lloyd).
  • Figure 4: Comparison between MSLS and MSLS-G, for $p =2$ (left column) and $p=3$ (right column), for $k=25$, on the datasets KDD-BIO (first row), KDD-PHY (second row) and RNA (third row). The $y$ axis shows the mean solution cost, over the 5 repetitions of the experiment, divided by the mean solution cost of KM++.
  • Figure 5: We compare the cost of MSLS-G, for $p\in\{1,4, 7, 10\}$, divided by the mean cost of KM++ at each LS step, for $k\in\{10, 25, 50\}$, excluding the degenerate case $p=k=10$. The legend also reports the running time of MSLS-G per LS step (in seconds). The experiments were run on all datasets (KDD-BIO, RNA and KDD-PHY), excluding the case of $k=25$ for KDD-BIO and RNA, which is reported in the main body of the paper.
  • ...and 1 more figure

Theorems & Definitions (38)

  • Lemma 1
  • Lemma 2
  • proof
  • Theorem 3
  • Corollary 4
  • Corollary 5
  • Lemma 6
  • proof: Proof of \ref{thm:multi-swap-analysis}
  • Lemma 7
  • proof
  • ...and 28 more