Accelerating the k-means++ Algorithm by Using Geometric Information

Guillem Rodríguez Corominas, Maria J. Blesa, Christian Blum

TL;DR

This paper proposes an acceleration of the exact k-means++ algorithm that uses geometric information, specifically the Triangle Inequality and additional norm filters, together with a two-step sampling procedure. The accelerated version visits fewer points and calculates fewer distances than the standard k-means++ version.

Abstract

In this paper, we propose an acceleration of the exact k-means++ algorithm using geometric information, specifically the Triangle Inequality and additional norm filters, along with a two-step sampling procedure. Our experiments demonstrate that the accelerated version outperforms the standard k-means++ version in terms of the number of visited points and distance calculations, achieving greater speedup as the number of clusters increases. The version utilizing the Triangle Inequality is particularly effective for low-dimensional data, while the additional norm-based filter enhances performance in high-dimensional instances with greater norm variance among points. Additional experiments show the behavior of our algorithms when executed concurrently across multiple jobs and examine how memory performance impacts practical speedup.
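
The filtering ideas named in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation: the function names (`dist`, `accelerated_kmeanspp`), the data layout, and the exact form of each filter are assumptions made for illustration, based only on standard geometry. For a new center c, an existing center c_j, and a point p assigned to c_j, the Triangle Inequality gives d(p, c) >= d(c, c_j) - d(p, c_j), so c cannot be closer to p whenever d(c, c_j) >= 2 d(p, c_j). The reverse triangle inequality gives d(p, c) >= | ||p|| - ||c|| |, which yields a norm-based filter. Sampling is done in two steps: first a cluster, then a point inside it.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def accelerated_kmeanspp(points, k, rng):
    """Illustrative exact k-means++ seeding with two geometric filters.

    Returns (center indices, distance of each point to its closest center,
    number of point-to-center distances computed).
    """
    n = len(points)
    norms = [math.sqrt(sum(x * x for x in p)) for p in points]
    first = rng.randrange(n)
    centers = [first]
    d = [dist(p, points[first]) for p in points]  # distance to assigned center
    n_dist = n
    members = {0: list(range(n))}  # cluster id -> indices of assigned points

    while len(centers) < k:
        # Two-step sampling: pick a cluster by its share of the total squared
        # distance, then pick a point inside that cluster.
        # (Cluster sums are recomputed here for brevity; a real implementation
        # would presumably maintain them incrementally.)
        sums = [sum(d[i] ** 2 for i in members[j]) for j in range(len(centers))]
        if sum(sums) == 0:
            break  # every point coincides with a center already
        j = rng.choices(range(len(centers)), weights=sums)[0]
        new = rng.choices(members[j], weights=[d[i] ** 2 for i in members[j]])[0]

        c = points[new]
        c_id = len(centers)
        centers.append(new)
        members[c_id] = []

        for j in range(c_id):
            cc = dist(c, points[centers[j]])  # center-to-center distance
            # Cluster radius, recomputed for brevity in this sketch.
            radius = max((d[i] for i in members[j]), default=0.0)
            if cc >= 2 * radius:
                continue  # Triangle Inequality: the whole cluster is skipped
            keep = []
            for i in members[j]:
                if cc >= 2 * d[i]:
                    keep.append(i)  # point-level Triangle Inequality filter
                    continue
                if abs(norms[i] - norms[new]) >= d[i]:
                    keep.append(i)  # norm filter (reverse triangle inequality)
                    continue
                dn = dist(points[i], c)
                n_dist += 1
                if dn < d[i]:
                    d[i] = dn
                    members[c_id].append(i)  # point moves to the new cluster
                else:
                    keep.append(i)
            members[j] = keep

    return centers, d, n_dist
```

Calling `accelerated_kmeanspp(points, k, random.Random(0))` on a list of coordinate tuples returns the chosen center indices together with a count of computed distances, which can be compared against the `k * len(points)` distances that an unfiltered seeding would compute. Both filters only skip points whose assignment provably cannot change, so the resulting distances match those of the unfiltered procedure.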

Paper Structure

This paper contains 26 sections, 17 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Clustering example in two dimensions. The data points are depicted in purple and the centers in red, with the currently assigned center marked by a star shape. The area highlighted in yellow covers twice the radius of the cluster formed by these points, assuming the Euclidean distance (ED). Around each data point, there is a dashed circle with a radius equal to the ED to the point's assigned center. The blue-shaded area delimits the space between the lower and upper bounds for centers to be considered.
  • Figure 2: Percentage of examined points (in relation to the standard $k$-means$++$) for the accelerated $k$-means$++$ version using only the Triangle Inequality (TIE) filter (upper row), and for the accelerated $k$-means$++$ version that also uses the additional norm filter (lower row).
  • Figure 3: Percentage of calculated distances (in relation to the standard $k$-means$++$) for the accelerated $k$-means$++$ version using only the Triangle Inequality (TIE) filter (upper row), and for the accelerated $k$-means$++$ version that also uses the additional norm filter (lower row).
  • Figure 4: Speedups of the accelerated $k$-means$++$ variants relative to the standard $k$-means$++$ algorithm (first and second row), and speedup of the full accelerated $k$-means$++$ variant relative to the $k$-means$++$ variant that does not use the norm filter (third row).
  • Figure 5: Two-dimensional visualization of a subset of instances using PCA, for low-dimensional (top row) and high-dimensional (bottom row) instances.