Accelerating k-Means Clustering with Cover Trees

Andreas Lang, Erich Schubert

TL;DR

A hybrid k-means algorithm is proposed that combines the benefits of tree aggregation and bounds-based filtering. It has relatively low overhead and performs well over a wider parameter range than previous approaches based on the k-d tree.

Abstract

The k-means clustering algorithm is a popular algorithm that partitions data into k clusters. Many improvements have been proposed to accelerate the standard algorithm. Most current research employs upper and lower bounds on point-to-cluster distances and the triangle inequality to reduce the number of distance computations, with only arrays as underlying data structures. These approaches cannot exploit the fact that nearby points are likely to be assigned to the same cluster. We propose a new k-means algorithm based on the cover tree index, which has relatively low overhead and performs well over a wider parameter range than previous approaches based on the k-d tree. By combining this with upper and lower bounds, as in state-of-the-art approaches, we obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
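
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' cover tree algorithm. It shows only the underlying triangle-inequality tests, and the function names (assign_with_filter, node_owner) and the ball-shaped node (a routing point plus a radius) are assumptions made for illustration. The first function skips a distance computation whenever the distance between two centers is at least twice the current upper bound for a point. The second checks whether every point inside a ball must belong to the same center, which is the kind of test that lets a tree node be assigned as a whole.

```python
import math

def assign_with_filter(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality rules out. Returns (labels, skipped_computations)."""
    k = len(centers)
    # Pairwise center-to-center distances, computed once per assignment pass.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for x in points:
        best, upper = 0, math.dist(x, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best),
            # so c_j cannot be strictly closer and needs no distance computation.
            if cc[best][j] >= 2.0 * upper:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < upper:
                best, upper = j, d
        labels.append(best)
    return labels, skipped

def node_owner(routing_point, radius, centers):
    """Return the index of the center that owns every point within `radius`
    of `routing_point`, or None if the ball may straddle two clusters.
    Assumes at least two centers."""
    ranked = sorted((math.dist(routing_point, c), j) for j, c in enumerate(centers))
    (d1, j1), (d2, _) = ranked[0], ranked[1]
    # Any point y in the ball satisfies d(y, c1) <= d1 + radius and
    # d(y, c_other) >= d2 - radius, so a strict gap decides the whole ball.
    return j1 if d1 + radius < d2 - radius else None

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (9.5, 10.2), (4.0, 4.0)]
    ctrs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    print(assign_with_filter(pts, ctrs))
    print(node_owner((0.0, 0.0), 1.0, ctrs))
```

The point-level filter returns the same labels as a brute-force scan, because a center is skipped only when it provably cannot be strictly closer than the current best. The node-level test is conservative: it returns None whenever the ball could contain points of more than one cluster, in which case a tree-based method would descend to the child nodes.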

Paper Structure

This paper contains 13 sections, 17 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Cumulative evaluation relative to the Standard algorithm vs. iterations on the ALOI 64D dataset for $k$ = 400
  • Figure 2: Runtime relative to the Standard algorithm vs. $k$ and $d$, respectively, on the MNIST dataset