Parallelization of the K-Means Algorithm with Applications to Big Data Clustering

Ashish Srivastava, Mohammed Nawfal

TL;DR

Two different approaches to clustering using Lloyd's algorithm are compared. Speed-up, efficiency, and time taken with varying numbers of data points and processes are analyzed to understand the relative performance improvement each approach offers.

Abstract

K-Means clustering using Lloyd's algorithm is an iterative approach to partitioning a given dataset into K different clusters. The algorithm assigns each point to a cluster so as to minimize the following objective function \[ \min \sum_{i=1}^{n} \lVert x_i - \mu_{x_i} \rVert^2 \] where \(\mu_{x_i}\) denotes the centroid of the cluster to which \(x_i\) is assigned. The serial algorithm iterates over two steps: we compute the distance of each datapoint from the centroids and assign the datapoint to the nearest centroid, and we then recompute each centroid as the mean of the datapoints assigned to it. This alternation is analogous to the expectation and maximization steps of the expectation-maximization algorithm. Clustering involves extensive distance computations at each iteration, and their cost grows with the number of data points. This provides scope for parallelism. However, we must ensure that in a parallel run each thread has access to the updated centroid values and that no race condition exists on any centroid value. We compare two different approaches in this project. The first is an OpenMP flat synchronous method in which all threads run in parallel and synchronization ensures safe updates of the clusters. The second is a GPU-based approach using OpenACC, in which we use the GPU architecture to parallelize chunks of the algorithm and reduce computation time. We analyze metrics such as speed-up, efficiency, and time taken with varying numbers of data points and processes to compare the two approaches and understand the relative performance improvement we can get.
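
As a rough illustration of the synchronous shared-memory scheme described in the abstract, the following C sketch runs Lloyd's iterations with OpenMP pragmas. It is not the authors' implementation. The function name `kmeans_lloyd`, the row-major data layout, the use of per-thread partial sums merged inside a critical section, and the stopping rule (no label changes) are all assumptions made for illustration. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>
#include <string.h>

/* Squared Euclidean distance between two dim-dimensional points. */
static double sq_dist(const double *a, const double *b, int dim) {
    double s = 0.0;
    for (int d = 0; d < dim; ++d) {
        double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

/*
 * Hypothetical helper (illustrative only, not taken from the paper).
 * points:    n * dim values, row-major
 * centroids: k * dim values; initial guesses on entry, final values on exit
 * labels:    n entries; receives the cluster index of each point
 * Returns the number of iterations performed.
 */
int kmeans_lloyd(const double *points, int n, int dim,
                 double *centroids, int k, int *labels, int max_iter) {
    assert(n > 0 && dim > 0 && k > 0 && max_iter > 0);
    double *sums = malloc((size_t)k * dim * sizeof *sums);
    int *counts = malloc((size_t)k * sizeof *counts);
    assert(sums && counts);
    for (int i = 0; i < n; ++i) labels[i] = -1;

    int iter = 0;
    while (iter < max_iter) {
        ++iter;
        int changed = 0;
        memset(sums, 0, (size_t)k * dim * sizeof *sums);
        memset(counts, 0, (size_t)k * sizeof *counts);

        /* Centroids are only read inside this region, so every thread sees
           the same values; all writes go to thread-private buffers. */
        #pragma omp parallel reduction(+:changed)
        {
            double *lsums = calloc((size_t)k * dim, sizeof *lsums);
            int *lcounts = calloc((size_t)k, sizeof *lcounts);
            assert(lsums && lcounts);

            /* Assignment step: each point goes to its nearest centroid. */
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                const double *p = points + (size_t)i * dim;
                int best = 0;
                double best_d = DBL_MAX;
                for (int c = 0; c < k; ++c) {
                    double d = sq_dist(p, centroids + (size_t)c * dim, dim);
                    if (d < best_d) { best_d = d; best = c; }
                }
                if (labels[i] != best) { labels[i] = best; ++changed; }
                for (int d = 0; d < dim; ++d) lsums[(size_t)best * dim + d] += p[d];
                ++lcounts[best];
            }

            /* Merge partial sums one thread at a time to avoid races. */
            #pragma omp critical
            {
                for (int j = 0; j < k * dim; ++j) sums[j] += lsums[j];
                for (int c = 0; c < k; ++c) counts[c] += lcounts[c];
            }
            free(lsums);
            free(lcounts);
        }

        /* Update step: each centroid becomes the mean of its points.
           Empty clusters keep their previous centroid. */
        for (int c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;
            for (int d = 0; d < dim; ++d)
                centroids[(size_t)c * dim + d] =
                    sums[(size_t)c * dim + d] / counts[c];
        }
        if (changed == 0) break; /* assignments are stable */
    }
    free(sums);
    free(counts);
    return iter;
}
```

The implicit barrier at the end of the parallel region is what makes the scheme synchronous in this sketch: the centroids are updated only after every thread has finished its share of the assignment step, so the next iteration starts from a consistent set of centroid values.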

Paper Structure (10 sections, 1 equation, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Results of Serial K-Means on 1M datapoints
  • Figure 2: Results of Parallel K-Means on 1M datapoints
  • Figure 3: Results of Serial K-Means on 400k datapoints
  • Figure 4: Results of Parallel K-Means on 400k datapoints
  • Figure 5: Results of Serial K-Means on 500k 2D Dataset
  • ...and 7 more figures