Table of Contents
Fetching ...

Inference with K-means

Alfred K. Adzika, Prudence Djagba

TL;DR

This research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach, finding that a larger number of clusters or partitions tends to yield lower errors while increasing the number of assigned data points does not significantly improve inference errors.

Abstract

This thesis aims to invent new approaches for making inferences with the k-means algorithm. k-means is an iterative clustering algorithm that randomly assigns k centroids, then assigns data points to the nearest centroid, and updates centroids based on the mean of assigned points. This process continues until convergence, forming k clusters where each point belongs to the closest centroid. This research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach. Through extensive experimentation and analysis, key findings have emerged. It is observed that a larger number of clusters or partitions tends to yield lower errors while increasing the number of assigned data points does not significantly improve inference errors. Reducing losses in the learning process does not significantly impact overall inference errors. Indicating that as learning is going on inference errors remain unchanged. Recommendations include the need for specialized inference techniques to estimate better data points derived from multi-clustered data and exploring methods that yield improved results with larger assigned datasets. By addressing these recommendations, this research advances the accuracy and reliability of inferences made with the k-means algorithm, bridging the gap between clustering and non-parametric density estimation and inference.

Inference with K-means

TL;DR

This research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach, finding that a larger number of clusters or partitions tends to yield lower errors while increasing the number of assigned data points does not significantly improve inference errors.

Abstract

This thesis aims to invent new approaches for making inferences with the k-means algorithm. k-means is an iterative clustering algorithm that randomly assigns k centroids, then assigns data points to the nearest centroid, and updates centroids based on the mean of assigned points. This process continues until convergence, forming k clusters where each point belongs to the closest centroid. This research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach. Through extensive experimentation and analysis, key findings have emerged. It is observed that a larger number of clusters or partitions tends to yield lower errors while increasing the number of assigned data points does not significantly improve inference errors. Reducing losses in the learning process does not significantly impact overall inference errors. Indicating that as learning is going on inference errors remain unchanged. Recommendations include the need for specialized inference techniques to estimate better data points derived from multi-clustered data and exploring methods that yield improved results with larger assigned datasets. By addressing these recommendations, this research advances the accuracy and reliability of inferences made with the k-means algorithm, bridging the gap between clustering and non-parametric density estimation and inference.

Paper Structure

This paper contains 36 sections, 11 equations, 8 figures.

Figures (8)

  • Figure 1: A Voronoi diagram
  • Figure 2: An illustration of the partitioning of the Gaussian and Uniform distribution using the online balanced k-means.
  • Figure 3: Data generated with different relations between them.
  • Figure 4: Data generated with different relations between them.
  • Figure 5: Effects of the number of clusters on losses and errors over every 1000 iterations
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2