A novel k-means clustering approach using two distance measures for Gaussian data

Naitik Gada

TL;DR

This work tackles the robustness and accuracy of k-means clustering when data follow a Gaussian-like structure by introducing a dual-distance objective that combines within-cluster distance ($WCD$) and inter-cluster distance ($ICD$). The number of clusters $k$ is selected using the Calinski-Harabasz criterion, and the algorithm updates centroids under the joint $WCD$/$ICD$ objective, with performance evaluated via overall accuracy (OA), precision, recall, and F1. Across synthetic (2D/3D) and real-world UCI benchmarks (Iris, Wine, Breast Cancer), the ICD-enhanced approach consistently outperforms traditional k-means, showing better convergence, robustness to initialization, and improved handling of outliers. These results suggest that incorporating inter-cluster separation into the k-means objective yields a practical and effective clustering alternative for Gaussian-like data, motivating further exploration in higher dimensions and with diverse initializations.

Abstract

Clustering algorithms have long been a topic of research and represent the more popular side of unsupervised learning. Because cluster analysis is one of the best ways to find clarity and structure in raw data, this paper explores a novel approach to $k$-means clustering. We present a $k$-means algorithm that uses both the within-cluster distance (WCD) and the inter-cluster distance (ICD) as its distance measures, grouping the data into $k$ clusters, where $k$ is determined in advance by the Calinski-Harabasz criterion, in order to produce more robust clustering output. The idea is that including both measures makes the convergence of the data into their clusters more stable and robust. We run the algorithm on synthetically produced data and on benchmark data sets obtained from the UCI repository. The results show that the data converge to their respective clusters more accurately when both WCD and ICD are used. The algorithm is also better than the traditional $k$-means method at assigning outliers to their true clusters. We also discuss some possible research topics that emerged as we answered the questions we initially set out to address.
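
The abstract states that $k$ is fixed in advance by the Calinski-Harabasz criterion. As a rough illustration of that selection step only, the sketch below scores several candidate values of $k$ with the standard Calinski-Harabasz index (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) and keeps the highest-scoring one. The clustering used here is ordinary k-means, not the WCD/ICD variant proposed in the paper. All helper names, the farthest-point initialization, the candidate range, and the blob centers in the demo are assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def farthest_point_init(points, k):
    # Assumed deterministic initialization (not from the paper): start at
    # the first point, then repeatedly add the point farthest from the
    # centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Ordinary (Lloyd) k-means; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to become empty.
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels

def calinski_harabasz(points, labels):
    """CH index: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    ids = sorted(set(labels))
    k = len(ids)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")
    overall = mean(points)
    between = within = 0.0
    for j in ids:
        members = [p for p, lab in zip(points, labels) if lab == j]
        c = mean(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def choose_k(points, candidates=range(2, 7)):
    """Return the candidate k with the highest CH index, plus all scores."""
    scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Demo on synthetic 2D Gaussian blobs (variance 0.5; centers are assumed).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
data = [
    (cx + rng.gauss(0, 0.5 ** 0.5), cy + rng.gauss(0, 0.5 ** 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]
best_k, ch_scores = choose_k(data)
print("selected k:", best_k)
```

The within-cluster and between-cluster dispersions inside `calinski_harabasz` are related in spirit to the WCD and ICD quantities named above, but they should not be read as the paper's definitions of those measures.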

Paper Structure

This paper contains 25 sections, 13 equations, 17 figures, 7 tables, 1 algorithm.

Figures (17)

  • Figure 1: 2-dimensional data with variance = 0.5
  • Figure 2: Overall Accuracy of the proposed k-means vs traditional k-means over 100 iterations (variance = 0.5)
  • Figure 3: 2-dimensional data with variance = 1
  • Figure 4: Overall Accuracy of the proposed k-means vs traditional k-means over 100 iterations (variance = 1)
  • Figure 5: 3-dimensional data with variance = 0.5
  • ...and 12 more figures