K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances

Lovis Kwasi Armah, Igor Melnykov

TL;DR

This work tackles clustering with incomplete data by unifying imputation and clustering while incorporating Mahalanobis distances to better model non-spherical, elliptical clusters. It introduces a joint framework that optimizes cluster assignments, centers, missing values, and cluster covariances, using a log-likelihood-based objective. Empirical results on Iris and synthetic ellipsoidal data show that the proposed K-Mahal method consistently outperforms both imputation-then-K-means and the earlier unified approach, especially when missingness is moderate and cluster shapes are non-spherical. The findings highlight the practical value of jointly imputing and clustering with a shape-aware metric, and point to future work on tailored imputation techniques for elliptical clusters.

Abstract

Effectively applying the K-means algorithm to clustering tasks with incomplete features remains an important research area due to its impact on real-world applications. Recent work has shown that unifying K-means clustering and imputation into a single objective function and solving the resulting optimization problem yields better results than handling imputation and clustering separately. In this work, we extend this approach by developing a unified K-means algorithm that replaces the traditional Euclidean distance with the Mahalanobis distance, which previous research has shown to perform better for clusters with elliptical shapes. We conducted extensive experiments on synthetic datasets containing up to ten elliptical clusters, as well as on the IRIS dataset. Using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), we demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means (using either Mahalanobis or Euclidean distance) and K-Means with Incomplete Data, a recent K-means algorithm that integrates imputation and clustering for handling incomplete data. These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
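
To illustrate why the choice of distance matters for elliptical clusters, the following is a minimal sketch, not the authors' implementation. It compares Euclidean and Mahalanobis assignment of a single 2-D point to two clusters. The centers, covariances, test point, and all function names are hypothetical values chosen for illustration, and the sketch ignores imputation and the log-likelihood objective entirely.

```python
def euclidean_sq(x, center):
    """Squared Euclidean distance between two 2-D points."""
    return (x[0] - center[0]) ** 2 + (x[1] - center[1]) ** 2


def mahalanobis_sq(x, center, cov):
    """Squared Mahalanobis distance for 2-D points and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - center[0], x[1] - center[1])
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))


def assign_euclidean(x, centers):
    """Index of the nearest center under Euclidean distance."""
    return min(range(len(centers)), key=lambda k: euclidean_sq(x, centers[k]))


def assign_mahalanobis(x, centers, covs):
    """Index of the nearest center under per-cluster Mahalanobis distance."""
    return min(range(len(centers)),
               key=lambda k: mahalanobis_sq(x, centers[k], covs[k]))


# Hypothetical setup: cluster 0 is elongated along the x-axis,
# cluster 1 is small and round.
centers = [(0.0, 0.0), (3.0, 1.5)]
covs = [((9.0, 0.0), (0.0, 0.25)),
        ((0.25, 0.0), (0.0, 0.25))]
x = (2.5, 0.2)

print("Euclidean assignment:  ", assign_euclidean(x, centers))
print("Mahalanobis assignment:", assign_mahalanobis(x, centers, covs))
```

In this toy setup the point lies along the long axis of cluster 0, so the Mahalanobis rule assigns it to cluster 0, while the Euclidean rule assigns it to the geometrically nearer center of cluster 1.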

Paper Structure

This paper contains 17 sections, 4 equations, 1 figure, 7 tables, 2 algorithms.

Figures (1)

  • Figure 1: The layout of the true cluster distribution (a), along with the results of K-means (b), Unified K-means (c), and K-Mahal (d) using KNN imputation. Class 1 (orange squares) and Class 2 (blue triangles) are generated with $\check{\omega} = 0.1$. Incomplete observations (1.5%) are indicated by hollow squares for Class 1 and hollow triangles for Class 2. Misclassified points are marked with circles.