A simulation study of cluster search algorithms in data set generated by Gaussian mixture models

Ryosuke Motegi, Yoichi Seki

TL;DR

This study examines centroid- and model-based cluster search algorithms in various cases that Gaussian mixture models (GMMs) can generate, and shows that some cluster-splitting criteria based on Euclidean distance make unreasonable decisions when clusters overlap.

Abstract

Determining the number of clusters is a fundamental issue in data clustering. Several algorithms have been proposed, including centroid-based algorithms using the Euclidean distance and model-based algorithms using a mixture of probability distributions. Among these, greedy algorithms that search for the number of clusters by repeatedly splitting or merging clusters have advantages in computation time for problems with large sample sizes. However, systematic evaluation experiments comparing these methods are still lacking. This study examines centroid- and model-based cluster search algorithms in various cases that Gaussian mixture models (GMMs) can generate. The cases are generated by combining five factors: dimensionality, sample size, the number of clusters, cluster overlap, and covariance type. The results show that some cluster-splitting criteria based on Euclidean distance make unreasonable decisions when clusters overlap. The results also show that model-based algorithms are less sensitive to covariance type and cluster overlap than centroid-based algorithms, provided the sample size is sufficient. Our cluster search implementation code is available at https://github.com/lipryou/searchClustK
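
To make the experimental design concrete, the following is a minimal sketch of how cases could be built by crossing the five factors and then sampling a data set from a GMM for one case. It is not the authors' code. The factor levels, the `SPREAD` mapping from overlap level to mean placement, and the helper names `full_factorial` and `sample_gmm` are all assumptions made for illustration. In particular, overlap is only crudely imitated here by placing the component means in a smaller or larger cube.

```python
import itertools
import random

# Hypothetical factor levels, for illustration only (not the paper's levels).
FACTORS = {
    "p": [2, 10],                           # dimensionality
    "n": [300, 3000],                       # sample size
    "K": [2, 5],                            # number of clusters
    "overlap": ["low", "high"],             # cluster overlap
    "cov_type": ["spherical", "diagonal"],  # covariance type
}

# Assumed stand-in for overlap control: side length of the cube
# in which component means are placed (smaller cube, more overlap).
SPREAD = {"low": 10.0, "high": 3.0}


def full_factorial(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in itertools.product(*factors.values())]


def sample_gmm(case, seed=0):
    """Draw (points, labels) from an equal-weight GMM described by `case`."""
    rng = random.Random(seed)
    p, n, K = case["p"], case["n"], case["K"]
    side = SPREAD[case["overlap"]]
    means = [[rng.uniform(0.0, side) for _ in range(p)] for _ in range(K)]
    if case["cov_type"] == "spherical":
        sds = [[1.0] * p for _ in range(K)]
    else:  # diagonal: a separate standard deviation per coordinate
        sds = [[rng.uniform(0.5, 1.5) for _ in range(p)] for _ in range(K)]
    points, labels = [], []
    for _ in range(n):
        k = rng.randrange(K)  # pick a component with equal probability
        points.append([rng.gauss(m, s) for m, s in zip(means[k], sds[k])])
        labels.append(k)
    return points, labels


cases = full_factorial(FACTORS)
points, labels = sample_gmm(cases[0])
print(len(cases), len(points), len(points[0]))
```

With two levels per factor, the crossing yields 2^5 = 32 cases. A real study would replace the cube-based placement with a principled overlap measure and would typically replicate each case several times.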

Paper Structure (24 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Box plots of the response variable, the probit-transformed cARI. Each panel corresponds to one factor combination selected as a three-factor interaction.
  • Figure 2: Effect plot. (a) Main effects with confidence intervals. Black points indicate effects that are not significant. (b) Interaction effects between methods and $p$. (c) Three-factor interaction effects.
  • Figure 3: Example of a typical case in which G-means does not work. The scatter plot shows two clusters obtained by $K$-means applied to the data set ($p=2$). The upper panel is the Q--Q plot of the red cluster, computed from the G-means projection.
  • Figure 4: Histogram of Euclidean distances, computed from $n=3000$ samples drawn from a two-component GMM with spherical, homogeneous covariance ($\bar{\omega} = 0.01$). The curve is the probability density function of a mixture of $\chi_p$ and $\chi_p(\lambda)$, where $\lambda = 3.4628$ in these cases.
  • Figure 5: Computation time and ARI. Each point represents the ARI of the result and the elapsed time taken by the algorithm to find the number of clusters for each data set. The representative value for each data set was chosen as described in Section \ref{sec: simulation}.
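
The mixture of $\chi_p$ and $\chi_p(\lambda)$ in the Figure 4 caption can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' procedure. It assumes $p=2$, unit variance, equal mixing weights, and distances measured from the mean of the first component, none of which the caption specifies. Under those assumptions, a point from the first component lies at a $\chi_p$-distributed distance, and a point from the second component, whose mean is $\lambda$ away, lies at a noncentral $\chi_p(\lambda)$-distributed distance. The helper `distances_to_first_mean` is a hypothetical name.

```python
import math
import random


def distances_to_first_mean(n, p, lam, seed=0):
    """Sample an equal-weight two-component spherical GMM with unit variance
    and return each point's Euclidean distance to the first component mean.

    The first mean is the origin; the second mean is `lam` away from it,
    so `lam` plays the role of the noncentrality parameter.
    """
    rng = random.Random(seed)
    mean_a = [0.0] * p
    mean_b = [lam] + [0.0] * (p - 1)
    dists = []
    for _ in range(n):
        mean = mean_a if rng.random() < 0.5 else mean_b
        x = [rng.gauss(m, 1.0) for m in mean]
        dists.append(math.sqrt(sum(v * v for v in x)))
    return dists


p, lam, n = 2, 3.4628, 3000  # lam and n from the caption; p is an assumption
d = distances_to_first_mean(n, p, lam)

# Squared distances have mean p (central part) or p + lam^2 (noncentral part),
# so the equal-weight mixture has mean p + lam^2 / 2.
mean_sq = sum(v * v for v in d) / n
expected = p + 0.5 * lam ** 2
print(round(mean_sq, 2), round(expected, 2))
```

The sample mean of the squared distances should fall close to `expected`, which is a quick consistency check that the simulated distances follow the two-part mixture rather than a single $\chi_p$ distribution.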