Regularization and Optimization in Model-Based Clustering

Raphael Araujo Sampaio; Joaquim Dias Garcia; Marcus Poggi; Thibaut Vidal

TL;DR

The paper addresses the limitations of both k-means variants and general GMMs in unsupervised clustering by tackling ill-conditioned covariance estimation and the abundance of local optima. It introduces a Hybrid Genetic EM with Regularization (HGS), which combines a population-based search (with crossover via Hungarian matching and mutation) and a regularized EM that updates means, covariances, and mixture weights, using covariance shrinkage strategies such as Shrunk, Ledoit-Wolf, and OAS. Empirical results on synthetic and UCI data show that the joint use of regularization and advanced optimization yields substantial improvements in clustering accuracy (ARI) over both standard GMMs and k-means, often outperforming HG-means. The work provides open-source Julia packages and demonstrates the practical potential of pursuing general GMMs for data exploration, with promising directions for faster algorithms and domain-specific adaptations.
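To make the shrinkage idea concrete, here is a minimal sketch in plain Python of the simplest strategy named above, a fixed-coefficient shrunk covariance. It assumes the common convention Σ_shrunk = (1 - α) Σ + α (tr(Σ)/d) I. The function names and the value α = 0.1 are illustrative assumptions, not the API of RegularizedCovarianceMatrices.jl. Ledoit-Wolf and OAS differ in that they estimate α from the data rather than fixing it.

```python
def empirical_covariance(samples):
    """Maximum-likelihood covariance (divides by n) of a list of d-dimensional points."""
    n = len(samples)
    d = len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for x in samples:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (x[i] - mean[i]) * (x[j] - mean[j]) / n
    return cov


def shrunk_covariance(cov, alpha=0.1):
    """Blend cov with a scaled identity: (1 - alpha) * cov + alpha * (trace / d) * I."""
    d = len(cov)
    mu = sum(cov[i][i] for i in range(d)) / d  # average variance, keeps the trace unchanged
    return [
        [(1 - alpha) * cov[i][j] + (alpha * mu if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]


def det2(m):
    """Determinant of a 2x2 matrix, used here only to check invertibility."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Two collinear points: the empirical covariance is singular (hypothetical toy data).
points = [(0.0, 0.0), (2.0, 2.0)]
sigma = empirical_covariance(points)
sigma_reg = shrunk_covariance(sigma, alpha=0.1)
```

In this toy case the empirical covariance has zero determinant, so the Gaussian density is undefined, while the shrunk matrix is invertible and has the same trace. This is the kind of ill-conditioning a regularized EM step is meant to avoid when a cluster holds few points.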

Abstract

Due to their conceptual simplicity, k-means algorithm variants have been extensively used for unsupervised cluster analysis. However, one main shortcoming of these algorithms is that they essentially fit a mixture of identical spherical Gaussians to data that vastly deviates from such a distribution. In comparison, general Gaussian Mixture Models (GMMs) can fit richer structures but require estimating a quadratic number of parameters per cluster to represent the covariance matrices. This poses two main issues: (i) the underlying optimization problems are challenging due to their larger number of local minima, and (ii) their solutions can overfit the data. In this work, we design search strategies that circumvent both issues. We develop more effective optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting. Through extensive computational analyses, we observe that optimization or regularization in isolation does not substantially improve cluster recovery. However, combining these techniques permits a completely new level of performance previously unachieved by k-means algorithm variants, unraveling vastly different cluster structures. These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration. To facilitate such applications, we provide open-source code as well as Julia packages (UnsupervisedClustering.jl and RegularizedCovarianceMatrices.jl) implementing the proposed techniques.
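As a rough illustration of the parameter growth mentioned in the abstract, the sketch below counts free parameters for a k-component mixture in d dimensions. A symmetric covariance matrix has d(d+1)/2 free entries per cluster, whereas k-means estimates only the cluster centers. The function names are hypothetical, and the counting convention (means, covariances, and k - 1 free mixture weights) is a standard one assumed here rather than taken from the paper.

```python
def full_gmm_parameters(k, d):
    """Free parameters of a k-component GMM with unrestricted covariances."""
    means = k * d
    covariances = k * d * (d + 1) // 2  # symmetric d x d matrix per cluster
    weights = k - 1  # mixture weights sum to one
    return means + covariances + weights


def kmeans_parameters(k, d):
    """k-means only estimates the k cluster centers."""
    return k * d


# Compare the two counts for three clusters as the dimension grows.
for d in (2, 10, 50):
    print(d, kmeans_parameters(3, d), full_gmm_parameters(3, d))
```

With three clusters, the general GMM count grows from 17 parameters at d = 2 to 3977 at d = 50, against 6 and 150 for the centers alone. This gap is why overfitting becomes a concern as the dimension grows.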

Paper Structure

This paper contains 14 sections, 10 equations, 13 figures, 10 tables, and 2 algorithms.

Introduction
Fundamental Notions and Related Studies
A Hybrid Genetic EM with Regularization
Solution Representation and Population Initialization
Solution Generation by Crossover and Mutation
Local Search with a Regularized EM
Population Management
Computational Experiments
Datasets and Experimental Setup
Impact of Regularization
Combining Regularization and Optimization
Comparisons with k-means and HG-means
Performance on UCI Datasets
Conclusion

Figures (13)

Figure 1: Diagram representing the workflow of the binary tournament
Figure 2: Crossover operator: (a) and (b) represent the parents; (c) solution of the matching step; (d) offspring obtained after retaining one cluster for each matched pair
Figure 3: Mutation operator: (a) current solution; (b) selection of a random cluster (red $\times$ mark) and random data sample (blue $+$ mark); (c) reallocation of the cluster; (d) resulting solution after local search
Figure 4: Successive solutions in the EM-GMM
Figure 5: Behavior of the empirical and the regularized covariance estimations
...and 8 more figures
