Classification EM-PCA for clustering and embedding
Zineddine Tighidet, Lazhar Labiod, Mohamed Nadif
TL;DR
We address the challenge of clustering high-dimensional data by proposing CEM-PCA, a unified framework that jointly learns a low-dimensional embedding and cluster assignments. The method optimizes a regularized objective that combines PCA-based embedding with a Classification EM clustering term and can incorporate graph Laplacian regularization to respect data geometry. Empirically, CEM-PCA outperforms baselines across synthetic, image, biomedical, and text datasets in both clustering and embedding quality, and it subsumes several state-of-the-art approaches as special cases or via regularized variants. The work advances representation learning by enabling simultaneous dimensionality reduction and clustering, with practical implications for scalable unsupervised learning and interpretable embeddings.
Abstract
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used, and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating the parameters from which the clustering is inferred. Although these models are popular in various domains, including image clustering, they suffer from high dimensionality and from the slow convergence of the EM algorithm. The Classification EM (CEM) algorithm, a classification version of EM, converges quickly, but dimensionality reduction remains a challenge. In this paper we therefore propose an algorithm that performs the two tasks, data embedding and clustering, simultaneously rather than sequentially, relying on Principal Component Analysis (PCA) and CEM. We demonstrate the benefit of this approach in terms of both clustering and data embedding. We also establish connections with other clustering approaches.
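To make the CEM building block concrete, the following is a minimal sketch of plain Classification EM for a Gaussian mixture. It is not the CEM-PCA algorithm itself: it assumes, for simplicity, spherical components with a single shared variance, it works directly on the input data without any embedding step, and the function name `cem_spherical` and its parameters are hypothetical choices made for illustration.

```python
import numpy as np


def cem_spherical(X, k, init_means=None, n_iter=100, seed=0):
    """Illustrative CEM for a spherical, shared-variance Gaussian mixture."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    if init_means is None:
        # Assumption: initialise the means with k distinct data points.
        means = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        means = np.array(init_means, dtype=float)
    props = np.full(k, 1.0 / k)      # mixing proportions
    var = X.var() + 1e-12            # shared variance, guarded against zero
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # E-step: log posterior of each component up to a constant
        # (the shared-variance normalising term is identical for all components).
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_post = np.log(props)[None, :] - 0.5 * sq_dist / var
        # C-step: replace soft posteriors by a hard partition.
        new_labels = log_post.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                    # partition is stable
        labels = new_labels
        # M-step: re-estimate parameters from the hard partition.
        counts = np.bincount(labels, minlength=k)
        for j in range(k):
            if counts[j] > 0:        # keep the old mean for an empty cluster
                means[j] = X[labels == j].mean(axis=0)
        props = np.clip(counts / n, 1e-12, None)
        var = ((X - means[labels]) ** 2).sum() / (n * d) + 1e-12
    return labels, means
```

The C-step is what distinguishes CEM from EM: each point is assigned to its most probable component before the parameters are updated, which typically leads to a stable partition in few iterations. Applying this routine to PCA scores computed beforehand would give the sequential pipeline that the abstract contrasts with the proposed simultaneous approach.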
