Classification EM-PCA for clustering and embedding

Zineddine Tighidet, Lazhar Labiod, Mohamed Nadif

TL;DR

We address the challenge of clustering high-dimensional data by proposing CEM-PCA, a unified framework that jointly learns a low-dimensional embedding and cluster assignments. The method optimizes a regularized objective that combines PCA-based embedding with a Classification EM clustering term and can incorporate graph Laplacian regularization to respect data geometry. Empirically, CEM-PCA outperforms baselines across synthetic, image, biomedical, and text datasets in both clustering and embedding quality, and it subsumes several state-of-the-art approaches as special cases or via regularized variants. The work advances representation learning by enabling simultaneous dimensionality reduction and clustering, with practical implications for scalable unsupervised learning and interpretable embeddings.

Abstract

The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used, and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating the parameters from which the clustering is inferred. Although these models are popular in various domains, including image clustering, they suffer from high dimensionality and from the slow convergence of the EM algorithm. The Classification EM (CEM) algorithm, a classifying version of EM, converges quickly, but dimensionality reduction remains a challenge. We therefore propose an algorithm that performs the two tasks, data embedding and clustering, simultaneously rather than sequentially, relying on Principal Component Analysis (PCA) and CEM. We demonstrate the benefit of this approach in terms of both clustering and data embedding, and we establish connections with other clustering approaches.

Paper Structure

This paper contains 37 sections, 1 theorem, 32 equations, 7 figures, 2 tables, 3 algorithms.

Key Result

Proposition 1

Let $\mathbf{X}_{n \times d}$, $\mathbf{Q}_{d \times k}$ and $\mathbf{M}_{n \times k}$ be three matrices. Consider the constrained optimization problem of Eq. (eq:TH1). Its solution comes from the singular value decomposition (SVD) of $(\mathbf{X}\mathbf{Q} + \delta \mathbf{M})$: if $U D V^\top$ is the SVD of $(\mathbf{X}\mathbf{Q} + \delta \mathbf{M})$, then $\mathbf{B}_{*} = UV^\top$.
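The closed form in Proposition 1 can be checked numerically. The sketch below reads the matrix in the proposition as $\mathbf{X}\mathbf{Q} + \delta \mathbf{M}$, the only reading consistent with the stated dimensions. Since Eq. (eq:TH1) is not reproduced on this page, the objective used here, $\|\mathbf{X} - \mathbf{B}\mathbf{Q}^\top\|^2 + \delta \|\mathbf{B} - \mathbf{M}\|^2$ subject to $\mathbf{B}^\top\mathbf{B} = \mathbf{I}$ with orthonormal $\mathbf{Q}$, is an assumption, chosen because its minimizer is the standard orthogonal Procrustes solution. The function names and the toy data are hypothetical.

```python
import numpy as np

def procrustes_embedding(X, Q, M, delta):
    """Return B* = U V^T, where U D V^T is the thin SVD of (X Q + delta M)."""
    A = X @ Q + delta * M                      # n x k matrix from the proposition
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                              # n x k with orthonormal columns

def assumed_objective(B, X, Q, M, delta):
    # ASSUMPTION: this form of the objective is not given on this page.
    return np.linalg.norm(X - B @ Q.T) ** 2 + delta * np.linalg.norm(B - M) ** 2

# Toy data (hypothetical sizes and values).
rng = np.random.default_rng(0)
n, d, k, delta = 50, 10, 3, 0.5
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns, Q^T Q = I
M = rng.normal(size=(n, k))

B = procrustes_embedding(X, Q, M, delta)
A = X @ Q + delta * M

# Compare against a random feasible point (orthonormal columns).
B_rand, _ = np.linalg.qr(rng.normal(size=(n, k)))
trace_opt = np.trace(B.T @ A)
trace_rand = np.trace(B_rand.T @ A)
obj_opt = assumed_objective(B, X, Q, M, delta)
obj_rand = assumed_objective(B_rand, X, Q, M, delta)
```

Under the assumed objective, minimizing over $\mathbf{B}$ is equivalent to maximizing $\mathrm{Tr}(\mathbf{B}^\top(\mathbf{X}\mathbf{Q} + \delta\mathbf{M}))$, so `trace_opt` equals the sum of the singular values of `A` and `obj_opt` is no larger than `obj_rand`.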

Figures (7)

  • Figure 1: PCA projection of the Chang dataset onto the plane spanned by the first and second principal components (left) and the plane spanned by the first and fifteenth principal components (right).
  • Figure 2: Comparison between the clustering of K-means (left) and CEM-PCA (right) on the Chang data, using the PCA components for K-means and the data embedding $\mathbf{B}$ obtained by CEM-PCA. Black points represent 22% misclassified objects.
  • Figure 3: Diagram illustrating the steps of the proposed algorithm (CEM-PCA).
  • Figure 4: Comparison between the clustering of K-means-PCA-2 and CEM-PCA, and the representation of the data embedding $\mathbf{B}$. The methods were applied to the FCPS datasets and the plots were obtained using UMAP (black points represent misclassified objects). K-means-PCA-2 results from applying K-means to the first two components of PCA, CEM-PCA is our proposed method, and $\mathbf{B}$ is the data embedding obtained by CEM-PCA (see Algorithm \ref{alg:model_based_co_clustering:LBCEM}). This figure highlights the advantage of applying dimension reduction and clustering simultaneously (CEM-PCA) rather than sequentially (K-means-PCA-2). The embedding space of CEM-PCA (represented by $\mathbf{B}$ in the figure) perfectly captures the separation of the clusters, making it easy for the CEM algorithm to perform clustering.
  • Figure 5: Comparison of all models against each other with the Nemenyi test. Groups of models that are not significantly different (at $\rho = 0.10$) are connected (based on NMI).
  • ...and 2 more figures
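The sequential baseline named in the Figure 4 caption, K-means applied to the first two PCA components, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the initialization, iteration cap, and toy data are assumptions, and the helper names `pca_project` and `kmeans` are hypothetical.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; centers start at k distinct random data points."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(60, 5))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(60, 5))
X = np.vstack([blob_a, blob_b])

# Step 1: reduce to two components. Step 2: cluster in the reduced space.
Z = pca_project(X, n_components=2)
labels, centers = kmeans(Z, k=2)
```

Because the projection is fixed before clustering starts, the clustering step cannot influence which directions are kept; this is the limitation that the caption contrasts with the joint approach of CEM-PCA.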

Theorems & Definitions (2)

  • Proposition 1
  • Proof 1