Deep Clustering of Tabular Data by Weighted Gaussian Distribution Learning

Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad

TL;DR

This work addresses the gap in deep clustering for tabular data by replacing the common t-distribution assumption with a learnable mixture of multivariate Gaussians in the autoencoder latent space. G-CEALS jointly optimizes reconstruction and clustering losses, treats cluster centroids, covariances, and weights as trainable parameters, and updates a dynamic target distribution instead of relying on a fixed closed-form target. The method achieves average ranks of 2.9 (1.7) in ACC and 2.8 (1.7) in ARI across 16 OpenML-CC18 datasets, ahead of nine other clustering methods, and exhibits favorable time complexity relative to other deep clustering baselines. By explicitly modeling cluster imbalance and tailoring optimization to tabular data statistics, G-CEALS offers a practical, scalable approach to unsupervised clustering of heterogeneous tabular data, with the potential to outperform traditional methods in many settings.

Abstract

Deep learning methods are primarily proposed for supervised learning of images or text with limited applications to clustering problems. In contrast, tabular data with heterogeneous features pose unique challenges in representation learning, where deep learning has yet to replace traditional machine learning. This paper addresses these challenges in developing one of the first deep clustering methods for tabular data: Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS). G-CEALS is an unsupervised deep clustering framework for learning the parameters of multivariate Gaussian cluster distributions by iteratively updating individual cluster weights. The G-CEALS method presents average rank orderings of 2.9(1.7) and 2.8(1.7) based on clustering accuracy and adjusted Rand index (ARI) scores on sixteen tabular data sets, respectively, and outperforms nine state-of-the-art clustering methods. G-CEALS substantially improves clustering performance compared to traditional K-means and GMM, which are still de facto methods for clustering tabular data. Similar computationally efficient and high-performing deep clustering frameworks are imperative to reap the myriad benefits of deep learning on tabular data over traditional machine learning.
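
The weighted Gaussian assignment described in the abstract can be illustrated with a minimal sketch. It assumes that the soft assignment of a latent point is the standard mixture responsibility, that is, the cluster weight times the Gaussian density, normalized over clusters; the exact G-CEALS formulation is not given in this summary. The function names and the restriction to a 2-D latent space are illustrative choices, not taken from the paper.

```python
import math


def gaussian_pdf_2d(z, mu, cov):
    """Density of a 2-D Gaussian with full covariance `cov` at point `z`.

    Assumption: 2-D latent space, so the 2x2 inverse has a closed form.
    `cov` must be positive definite (determinant > 0).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = z[0] - mu[0], z[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def soft_assignments(z, mus, covs, weights):
    """Weighted Gaussian responsibilities of latent point `z` (hypothetical helper)."""
    scores = [w * gaussian_pdf_2d(z, mu, cov)
              for mu, cov, w in zip(mus, covs, weights)]
    total = sum(scores)
    return [s / total for s in scores]


# Two clusters with unequal weights, mimicking cluster imbalance.
mus = [(0.0, 0.0), (3.0, 3.0)]
covs = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (0.0, 1.0))]
weights = [0.7, 0.3]

p_near_first = soft_assignments((0.0, 0.0), mus, covs, weights)
p_midpoint = soft_assignments((1.5, 1.5), mus, covs, weights)
```

With equal covariances, a point equidistant from both centroids receives assignments equal to the cluster weights, which shows how learnable weights can express cluster imbalance in the assignment step.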

Paper Structure (27 sections, 13 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Two-dimensional embeddings of high-dimensional image features extracted from a deep convolutional neural network, obtained from Arefin2021.
  • Figure 2: Proposed deep clustering framework for tabular data. All samples of an unlabeled tabular data set are used to train the autoencoder in tandem with two subnetworks: a clustering module and an MLP head with a softmax output layer. The final cluster distribution (P) and assignments are obtained after the clustering module. The final cluster assignments are evaluated using ACC, ARI, and NMI performance metrics.
  • Figure 3: The reconstruction and clustering losses obtained on the tabular data set with ID 1510 for two $\gamma$ values. A higher $\gamma$ value results in faster convergence of the clustering loss but slows convergence of the reconstruction loss. However, a lower value is preferred to ensure smooth convergence of the cluster parameters and autoencoder weights.
  • Figure 4: t-SNE visualization of deep clustering of data set ID 458 with and without early stopping. While the deep clustering method facilitates cluster separation, it may merge the minority clusters due to the cluster imbalance in tabular data.
  • Figure 5: Convergence of cluster centroids ($\mu_j$), cluster covariances ($\Sigma_j$), and cluster weights ($\omega_j$) for two clusters using dataset ID 1510. Here, $t$ represents epoch, and $j$ is the cluster index.
  • ...and 1 more figures
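
The ACC metric named in the Figure 2 caption requires matching arbitrary cluster indices to class labels before counting correct assignments. The sketch below is a minimal illustration, assuming the usual definition of clustering accuracy as the best accuracy over one-to-one mappings from clusters to classes. It searches all mappings by brute force, which is only practical for a handful of clusters; the Hungarian algorithm is the usual choice in practice. The function name is hypothetical and not taken from the paper.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between cluster indices and class labels.

    Assumption: there are no more clusters than classes, so every
    cluster can be mapped to a distinct class.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best_hits = 0
    # Try every one-to-one mapping of clusters onto classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)


# Cluster indices are permuted relative to the labels; one sample is misassigned.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 1]
acc = clustering_accuracy(y_true, y_pred)
```

In this example five of the six samples agree with their labels under the best mapping (cluster 1 to class 0, cluster 0 to class 1, cluster 2 to class 2).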