Deep Clustering via Distribution Learning

Guanfang Dong, Zijie Tan, Chenqiu Zhao, Anup Basu

TL;DR

Deep Clustering via Distribution Learning (DCDL) addresses the lack of theoretical analysis connecting clustering and distribution learning by modeling the dataset as a mixture of distributions and treating clustering as a redistribution of this prior. It builds a clustering-oriented distribution learning method, MCMarg-C, that initializes the GMM means with k-means, imposes a Gaussian Mixture Weight Standard Deviation Loss $L_{GMM-WSD}$, and minimizes a KL-divergence objective via Monte-Carlo marginalization along random vectors, all within an end-to-end deep clustering pipeline. The pipeline uses an autoencoder for dimensionality reduction and UMAP for manifold embedding, then obtains cluster labels by optimizing a differentiable objective that couples $\mathcal{L}_{KL}$ with $L_{GMM-WSD}$: $\mathcal{L}(\mathbf{x},\mathbf{w},\Theta)=\mathcal{L}_{KL}(q(\mathbf{x}),\mathbf{w}^T\mathbf{\Psi}(\mathbf{x};\Theta)) + c\,L_{GMM-WSD}$. Empirically, DCDL with MCMarg-C achieves state-of-the-art or competitive results on MNIST, FashionMNIST, USPS, and Pendigits, remains robust in high dimensions, and produces well-balanced clusters with improved accuracy and information metrics.
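To make the idea of "KL divergence via Monte-Carlo marginalization along random vectors" concrete, here is a minimal sketch, not the authors' implementation. It assumes diagonal covariances and a histogram estimate of the projected data density. It uses the standard fact that a Gaussian $N(\mu, \Sigma)$ projected onto a unit vector $v$ is the 1-D Gaussian $N(v^T\mu, v^T\Sigma v)$. All function names (`random_unit_vector`, `marginal_gmm_pdf`, `marginal_kl`, `mc_objective`) are hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction on the unit sphere by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(point, v):
    """Scalar projection of a point onto direction v."""
    return sum(p * x for p, x in zip(point, v))

def marginal_gmm_pdf(t, weights, means, variances, v):
    """Density at scalar t of the GMM marginalized onto direction v.

    Assumption: diagonal covariances, so N(mu, diag(var)) projects to
    the 1-D Gaussian N(v.mu, sum_i v_i^2 * var_i).
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        m = project(mu, v)
        s2 = sum(vi * vi * si for vi, si in zip(v, var))
        total += w * math.exp(-0.5 * (t - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)
    return total

def marginal_kl(data, weights, means, variances, v, bins=30, eps=1e-12):
    """Histogram estimate of KL(q || p) between projected data and projected GMM."""
    proj = [project(x, v) for x in data]
    lo, hi = min(proj), max(proj)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for t in proj:
        counts[min(int((t - lo) / width), bins - 1)] += 1
    kl = 0.0
    for i, c in enumerate(counts):
        if c == 0:
            continue  # empty bins contribute nothing to KL(q || p)
        q = c / (len(proj) * width)      # empirical density in bin i
        center = lo + (i + 0.5) * width  # evaluate the model at the bin center
        p = marginal_gmm_pdf(center, weights, means, variances, v)
        kl += q * math.log(q / (p + eps)) * width
    return kl

# Toy data: two well-separated 2-D clusters (illustrative only).
rng = random.Random(0)
data = ([[rng.gauss(-3, 0.5), rng.gauss(0, 0.5)] for _ in range(300)]
        + [[rng.gauss(3, 0.5), rng.gauss(0, 0.5)] for _ in range(300)])
directions = [random_unit_vector(2, rng) for _ in range(20)]

def mc_objective(weights, means, variances):
    """Average the 1-D KL estimates over the sampled random directions."""
    return sum(marginal_kl(data, weights, means, variances, v)
               for v in directions) / len(directions)

var = [[0.25, 0.25], [0.25, 0.25]]
good = mc_objective([0.5, 0.5], [[-3, 0], [3, 0]], var)  # means on the clusters
bad = mc_objective([0.5, 0.5], [[0, 3], [0, -3]], var)   # means off the clusters
print(good < bad)
```

In this toy setup, the mixture whose means sit on the data clusters should score a lower averaged divergence than the misplaced one. A gradient-based optimizer would move the parameters in that direction.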

Abstract

Distribution learning finds probability density functions from a set of data samples, whereas clustering aims to group similar data points to form clusters. Although some deep clustering methods employ distribution learning, past work still lacks a theoretical analysis of the relationship between clustering and distribution learning. In this work, we therefore provide a theoretical analysis to guide the optimization of clustering via distribution learning. To achieve better results, we incorporate deep clustering, guided by this theoretical analysis. Furthermore, distribution learning methods cannot always be applied directly to data. To overcome this issue, we introduce a clustering-oriented distribution learning method called Monte-Carlo Marginalization for Clustering. We integrate Monte-Carlo Marginalization for Clustering into Deep Clustering, resulting in Deep Clustering via Distribution Learning (DCDL). The proposed DCDL achieves promising results compared to state-of-the-art methods on popular datasets. On clustering tasks, the new distribution learning method also outperforms previous methods.

Paper Structure

This paper contains 16 sections, 18 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The Relationship between Clustering and Distribution Learning. (a): Gray points represent the data to be clustered. (b): The process of distribution learning. We consider each data point to be sampled from an underlying distribution, shown as Step 0, where each point has a distinct color. Then, to formulate an explicit expression of the distribution with cluster information, we iteratively redistribute the model components and align them with the underlying prior distribution, as shown from Step 1 to the last step. This optimization objective aligns with clustering.
  • Figure 2: Pipeline of Deep Clustering via Distribution Learning (DCDL). The symbols depicted in the figure are explained in the paper's algorithm. Different colors in subfigures (a), (b), and (c) represent different labels in the MNIST dataset. The arrows in (c) represent the direction of marginalization in Monte Carlo Marginalization for Clustering (MCMarg-C).
  • Figure 3: Visualizing the latent space of the MNIST dataset produced by the autoencoder, with and without UMAP. We visualize the projection onto the plane spanned by latent dimensions 0 and 1. The latent space transformed by UMAP leaves the regions between different labels sparser and concentrates the points within each label more densely.
  • Figure 4: Visual Comparison of MCMarg and MCMarg-C Clustering Results. Each row is a separate control group, with the MCMarg results on the left and the MCMarg-C results on the right. Each cluster is shown in a different color. The pie chart shows the proportion of points belonging to each cluster. MCMarg-C produces a more uniform clustering pattern, while MCMarg tends to use a smaller number of Gaussians to describe the data distribution.
  • Figure 5: DCDL Error Cluster Examples on the MNIST Dataset. Real Label is the true label of the images on the right. Incorrect Cluster Visualization shows mis-clustered examples, with the label assigned by DCDL shown above each image. For Human Accuracy, we collected annotations from three individuals, presenting the images in random order. Accuracy reflects the agreement between the human annotations and the ground-truth labels in the dataset.
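
The contrast described in the Figure 4 caption (uniform cluster proportions versus a few dominant Gaussians) can be illustrated with a small sketch of a weight-balance penalty. This summary does not give the exact form of the weight standard deviation loss, so the sketch assumes the simplest reading of its name: the population standard deviation of the mixture weights, added to a KL term with a coefficient `c`. The function names and example weights are hypothetical.

```python
import statistics

def weight_std_loss(weights):
    """Assumed form of the balance penalty: population std of the mixture weights."""
    return statistics.pstdev(weights)

def total_loss(kl_term, weights, c):
    """Couple a precomputed KL term with the weight penalty, scaled by c."""
    return kl_term + c * weight_std_loss(weights)

# Ten components with equal weights: the penalty vanishes.
balanced = [0.1] * 10
# Two components dominate and eight are nearly unused: the penalty is large.
collapsed = [0.46, 0.46] + [0.01] * 8

print(weight_std_loss(balanced), weight_std_loss(collapsed))
print(total_loss(0.2, balanced, c=1.0) < total_loss(0.2, collapsed, c=1.0))
```

Under this assumed form, a mixture that describes the data with only a few heavy components pays a larger penalty at equal KL, which pushes the optimizer toward the more uniform cluster sizes the caption attributes to MCMarg-C.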