Deep Clustering via Distribution Learning
Guanfang Dong, Zijie Tan, Chenqiu Zhao, Anup Basu
TL;DR
Deep Clustering via Distribution Learning (DCDL) addresses the lack of theoretical grounding for combining clustering with distribution learning. It models the dataset as a mixture of distributions and links clustering to a redistribution of this prior. On this basis it builds a clustering-optimized distribution learning method, MCMarg-C, which initializes the GMM means via k-means, imposes a Gaussian Mixture Weight Standard Deviation Loss $L_{GMM-WSD}$, and minimizes a KL-divergence objective via Monte-Carlo marginalization along random vectors. The full pipeline uses an autoencoder for dimensionality reduction and UMAP for manifold embedding, then obtains clustering labels by optimizing a differentiable objective that couples $\mathcal{L}_{KL}$ with $L_{GMM-WSD}$: $\mathcal{L}(\mathbf{x},\mathbf{w},\Theta)=\mathcal{L}_{KL}(q(\mathbf{x}),\mathbf{w}^T\mathbf{\Psi}(\mathbf{x};\Theta)) + c\,L_{GMM-WSD}$. Empirically, DCDL with MCMarg-C achieves state-of-the-art or competitive results on MNIST, FashionMNIST, USPS, and Pendigits, is robust to high dimensionality, and produces well-balanced clusters with improved accuracy and information metrics.
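To make the combined objective concrete, the following NumPy sketch evaluates a loss of the form $\mathcal{L}_{KL} + c\,L_{GMM-WSD}$ for a fixed GMM. It is an illustration under stated assumptions, not the authors' implementation: the histogram density estimate, the function names, the number of directions and bins, and the default `c` are all hypothetical choices, and $L_{GMM-WSD}$ is read literally as the standard deviation of the mixture weights. The sketch relies only on the standard fact that the marginal of a GMM along a unit vector $\mathbf{v}$ is a 1-D GMM with means $\mathbf{v}^T\boldsymbol{\mu}_k$ and variances $\mathbf{v}^T\Sigma_k\mathbf{v}$. It omits the k-means initialization, the autoencoder, UMAP, and gradient-based optimization.

```python
import numpy as np

def project_gmm(means, covs, v):
    """Marginalize a GMM onto unit vector v: 1-D means and variances."""
    mu_1d = means @ v                               # shape (K,)
    var_1d = np.einsum("i,kij,j->k", v, covs, v)    # shape (K,)
    return mu_1d, var_1d

def gmm_pdf_1d(t, weights, mu, var):
    """Density of a 1-D Gaussian mixture evaluated at points t."""
    diff = t[:, None] - mu[None, :]
    comp = np.exp(-0.5 * diff**2 / var) / np.sqrt(2.0 * np.pi * var)
    return comp @ weights

def mc_marginal_kl(x, weights, means, covs, n_dirs=8, n_bins=32,
                   rng=None, eps=1e-8):
    """Average KL(data || GMM) over random 1-D marginals.

    Assumption: the data marginal is estimated with a histogram.
    """
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_dirs):
        v = rng.normal(size=x.shape[1])
        v /= np.linalg.norm(v)                      # random unit direction
        hist, edges = np.histogram(x @ v, bins=n_bins, density=True)
        centers = 0.5 * (edges[:-1] + edges[1:])
        width = edges[1] - edges[0]
        mu, var = project_gmm(means, covs, v)
        p = gmm_pdf_1d(centers, weights, mu, var)
        mask = hist > 0                             # skip empty bins
        total += np.sum(hist[mask] * np.log(hist[mask] / (p[mask] + eps))) * width
    return total / n_dirs

def weight_std_loss(weights):
    """Assumed form of L_GMM-WSD: standard deviation of mixture weights."""
    return float(np.std(weights))

def total_loss(x, weights, means, covs, c=1.0, **kl_kwargs):
    """Combined objective: KL term plus c times the weight-spread penalty."""
    return mc_marginal_kl(x, weights, means, covs, **kl_kwargs) + c * weight_std_loss(weights)
```

Under this reading, the weight penalty is zero when all components carry equal weight and grows as the weights become uneven, which is consistent with the well-balanced clusters reported above. In a trainable version the same computation would be written in an autodiff framework so that the weights, means, and covariances can be updated by gradient descent.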
Abstract
Distribution learning estimates probability density functions from a set of data samples, whereas clustering groups similar data points into clusters. Although some deep clustering methods employ distribution learning, prior work lacks a theoretical analysis of the relationship between clustering and distribution learning. In this work, we provide such an analysis and use it to guide the optimization of clustering via distribution learning. To achieve better results, we incorporate deep clustering, guided by this theoretical analysis. Furthermore, distribution learning methods cannot always be applied directly to data. To overcome this issue, we introduce a clustering-oriented distribution learning method called Monte-Carlo Marginalization for Clustering. Integrating Monte-Carlo Marginalization for Clustering into deep clustering yields Deep Clustering via Distribution Learning (DCDL). The proposed DCDL achieves promising results compared with state-of-the-art methods on popular datasets. On clustering tasks, the new distribution learning method also outperforms previous methods.
