An Introductory Survey to Autoencoder-based Deep Clustering -- Sandboxes for Combining Clustering with Deep Learning
Collin Leiber, Lukas Miklautz, Claudia Plant, Christian Böhm
TL;DR
This survey addresses the challenge of clustering high-dimensional data without labels by leveraging autoencoders to learn clustering-friendly, non-linear embeddings. It systematically categorizes AE-based deep clustering methods along two dimensions: optimization strategy (sequential, alternating, simultaneous, generative) and neural network type, emphasizing AE architectures as versatile building blocks. By detailing representative algorithms (e.g., AE+$k$-Means, DEN, AEC, DCN, ACe/DeC, DipEncoder, DEC, IDEC, DCEC, DKM) and their objective formulations, the paper clarifies how reconstruction and clustering losses are balanced to produce effective embeddings and cluster assignments. It also discusses practical considerations such as parameterization, data augmentation, and self-supervised losses, and highlights future research directions in hyperparameter selection, robustness, and interpretability. Overall, the work consolidates the AE-based DC landscape, providing a foundation for researchers to design clustering-focused deep models and apply them across diverse data modalities.
Abstract
Autoencoders offer a general way of learning low-dimensional, non-linear representations from data without labels. This is achieved without making any particular assumptions about the data type or other domain knowledge. Their generality and domain agnosticism, combined with their simplicity, make autoencoders a perfect sandbox for researching and developing novel (deep) clustering algorithms. Clustering methods group data based on similarity, a task that benefits from the lower-dimensional representation learned by an autoencoder, which mitigates the curse of dimensionality. Specifically, the combination of deep learning with clustering, called Deep Clustering, makes it possible to learn a representation tailored to specific clustering tasks, leading to high-quality results. This survey provides an introduction to fundamental autoencoder-based deep clustering algorithms that serve as building blocks for many modern approaches.
