Dying Clusters Is All You Need -- Deep Clustering With an Unknown Number of Clusters

Collin Leiber, Niklas Strauß, Matthias Schubert, Thomas Seidl

TL;DR

Addressing unsupervised clustering when the exact number of clusters is unknown, this paper introduces UNSEEN, a general framework that starts from an upper bound $k_{init}$ and progressively dissolves dying clusters defined by the ratio $|C_i^j|/|C_i^0| < t$, updating the active count $k_j$. A nearest-neighbor loss $L_{UNSEEN}$ is added to mitigate initialization bias and encourage cluster merging, with a variant $L_{UNSEEN}^{simul}$ for simultaneous algorithms to avoid embedding collapse, yielding total loss $L_{total} = \
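The dying-cluster criterion described above can be illustrated with a small sketch. It is not taken from the authors' repository: the function names `dying_clusters` and `dissolve_step` are hypothetical, and reassigning orphaned points to the nearest surviving center is an assumption made for illustration. Only the ratio test $|C_i^j|/|C_i^0| < t$ and the update of the active cluster count come from the summary.

```python
import numpy as np


def dying_clusters(initial_sizes, current_sizes, t):
    """Return indices i of clusters with |C_i^j| / |C_i^0| < t.

    Assumes every initial cluster size is non-zero.
    """
    initial_sizes = np.asarray(initial_sizes, dtype=float)
    current_sizes = np.asarray(current_sizes, dtype=float)
    ratios = current_sizes / initial_sizes
    return np.flatnonzero(ratios < t)


def dissolve_step(embedded, centers, labels, initial_sizes, t):
    """Hypothetical dissolve step: drop dying clusters and relabel points.

    Points of a dissolved cluster are moved to the nearest surviving
    center (an illustrative assumption, not confirmed by the source).
    Returns the surviving centers, the new labels, and the surviving
    original cluster indices.
    """
    embedded = np.asarray(embedded, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels, dtype=int)
    k = centers.shape[0]

    current_sizes = np.bincount(labels, minlength=k)
    dying = dying_clusters(initial_sizes, current_sizes, t)
    alive = np.setdiff1d(np.arange(k), dying)
    if alive.size == 0:
        raise ValueError("all clusters fell below the threshold t")

    # Map old cluster ids to new consecutive ids; dying clusters get -1.
    remap = np.full(k, -1)
    remap[alive] = np.arange(alive.size)
    new_labels = remap[labels]

    # Reassign orphaned points to the closest surviving center.
    orphan = new_labels == -1
    if orphan.any():
        diffs = embedded[orphan][:, None, :] - centers[alive][None, :, :]
        new_labels[orphan] = np.linalg.norm(diffs, axis=2).argmin(axis=1)

    return centers[alive], new_labels, alive


# Toy usage: k_init = 3, and cluster 1 has shrunk from 4 points to 1.
embedded = [[0, 0], [1, 0], [6, 6], [10, 10], [9, 10]]
centers = [[0, 0], [5, 5], [10, 10]]
labels = [0, 0, 1, 2, 2]
new_centers, new_labels, alive = dissolve_step(
    embedded, centers, labels, initial_sizes=[2, 4, 2], t=0.5
)
k_j = new_centers.shape[0]  # updated number of active clusters
print(k_j, new_labels.tolist())
```

In this toy run, cluster 1 has a ratio of $1/4 < 0.5$ and is dissolved, so the active count drops from 3 to 2. In a real deep clustering loop such a check would presumably run between training epochs, alongside the additional loss term mentioned above.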

Abstract

Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separately from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components. The code is available at: https://github.com/collinleiber/UNSEEN

Paper Structure

This paper contains 9 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Estimated number of clusters ($k_{pred}$) for synthetic datasets with the true number of clusters ($k_{true}$) within $[5, 30]$. The colored area marks the $95\%$ confidence interval.
  • Figure 2: Estimated number of clusters ($k_{pred}$) for MNIST when considering only the first $k_{true}$ clusters. The colored area marks the $95\%$ confidence interval.
  • Figure 3: Visualizations of different epochs of UNSEEN+DEC executed on USPS with and without using $\mathcal{L}_\text{UNSEEN}$. A two-dimensional representation of the embedding is obtained by applying t-SNE. Colors correspond to the current cluster labels.
  • Figure 4: Clustering accuracy (ACC) for different values of the dying threshold $t$. The colored area marks the $95\%$ confidence interval.