SHADE: Deep Density-based Clustering

Anna Beer, Pascal Weber, Lukas Miklautz, Collin Leiber, Walid Durani, Christian Böhm, Claudia Plant

TL;DR

SHADE tackles clustering in high-dimensional, noisy data by embedding density-connectivity into a deep autoencoder, learning representations that preserve density-connected structures. It combines a density-connectivity loss with a reconstruction loss to produce embeddings in which density-connected clusters are separated and their shapes preserved. A stability-based structure tree derived from a minimum spanning tree (MST) enables fully automatic clustering and noise detection without a predefined number of clusters. Empirical results show that SHADE excels on non-Gaussian and video data, although some Gaussian-dominated datasets may still favor centroid-based methods.

Abstract

Detecting arbitrarily shaped clusters in high-dimensional noisy data is challenging for current clustering methods. We introduce SHADE (Structure-preserving High-dimensional Analysis with Density-based Exploration), the first deep clustering algorithm that incorporates density-connectivity into its loss function. Similar to existing deep clustering algorithms, SHADE supports high-dimensional and large data sets with the expressive power of a deep autoencoder. In contrast to most existing deep clustering methods that rely on a centroid-based clustering objective, SHADE incorporates a novel loss function that captures density-connectivity. SHADE thereby learns a representation that enhances the separation of density-connected clusters. SHADE detects a stable clustering and noise points fully automatically without any user input. It outperforms existing methods in clustering quality, especially on data that contain non-Gaussian clusters, such as video data. Moreover, the embedded space of SHADE is suitable for visualization and interpretation of the clustering results as the individual shapes of the clusters are preserved.

Paper Structure (42 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: 3d dataset (a) and its 2d embedding created by our algorithm SHADE (b), a regular autoencoder (c), and its competitors (d)-(i); colors indicate ground truth clusters and numbers in brackets show the clustering quality measured by ARI. SHADE separates the clusters and keeps their shapes, whereas other methods merge them or change the shape entirely.
  • Figure 2: OPTICS reachability plots of the 3d dataset (a) and of its 2d embedding created by our algorithm SHADE (b), a regular autoencoder (c), and our competitors (d)-(i); colors indicate ground truth clusters and numbers in brackets show the cluster separability measured by the density cluster separability index (DCSI). SHADE retains the original 3d dataset's overall reachability structure and enhances cluster separability. In contrast, other methods either merge the two intertwined clusters or fail to separate them correctly, and they reduce the overall cluster separability.
  • Figure 3: SHADE optimizes the loss functions $\mathcal{L}_D$ and $\mathcal{L}_{rec}$ simultaneously in a batch-wise manner. $\mathcal{L}_D$ aligns the density-connectivity in the original space with the Euclidean distances in the embedding. $\mathcal{L}_{rec}$, on the other hand, enforces that the original high-dimensional spatial structure or shape of the clusters is preserved in the learned embedding, allowing an accurate reconstruction. The final clustering $\mathcal{C}$ is obtained by selecting the clustering with the highest stability $S(\mathcal{C})$ based on the density-connectivity metric $d_{dc}$.
  • Figure 4: ARI on synthetic data with varying noise ratio and 100 dimensions. SHADE consistently surpasses all competitors. Note that SHADE reliably returns the highest ARI values, whereas our competitors have a high variance in clustering quality across ten runs. This shows the importance of inherent noise handling for deep clustering algorithms.
  • Figure 5: Confusion matrix for HAR. The x-axis gives the ground truth classes, the y-axis the clusters detected by SHADE.
  • ...and 1 more figure
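
To make the interplay of the two losses in the Figure 3 caption more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. Several things are assumptions made for illustration: $d_{dc}$ is approximated by the minimax path distance over plain Euclidean edge lengths (no core distances or density parameter), $\mathcal{L}_D$ is written as a mean squared difference between embedded Euclidean distances and that approximation, $\mathcal{L}_{rec}$ is a mean squared reconstruction error, and the function names and the `weight` parameter are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimax_path_distances(points):
    """Assumed stand-in for d_dc: for each pair, the smallest achievable
    'largest hop' over all paths connecting the two points.

    Computed Kruskal-style: edges are processed in ascending length, and the
    edge that first joins two components is the minimax distance for every
    pair split across those components.
    """
    n = len(points)
    edges = sorted(
        (euclidean(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    root = list(range(n))                 # component label of each point
    members = {i: [i] for i in range(n)}  # component label -> its points
    dist = [[0.0] * n for _ in range(n)]
    for weight, i, j in edges:
        ri, rj = root[i], root[j]
        if ri == rj:
            continue  # already connected through shorter hops
        for a in members[ri]:
            for b in members[rj]:
                dist[a][b] = dist[b][a] = weight
        for b in members[rj]:
            root[b] = ri
        members[ri].extend(members.pop(rj))
    return dist


def density_loss(batch, embedded):
    """Assumed form of L_D: mean squared gap between Euclidean distances in
    the embedding and minimax path distances in the original space."""
    d = minimax_path_distances(batch)
    pairs = list(combinations(range(len(batch)), 2))
    return sum(
        (euclidean(embedded[i], embedded[j]) - d[i][j]) ** 2 for i, j in pairs
    ) / len(pairs)


def reconstruction_loss(batch, reconstructed):
    """Assumed form of L_rec: mean squared reconstruction error per point."""
    return sum(
        euclidean(x, r) ** 2 for x, r in zip(batch, reconstructed)
    ) / len(batch)


def combined_loss(batch, embedded, reconstructed, weight=1.0):
    """Batch-wise sum of both terms; `weight` is a hypothetical trade-off."""
    return density_loss(batch, embedded) + weight * reconstruction_loss(
        batch, reconstructed
    )
```

In a real training loop, `embedded` and `reconstructed` would come from the encoder and decoder of the autoencoder, and the loss would be written in a differentiable framework; the lists of tuples here only illustrate which quantities are compared.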