SHADE: Deep Density-based Clustering

Anna Beer; Pascal Weber; Lukas Miklautz; Collin Leiber; Walid Durani; Christian Böhm; Claudia Plant

TL;DR

SHADE tackles clustering in high-dimensional, noisy data by building density-connectivity into the training of a deep autoencoder. It combines a density-connectivity loss with a reconstruction loss to produce embeddings in which density-connected clusters are separated and their shapes are preserved. A stability-based structure tree derived from a minimum spanning tree (MST) then selects the clustering and flags noise points fully automatically, without a predefined number of clusters. Empirical results show that SHADE performs especially well on data with non-Gaussian clusters, such as video data, though centroid-based methods may still be preferable on some Gaussian-dominated datasets.

Abstract

Detecting arbitrarily shaped clusters in high-dimensional noisy data is challenging for current clustering methods. We introduce SHADE (Structure-preserving High-dimensional Analysis with Density-based Exploration), the first deep clustering algorithm that incorporates density-connectivity into its loss function. Similar to existing deep clustering algorithms, SHADE supports high-dimensional and large data sets with the expressive power of a deep autoencoder. In contrast to most existing deep clustering methods that rely on a centroid-based clustering objective, SHADE incorporates a novel loss function that captures density-connectivity. SHADE thereby learns a representation that enhances the separation of density-connected clusters. SHADE detects a stable clustering and noise points fully automatically without any user input. It outperforms existing methods in clustering quality, especially on data that contain non-Gaussian clusters, such as video data. Moreover, the embedded space of SHADE is suitable for visualization and interpretation of the clustering results as the individual shapes of the clusters are preserved.
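
To make the notion of density-connectivity more concrete, the following is a minimal sketch of one common way to turn it into a pairwise distance: the minimax path distance over mutual reachability distances, as used in the OPTICS/HDBSCAN family. Whether this matches SHADE's exact definition is an assumption; the function name `density_connectivity_distance`, the parameter `min_pts`, and the core-distance convention (the point itself counts as its first neighbor) are illustrative choices, not taken from the paper.

```python
import numpy as np

def density_connectivity_distance(X, min_pts=2):
    """Hypothetical helper: minimax path distance over mutual reachability.

    Two points are close if some path between them only uses short
    (dense) hops, regardless of how long the path is overall.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances in the original space.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Core distance: distance to the min_pts-th nearest point (self included).
    core = np.sort(dist, axis=1)[:, min_pts - 1]
    # Mutual reachability: an edge is at least as long as both core distances.
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    # Floyd-Warshall variant: path cost is its largest edge, keep the best path.
    d = mreach.copy()
    for k in range(len(X)):
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    return d

# Two dense groups on a line, separated by a gap of 8.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
d_dc = density_connectivity_distance(points, min_pts=2)
```

In this toy example, points in the same group end up at distance 1 (the longest hop inside the group), while points in different groups are at distance 8 (the gap that any connecting path must cross). The cubic-time loop is only suitable for small inputs; an MST-based computation would be the usual choice at scale.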

Paper Structure (42 sections, 4 equations, 6 figures, 9 tables)

Introduction
A novel deep density-based clustering method
Challenges of Density-Connected Structures
Separability between intra- and inter-cluster distances
Preserving distances between structurally relevant points
Non-contractible and intertwined clusters
Capturing Density-Connectivity
Background: Classic Density-Connectivity
Background: Hierarchical Density-Connected Structures
Density-Connectivity Loss
Preserving the Structure in the Embedding
Fully Automatic Clustering and Noise Detection in the Embedded Space
Stability of Clusters
Structure Tree
Stability
...and 27 more sections

Figures (6)

Figure 1: 3d dataset (a) and its 2d embedding created by our algorithm SHADE (b), a regular autoencoder (c), and its competitors (d)-(i); colors indicate ground truth clusters and numbers in brackets show the clustering quality measured by ARI. SHADE separates the clusters and keeps their shapes, whereas other methods merge them or change the shape entirely.
Figure 2: OPTICS reachability plots of the 3d dataset (a) and of its 2d embedding created by our algorithm SHADE (b), a regular autoencoder (c), and our competitors (d)-(i); colors indicate ground truth clusters and numbers in brackets show the cluster separability measured by the density cluster separability index (DCSI). SHADE retains the original 3d dataset's overall reachability structure and enhances cluster separability. In contrast, other methods merge the two intertwined clusters or fail to separate the intertwined clusters correctly. Additionally, the other methods reduce the overall cluster separability.
Figure 3: SHADE optimizes the loss functions $\mathcal{L}_D$ and $\mathcal{L}_{rec}$ simultaneously in a batch-wise manner. $\mathcal{L}_D$ aligns the density-connectivity in the original space with the Euclidean distances in the embedding. $\mathcal{L}_{rec}$, on the other hand, enforces that the original high-dimensional spatial structure or shape of the clusters is preserved in the learned embedding, allowing an accurate reconstruction. The final clustering $\mathcal{C}$ is obtained by selecting the clustering with the highest stability $S(\mathcal{C})$ based on the density-connectivity metric $d_{dc}$.
Figure 4: ARI on synthetic data with varying noise ratio and 100 dimensions. SHADE consistently surpasses all competitors. Note that SHADE reliably returns the highest ARI values, whereas our competitors have a high variance in clustering quality across ten runs. This shows the importance of inherent noise handling for deep clustering algorithms.
Figure 5: HAR confusion matrix. The x-axis shows the ground truth classes, the y-axis the clusters detected by SHADE.
...and 1 more figures
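
The caption of Figure 3 describes two losses optimized together: $\mathcal{L}_{rec}$ for reconstruction and $\mathcal{L}_D$ for aligning density-connectivity in the original space with Euclidean distances in the embedding. The sketch below shows one way such a combined objective could be evaluated for a single batch. The squared-difference form of the alignment term, the mean-squared reconstruction error, the `weight` parameter, and the function name `shade_style_loss` are all assumptions for illustration; the caption does not give the exact formulas.

```python
import numpy as np

def shade_style_loss(X, Z, X_rec, d_dc, weight=1.0):
    """Hypothetical batch loss: reconstruction + density-connectivity alignment.

    X      : (n, d) batch in the original space
    Z      : (n, k) embedded batch (encoder output)
    X_rec  : (n, d) reconstruction (decoder output)
    d_dc   : (n, n) precomputed density-connectivity distances on X
    weight : assumed trade-off factor between the two terms
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    X_rec = np.asarray(X_rec, dtype=float)
    d_dc = np.asarray(d_dc, dtype=float)

    # L_rec: mean squared reconstruction error.
    l_rec = np.mean((X - X_rec) ** 2)

    # Pairwise Euclidean distances between embedded points.
    diff = Z[:, None, :] - Z[None, :, :]
    emb_dist = np.sqrt((diff ** 2).sum(axis=-1))

    # L_D: penalize mismatch between embedding distances and d_dc,
    # counting each unordered pair once (upper triangle, no diagonal).
    iu = np.triu_indices(len(Z), k=1)
    l_d = np.mean((emb_dist[iu] - d_dc[iu]) ** 2)

    return l_rec + weight * l_d, l_rec, l_d

# Toy batch: perfect reconstruction, embedding distance 5.
X_toy = [[0.0, 0.0], [3.0, 4.0]]
Z_toy = [[0.0], [5.0]]
total_match, _, _ = shade_style_loss(X_toy, Z_toy, X_toy, [[0.0, 5.0], [5.0, 0.0]])
total_off, rec_off, dc_off = shade_style_loss(X_toy, Z_toy, X_toy, [[0.0, 3.0], [3.0, 0.0]])
```

When the embedding distance equals the target distance, the alignment term vanishes; when the target is 3 but the embedded points are 5 apart, the alignment term is (5 - 3)^2 = 4. In an actual training loop this quantity would be computed with an automatic differentiation framework so that gradients flow into the encoder and decoder.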
