Persistent Multiscale Density-based Clustering

Daniël Bot, Leland McInnes, Jan Aerts

TL;DR

PLSCAN reduces the hyperparameter tuning that density-based clustering demands in exploratory data analysis. It builds a leaf-cluster hierarchy from a single condensed HDBSCAN* tree by varying only the minimum cluster size, and uses persistence-based selection to reveal clusters that remain stable across scales. Framing the method as zero-dimensional persistent homology on a novel metric space gives a principled way to identify density maxima at multiple levels. On several real-world datasets, PLSCAN achieves a higher average ARI than HDBSCAN* and is less sensitive to the number of mutual reachability neighbours. Its run times are competitive with k-Means on low-dimensional datasets and scale more like those of HDBSCAN* at higher dimensions.

Abstract

Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN's computational costs to k-Means, demonstrating competitive run times on low-dimensional datasets. At higher dimensions, run times scale more like those of HDBSCAN*.
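
The adjusted Rand index (ARI) reported in the comparison above can be computed from the contingency table of two labelings. The following is a minimal standard-library sketch of the usual ARI formula, not the authors' evaluation code. The function name `adjusted_rand_index` is illustrative, and treating a noise label such as -1 as an ordinary cluster is an assumption that may differ from the paper's protocol.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same points.

    Assumption: every label, including a noise label such as -1, is
    treated as an ordinary cluster.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: the labelings trivially agree

    # Contingency counts and their marginals.
    joint = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of point pairs placed together in both / in each labeling.
    index = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = sum_true * sum_pred / total_pairs
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (index - expected) / (maximum - expected)


# Relabelled but identical partitions agree perfectly.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

An ARI of 1 indicates identical partitions up to relabelling, and values near 0 indicate agreement no better than chance.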

Paper Structure

This paper contains 27 sections, 14 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: OPTICS-like visualisation \cite{ankerst1999optics} demonstrating the minimum cluster size parameter's smoothing effect. (a) A 2D point cloud from \cite{mcinnes2022documentation} with HDBSCAN* leaf-clusters for $k=5$ and $m_{\rm c} = 100$. (b) HDBSCAN*'s modelled density profile at multiple minimum cluster sizes. The $x$-axis contains all data points ordered to mimic the shape of probability density functions. The ordering is fixed for all minimum cluster size thresholds and does not encode proximity in data space. Colour encodes the minimum cluster size threshold. Higher thresholds smooth the modelled density profile by pruning small peaks. The colour strip at the bottom indicates each point's cluster label.
  • Figure 2: Leaf tree construction explainer. (a) Condensed trees at multiple minimum cluster sizes for a 2D point cloud from \cite{mcinnes2022documentation}. The tree at the lowest threshold contains all merges present in the later trees. The size and distance at which segments merge remain constant regardless of the threshold. (b) A simple condensed tree $\bm{C}$ with two cluster merges and $\rm N = 150$, annotated for the first cluster merge. Cluster segments are identified by their leaf tree index $i$. (c) The resulting leaf tree, where $i$ indicates the leaf tree index that serves as an identifier. The background colours indicate when values were written: red for default values, teal for the first step, and violet for the second step. The uncoloured values are written in separate post-processing steps.
  • Figure 3: Novel PLSCAN concepts demonstrated on the data from Fig. \ref{fig:algorithm:leaf-tree:condensed-trees}. (a) The leaf tree describes which local density maxima exist at each cluster size threshold. Colours indicate the top-10 highest total persistence peaks. Icicle widths encode the clusters' excess of mass, i.e., the distance persistence sum over all points in the cluster \cite{campello2015hdbscan}. The clusters' birth distances increase with the minimum cluster size (Fig. \ref{fig:algorithm:leaf-tree:condensed-trees}), leading to smaller persistences at higher thresholds. (b) The persistence trace quantifies clustering 'quality'. We compute the total size persistence over all leaf clusters that exist at a particular minimum cluster size threshold. Alternatively, size--$d$ or size--$\lambda$ bi-persistences can be computed to incorporate the leaf-clusters' distance or density persistences. Dotted lines indicate local persistence trace maxima. (c) Peaks in the persistence trace represent other stable clusterings. PLSCAN can efficiently compute flat clusterings for these peaks with its cluster layers. The values between brackets indicate the total persistence for each layer.
  • Figure 4: Clusterings for different values of $k$ on the data from Fig. \ref{fig:algorithm:plscan}. PLSCAN automatically finds a 'good' minimum cluster size threshold. Consequently, its clusterings vary predictably with $k$. For example, the detected clusters shrink at higher $k$ values. HDBSCAN*'s clusterings, on the other hand, vary considerably with $k$ when the minimum cluster size is also set to $k$ \cite{mcinnes2017hdbscan, mcinnes2023fasthdbscan}.
  • Figure 5: ARI--$k$ curves for each algorithm configuration. Coloured lines indicate the algorithm configurations: HDBSCAN* with EOM in blue and leaf in orange; PLSCAN with size in green, size--$d$ in red, size--$\lambda$ in purple, $d$ in brown, and $\lambda$ in pink. Lighter shaded lines indicate scores for the best of the top-5 PLSCAN layers per $k$. The ARI was Lowess-interpolated \cite{cleveland1979robust} over all evaluated datasets (Tab. \ref{tab:demo:sensitivity:datasets}). The interpolation considered the $5\%$ closest samples along $k$ for estimating each ARI value. The ARI scores were divided by the highest observed value on each dataset prior to the interpolation to retain the curves' shape. Consequently, the resulting curves do not indicate absolute performance. Instead, they describe changes in performance as a function of $k$. Shaded areas indicate the $95\%$ confidence intervals. A faint dotted line in black indicates the maximum observed (normalised) value.
  • ...and 6 more figures
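
The persistence trace described in the caption of Figure 3(b) can be illustrated with a small sketch. It assumes each leaf cluster is summarised by a hypothetical `(birth_size, death_size)` pair: the range of minimum cluster size thresholds over which it exists as a leaf. The half-open interval convention, the function names, and the toy intervals are all assumptions made for illustration, and the paper's exact definitions may differ.

```python
def persistence_trace(leaves):
    """Total size persistence of the leaf clusters alive at each threshold.

    `leaves` is a list of (birth_size, death_size) pairs. Assumption: a
    leaf exists for thresholds m with birth_size <= m < death_size, and
    its size persistence is death_size - birth_size.
    """
    # The trace can only change where a leaf is born or dies.
    thresholds = sorted({size for interval in leaves for size in interval})
    trace = []
    for m in thresholds:
        total = sum(
            death - birth for birth, death in leaves if birth <= m < death
        )
        trace.append((m, total))
    return trace


def local_maxima(trace):
    """Thresholds whose trace value is a (non-zero) local maximum."""
    peaks = []
    for i, (m, value) in enumerate(trace):
        left = trace[i - 1][1] if i > 0 else float("-inf")
        right = trace[i + 1][1] if i + 1 < len(trace) else float("-inf")
        # On a plateau, only the last threshold is reported.
        if value > 0 and value >= left and value > right:
            peaks.append((m, value))
    return peaks


# Toy leaf intervals (hypothetical values, not taken from the paper).
leaves = [(2, 10), (2, 6), (6, 30), (10, 40)]
trace = persistence_trace(leaves)
print(trace)                # [(2, 12), (6, 32), (10, 54), (30, 30), (40, 0)]
print(local_maxima(trace))  # [(10, 54)]
```

In this toy example the trace peaks at a threshold of 10, which would be the candidate minimum cluster size for a flat clustering. Further local maxima, when present, would correspond to the alternative stable clusterings mentioned in Figure 3(c).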