FLASC: A Flare-Sensitive Clustering Algorithm

D. M. Bot, J. Peeters, J. Liesenborgs, J. Aerts

TL;DR

This paper presents flare-sensitive clustering (FLASC), an algorithm that detects branches within clusters to identify shape-based subgroups, and shows that both of its variants scale similarly to HDBSCAN* in computational cost and provide similar outputs across repeated runs.

Abstract

Clustering algorithms are often used to find subpopulations in exploratory data analysis workflows. Not only the clusters themselves, but also their shape can represent meaningful subpopulations. In this paper, we present FLASC, an algorithm that detects branches within clusters to identify such subpopulations. FLASC builds upon HDBSCAN*, a state-of-the-art density-based clustering algorithm, and detects branches in a post-processing step that describes within-cluster connectivity. Two variants of the algorithm are presented, which trade computational cost for noise robustness. We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs using synthetic data sets, resulting in an efficient flare-sensitive clustering algorithm. In addition, we demonstrate the benefit of branch-detection on two real-world data sets.

Paper Structure (31 sections, 6 equations, 8 figures)

Figures (8)

  • Figure 1: Density-based clustering concepts behind HDBSCAN*. (a) A 2D example point cloud with varying density adapted from mcinnes2022documentation's online tutorial. (b) Density contours in a height map illustrate the data's density profile. Peaks in this density profile correspond to density contour clusters. (c) Clusters extracted from the density profile by HDBSCAN* indicated in colour. (d) The density contour tree describes how density contour clusters merge when considering lower density thresholds.
  • Figure 2: Density-based clustering concepts behind FLASC. (a) A within-cluster eccentricity $e(\mathbf{x}_i)$ is defined for each point $\mathbf{x}_i$ in cluster $C_j$ based on distances to the cluster's membership weighted average shown by the pentagon mark. (b) The cluster's eccentricity profile visualised as contours on a height map. Peaks in the profile correspond to branches in the cluster. (c) Branches extracted from the cluster by FLASC indicated in colour. The cluster's centre is given its own label. (d) The eccentricity contour tree describes how branches merge when considering lower eccentricity thresholds.
  • Figure 3: Different ways to combine cluster and branch membership probabilities. The cluster and branch probability average (a) and product (b) are visualised with desaturation. (c) Points labelled by the geodesically closest branch root---i.e., the point closest to the branch's weighted average---and desaturated as in (a). (d) Weighted branch membership for the orange branch is visualised by transparency. Branch memberships are computed from the traversal distance to the branch's root.
  • Figure 4: Explanatory figure for the centroid spread stability measure. Predicted centroids---weighted average coordinates---are computed for each predicted subgroup and assigned to the closest ground-truth centroid. The 95th-percentile centroid spread can be interpreted by translating each group's average to the origin and finding the radius of a circle that includes 95% of the predicted centroids.
  • Figure 5: Results for the stability benchmark. (a) One point cloud with $b_l = 18$ and $n_r = 0.47$ coloured by ground truth and predicted labels that characterise the algorithms' behaviours. (b) Heatmap with average ARI values for FLASC and kMeans over all branch lengths and noise ratios. (c) Heatmap with 95% centroid spread values for FLASC and kMeans over all branch lengths and noise ratios.
  • ...and 3 more figures
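
The captions for Figures 2 and 4 describe two quantities that are easy to misread in prose: the within-cluster eccentricity and the 95th-percentile centroid spread. The sketch below is one possible reading of those captions, not the authors' implementation. It assumes plain Euclidean distance for the eccentricity, takes "each group's average" to mean the unweighted mean of the predicted centroids assigned to the same ground-truth centroid, and uses a nearest-rank percentile. The function names `weighted_centroid`, `eccentricity`, and `centroid_spread` are hypothetical and do not come from the paper.

```python
import math


def weighted_centroid(points, weights):
    """Membership-weighted average of a list of coordinate tuples."""
    total = sum(weights)
    dim = len(points[0])
    return [
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    ]


def eccentricity(points, weights):
    """Distance from each point to the cluster's weighted centroid.

    Assumption: Euclidean distance; the caption only says the measure is
    "based on distances" to the membership-weighted average.
    """
    centre = weighted_centroid(points, weights)
    return [math.dist(p, centre) for p in points]


def centroid_spread(predicted, truth, q=0.95):
    """Radius around the group averages that contains a fraction q of the
    predicted centroids (nearest-rank percentile, an assumption)."""
    # Assign every predicted centroid to its closest ground-truth centroid.
    groups = {}
    for p in predicted:
        nearest = min(range(len(truth)), key=lambda i: math.dist(p, truth[i]))
        groups.setdefault(nearest, []).append(p)

    # Translate each group's average to the origin and collect the radii.
    radii = []
    for members in groups.values():
        mean = weighted_centroid(members, [1.0] * len(members))
        radii.extend(math.dist(p, mean) for p in members)

    radii.sort()
    rank = max(1, math.ceil(q * len(radii)))
    return radii[rank - 1]
```

Under this reading, points far from the cluster's centre receive high eccentricity, so peaks in the eccentricity profile correspond to the ends of branches, and a smaller centroid spread over repeated runs indicates more stable subgroup predictions.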