AuToMATo: An Out-Of-The-Box Persistence-Based Clustering Algorithm

Marius Huber, Sara Kalisnik, Patrick Schnider

TL;DR

AuToMATo introduces a persistence-based, out-of-the-box clustering algorithm that automatically calibrates a density-peak merging threshold via a bottleneck bootstrap, yielding robust clustering without manual tuning. Built on ToMATo, it uses a bootstrap-based estimate $\widehat{q}_{\alpha}$ to set $\tau = 2\widehat{q}_{\alpha}/\sqrt{n}$, improving stability across datasets and enhancing Mapper workflows. Experimental results show competitive or superior performance versus parameter-free baselines and many tuned parametric methods, with practical Mapper applications yielding accurate Reeb graphs and clearer topological structure. The open-source Python package, compatible with scikit-learn, enables broad adoption and integration into topological data analysis pipelines.
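As a small worked illustration of the threshold formula $\tau = 2\widehat{q}_{\alpha}/\sqrt{n}$ quoted above, the sketch below converts a given bootstrap estimate into a prominence threshold and applies it to a list of peak prominences. The function names `merging_threshold` and `significant_mask`, the example numbers, and the choice of `>=` rather than `>` at the boundary are assumptions made for illustration; none of this is taken from the authors' package.

```python
import numpy as np

def merging_threshold(q_hat, n):
    """Return tau = 2 * q_hat / sqrt(n) for a sample of n points."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 2.0 * q_hat / np.sqrt(n)

def significant_mask(prominences, tau):
    """Boolean mask of peaks whose prominence reaches the threshold.

    Using >= (rather than >) at the boundary is an assumption of this sketch.
    """
    return np.asarray(prominences, dtype=float) >= tau

# Hypothetical numbers: q_hat = 1.5 estimated from n = 400 points.
tau = merging_threshold(q_hat=1.5, n=400)  # 2 * 1.5 / 20 = 0.15
mask = significant_mask([0.02, 0.4, np.inf, 0.1], tau)
print(tau, mask.tolist())
```

For a fixed $\widehat{q}_{\alpha}$, the threshold shrinks like $1/\sqrt{n}$, so larger samples allow less prominent peaks to count as significant.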

Abstract

We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that make it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but in many instances also significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
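To make the notion of "significant peaks of an estimated density function" concrete, here is a minimal one-dimensional sketch that computes the prominence of every local maximum of a sequence of density values using a union-find sweep from high to low values. It is an illustration under simplifying assumptions: ToMATo operates on a neighborhood graph built from the data, whereas this sketch uses a path graph over a 1D grid, and the function name `peak_prominences` is hypothetical.

```python
import numpy as np

def peak_prominences(values):
    """Map each peak index to its prominence (birth value minus death value).

    Sweeps the values from highest to lowest; when two components meet,
    the one with the lower peak dies. The global maximum never dies.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return {}
    order = np.argsort(-values, kind="stable")
    parent = {}  # union-find; the root of a component is the index of its peak

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    prominence = {}
    for i in map(int, order):
        parent[i] = i
        for j in (i - 1, i + 1):
            if j not in parent:
                continue  # neighbor not yet processed (or out of range)
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # The component with the strictly higher peak survives.
            hi, lo = (ri, rj) if values[ri] > values[rj] else (rj, ri)
            if lo != i:  # a genuine peak dies at the current level
                prominence[lo] = float(values[lo] - values[i])
            parent[lo] = hi
    prominence[find(int(order[0]))] = float("inf")
    return prominence

# Three local maxima at indices 1, 3 and 5.
density = [0.0, 3.0, 1.0, 5.0, 2.5, 4.0, 0.0]
print(peak_prominences(density))
```

In this toy example the peak at index 5 has prominence 1.5, the peak at index 1 has prominence 2.0, and the global maximum at index 3 has infinite prominence; a prominence threshold then decides which of these count as separate clusters.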

Paper Structure

This paper contains 21 sections, 1 theorem, 9 equations, 14 figures, 1 algorithm.

Key Result

Theorem 2.4

Let $X\subseteq\mathbb{R}^{N}$ be a sample consisting of $n$ data points drawn according to a probability density function $f\colon K\to[0,1]$, $K\subset\mathbb{R}^{N}$. Denote by $\mathcal{D}\coloneqq\mathrm{Dgm}(K, f)$ and $\widehat{\mathcal{D}}\coloneqq\mathrm{Dgm}(X, f)$ the corresponding unknown [...] Then a consistent estimator for $q_{\alpha}$ is given by $\widehat{q}_{\alpha}$, which in turn is d[...]
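The theorem statement above is truncated, so the exact definition of $\widehat{q}_{\alpha}$ is not visible here. The sketch below shows the general shape of a bootstrap quantile estimate under a stated assumption: that $\widehat{q}_{\alpha}$ is the empirical $(1-\alpha)$-quantile of the $\sqrt{n}$-scaled distances between the diagram of each bootstrap sample and the diagram of the original sample. The callables `diagram_fn` and `distance_fn` are hypothetical placeholders for a persistence-diagram computation and a bottleneck distance; the toy stand-ins in the usage lines only exercise the resampling logic.

```python
import numpy as np

def bootstrap_quantile(X, diagram_fn, distance_fn, alpha, n_bootstrap=100, seed=0):
    """Estimate q_hat as the (1 - alpha)-quantile of scaled bootstrap distances.

    diagram_fn:  maps a sample to a summary (e.g. a persistence diagram).
    distance_fn: maps two summaries to a non-negative number
                 (e.g. a bottleneck distance).
    Both are supplied by the caller; nothing here computes real diagrams.
    """
    X = np.asarray(X)
    n = len(X)
    rng = np.random.default_rng(seed)
    reference = diagram_fn(X)
    scaled = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)  # n indices drawn with replacement
        scaled[b] = np.sqrt(n) * distance_fn(diagram_fn(X[idx]), reference)
    return float(np.quantile(scaled, 1.0 - alpha))

# Toy stand-ins: the "diagram" is the sample mean, the "distance" is |a - b|.
data = np.random.default_rng(1).normal(size=200)
q_hat = bootstrap_quantile(
    data,
    diagram_fn=lambda s: s.mean(),
    distance_fn=lambda a, b: abs(a - b),
    alpha=0.1,
)
print(q_hat)
```

Passing the diagram and distance computations in as callables keeps the resampling loop independent of any particular topological data analysis library.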

Figures (14)

  • Figure 1: A function $f\colon K\to\mathbb{R}$, $K\subset\mathbb{R}$, in red, and an estimate $\hat{f}$ of $f$ in blue (left), with corresponding persistence diagrams $\mathrm{Dgm}(K, f)$ and $\mathrm{Dgm}(\mathcal{G}, \hat{f})$ consisting of the red and blue dots, respectively, together with a dashed line separating noise from features (right).
  • Figure 2: Schematic of the methodology of AuToMATo: from a data set $X$, the usual ToMATo persistence diagram (with $\tau=+\infty$) is computed. Additionally, the analogous persistence diagrams are computed for the bootstrap samples $X_{1}^{*},\dots,X_{B}^{*}$, which are created from $X$ by drawing with replacement. Finally, the bootstrap procedure (indicated by $\otimes$) is used to compute a prominence threshold for the original persistence diagram.
  • Figure 3: Fowlkes-Mallows score of AuToMATo and DBSCAN across benchmarking data sets. The shading of "automato_mean" indicates the standard deviation of the score across the ten runs.
  • Figure 4: (a) input data set; result of Mapper with (b) AuToMATo; (c) DBSCAN; (d) HDBSCAN
  • Figure 5: Mapper applied to the diabetes data set with AuToMATo (left); DBSCAN (center); HDBSCAN (right). Labels 0, 1 and 2 stand for "no", "chemical" and "overt diabetes".
  • ...and 9 more figures
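Figure 3 reports results in terms of the Fowlkes-Mallows score. For readers unfamiliar with it, the sketch below computes the score from scratch as $\mathrm{TP}/\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})}$ over pairs of points, where a pair counts as a true positive if both labelings put its two points in the same cluster. The function name `fowlkes_mallows` and the toy labelings are illustrative; scikit-learn offers the same metric as `sklearn.metrics.fowlkes_mallows_score`.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows score between two labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    # Pairs placed together in both labelings (true positives).
    tp = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in the reference labeling: TP + FN.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    # Pairs placed together in the predicted labeling: TP + FP.
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    if same_true == 0 or same_pred == 0:
        return 0.0
    return tp / sqrt(same_true * same_pred)

truth = [0, 0, 1, 1]
print(fowlkes_mallows(truth, [1, 1, 0, 0]))  # same partition, relabeled
print(fowlkes_mallows(truth, [0, 0, 0, 0]))  # everything merged into one cluster
print(fowlkes_mallows(truth, [0, 1, 0, 1]))  # no co-clustered pair agrees
```

The score is invariant to renaming cluster labels, which is why the first call yields a perfect score even though the label values are swapped.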

Theorems & Definitions (5)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 2.4: chazal_bottleneck_bootstrap
  • Remark 3.1