AuToMATo: An Out-Of-The-Box Persistence-Based Clustering Algorithm

Marius Huber; Sara Kalisnik; Patrick Schnider

TL;DR

AuToMATo introduces a persistence-based, out-of-the-box clustering algorithm that automatically calibrates a density-peak merging threshold via a bottleneck bootstrap, yielding robust clustering without manual tuning. Built on ToMATo, it uses a bootstrap-based estimate $\widehat{q}_{\alpha}$ to set $\tau = 2\widehat{q}_{\alpha}/\sqrt{n}$, improving stability across datasets and enhancing Mapper workflows. Experimental results show competitive or superior performance versus parameter-free baselines and many tuned parametric methods, with practical Mapper applications yielding accurate Reeb graphs and clearer topological structure. The open-source Python package, compatible with scikit-learn, enables broad adoption and integration into topological data analysis pipelines.
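The threshold rule $\tau = 2\widehat{q}_{\alpha}/\sqrt{n}$ can be illustrated with a minimal sketch. It assumes that $\widehat{q}_{\alpha}$ is taken as the empirical $(1-\alpha)$-quantile of the scaled bootstrap distances $\sqrt{n}\,d_B$, as in the usual bottleneck bootstrap; this summary does not spell out the definition, so treat that choice as an assumption. The function names `prominence_threshold` and `significant_peaks` are hypothetical and are not part of the AuToMATo package, and the bottleneck distances are assumed to have been computed elsewhere.

```python
import numpy as np


def prominence_threshold(bootstrap_distances, n, alpha):
    """Return tau = 2 * q_hat / sqrt(n) from precomputed bootstrap distances.

    bootstrap_distances: bottleneck distances between the diagram of the
        original sample and the diagram of each bootstrap sample (assumed
        to be computed by some other routine).
    n: number of points in the original sample.
    alpha: significance level in (0, 1).
    """
    d = np.asarray(bootstrap_distances, dtype=float)
    if d.ndim != 1 or d.size == 0:
        raise ValueError("bootstrap_distances must be a non-empty 1-D sequence")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Assumption: q_hat is the empirical (1 - alpha)-quantile of sqrt(n) * d.
    q_hat = np.quantile(np.sqrt(n) * d, 1.0 - alpha)
    return 2.0 * q_hat / np.sqrt(n)


def significant_peaks(prominences, tau):
    """Boolean mask of density peaks whose prominence exceeds tau."""
    return np.asarray(prominences, dtype=float) > tau
```

Peaks whose prominence exceeds the returned value would be kept as clusters, while the remaining peaks would be merged into their neighbors by ToMATo.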

Abstract

We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that make it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but in many instances also significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
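The bootstrapping procedure mentioned above starts by resampling the data set with replacement. The following is a minimal sketch of that resampling step only, not of the authors' implementation; the helper name `bootstrap_samples` and its `seed` argument are hypothetical.

```python
import numpy as np


def bootstrap_samples(X, B, seed=None):
    """Draw B bootstrap samples from X, each with as many rows as X.

    Rows are drawn uniformly at random with replacement, so a sample
    may contain repeated points and omit others.
    """
    X = np.asarray(X)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # n row indices, with replacement
        samples.append(X[idx])
    return samples
```

In the full method, a persistence diagram would then be computed for each sample and compared with the diagram of the original data.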

Paper Structure (21 sections, 1 theorem, 9 equations, 14 figures, 1 algorithm)

Introduction
Background
Persistence and the ToMATo clustering algorithm
The bottleneck bootstrap
Methodology and implementation of AuToMATo
Methodology of AuToMATo
Implementation of AuToMATo
Choice of ToMATo parameters:
Choice of $\alpha$ and $B$:
Complexity analysis of Algorithm \ref{alg:automato}:
The Python package:
Experiments
Choice of clustering algorithms for comparison
Choice of data sets
Methodology of the experiments
...and 6 more sections

Key Result

Theorem 2.4

Let $X\subseteq\mathbb{R}^{N}$ be a sample consisting of $n$ data points drawn according to a probability density function $f\colon K\to[0,1]$, $K\subset\mathbb{R}^{N}$. Denote by $\mathcal{D}\coloneqq\mathrm{Dgm}(K, f)$ and $\widehat{\mathcal{D}}\coloneqq\mathrm{Dgm}(X, f)$ the corresponding unknown … Then a consistent estimator for $q_{\alpha}$ is given by $\widehat{q}_{\alpha}$, which in turn is d…

Figures (14)

Figure 1: A function $f\colon K\to\mathbb{R}$, $K\subset\mathbb{R}$, in red, and an estimate $\hat{f}$ of $f$ in blue (left), with corresponding persistence diagrams $\mathrm{Dgm}(K, f)$ and $\mathrm{Dgm}(\mathcal{G}, \hat{f})$ consisting of the red and blue dots, respectively, together with a dashed line separating noise from features (right).
Figure 2: Schematic of the methodology of AuToMATo: from a data set $X$, the usual ToMATo persistence diagram (with $\tau=+\infty$) is computed. Additionally, the analogous persistence diagrams are computed for the bootstrap samples $X_{1}^{*},\dots,X_{B}^{*}$, which are created from $X$ by drawing with replacement. Finally, the bootstrap procedure (indicated by $\otimes$) is used to compute a prominence threshold for the original persistence diagram.
Figure 3: Fowlkes-Mallows score of AuToMATo and DBSCAN across benchmarking data sets. The shading of "automato_mean" indicates the standard deviation of the score across the ten runs.
Figure 4: (a) input data set; result of Mapper with (b) AuToMATo; (c) DBSCAN; (d) HDBSCAN
Figure 5: Mapper applied to the diabetes data set with AuToMATo (left); DBSCAN (center); HDBSCAN (right). Labels 0, 1 and 2 stand for "no", "chemical" and "overt diabetes".
...and 9 more figures

Theorems & Definitions (5)

Definition 2.1
Definition 2.2
Definition 2.3
Theorem 2.4: chazal_bottleneck_bootstrap
Remark 3.1
