Analysing Multiscale Clusterings with Persistent Homology

Juni Schindler; Mauricio Barahona

TL;DR

The paper introduces the Multiscale Clustering Filtration (MCF), a topological framework for analysing sequences of partitions across scales that need not be hierarchical. Using persistent homology (PH), it quantifies the degree of hierarchy through zero-dimensional PH ($PH_0$) and tracks the emergence and resolution of conflicts between cluster assignments across scales through higher-dimensional PH. The MCF admits an equivalent construction as a nerve complex filtration and, in the hierarchical case, reduces to a Vietoris-Rips (VR) filtration of an ultrametric space. On multiscale clusterings of synthetic stochastic block models (SBMs), the authors show that persistence diagrams (PDs) provide a compact, discriminative feature map: the Wasserstein distance between PDs of structurally similar sequences of partitions is small. The work connects cluster analysis with topological data analysis (TDA), enables comparison and learning on non-hierarchical multiscale clusterings, and offers practical implementations together with future directions on stability and scalability.

Abstract

In data clustering, it is often desirable to find not just a single partition into clusters but a sequence of partitions that describes the data at different scales (or levels of coarseness). A natural problem then is to analyse and compare the (not necessarily hierarchical) sequences of partitions that underpin such multiscale descriptions. Here, we use tools from topological data analysis and introduce the Multiscale Clustering Filtration (MCF), a well-defined and stable filtration of abstract simplicial complexes that encodes arbitrary cluster assignments in a sequence of partitions across scales of increasing coarseness. We show that the zero-dimensional persistent homology of the MCF measures the degree of hierarchy of this sequence, and the higher-dimensional persistent homology tracks the emergence and resolution of conflicts between cluster assignments across the sequence of partitions. To broaden the theoretical foundations of the MCF, we provide an equivalent construction via a nerve complex filtration, and we show that, in the hierarchical case, the MCF reduces to a Vietoris-Rips filtration of an ultrametric space. Using synthetic data, we then illustrate how the persistence diagram of the MCF provides a feature map that can serve to characterise and classify multiscale clusterings.
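The construction summarised in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes that the complex at scale $t$ is the union, over all scales $s \le t$, of the solid simplices spanned by the clusters of the partition at scale $s$. The function name `mcf_filtration`, the `max_dim` truncation and the four-point toy sequence are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def mcf_filtration(partitions, times, max_dim=2):
    """Map each simplex (sorted tuple of points) to the scale at which it first appears.

    Assumption: K^t is the union, over all scales s <= t, of the solid
    simplices spanned by the clusters of the partition at scale s.
    """
    births = {}
    # process the partitions in order of increasing scale
    for t, partition in sorted(zip(times, partitions), key=lambda pair: pair[0]):
        for cluster in partition:
            points = sorted(cluster)
            # all faces of the cluster simplex, truncated at dimension max_dim
            for size in range(1, min(len(points), max_dim + 1) + 1):
                for simplex in combinations(points, size):
                    births.setdefault(simplex, t)  # keep the earliest scale
    return births


# Hypothetical toy sequence on four points; it is non-hierarchical because
# the cluster {c, d} formed at scale 2 is split again at scale 3.
times = [1, 2, 3]
partitions = [
    [{"a"}, {"b"}, {"c"}, {"d"}],
    [{"a", "b"}, {"c", "d"}],
    [{"a", "b", "c"}, {"d"}],
]
births = mcf_filtration(partitions, times)
for simplex, t in sorted(births.items(), key=lambda item: (item[1], len(item[0]), item[0])):
    print(t, simplex)
```

The resulting simplex/birth pairs form a filtered simplicial complex that could be passed to any persistent homology library accepting such input. Computing $k$-dimensional persistent homology requires simplices up to dimension $k+1$, which is why `max_dim` defaults to 2 here.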

Paper Structure (29 sections, 13 theorems, 42 equations, 7 figures)


Introduction
Approach and Contributions
Organisation of the Article
Theoretical Background
Multiscale Clustering
Partitions of a Set, Refinement, Hierarchy
Multiscale Clustering
Persistent homology
Simplicial Complex
Simplicial Homology
Filtrations
Persistent Homology
Persistence Diagrams
Distance Measures for PDs
Multiscale Clustering Filtration
...and 14 more sections

Key Result

Proposition 2

The MCF $\mathcal{M}=(K^{t})_{t \ge t_1}$ is a filtration of abstract simplicial complexes.
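A brief sketch of why the nesting holds is given below. The definition of $K^t$ is not reproduced on this page, so the formula is an assumption: $K^t$ is taken to be the union of the solid simplices $\Delta C$ spanned by the clusters $C$ of all partitions $\theta(s)$ with $t_1 \le s \le t$.

```latex
K^{t} := \bigcup_{t_1 \le s \le t} \; \bigcup_{C \in \theta(s)} \Delta C
\quad\Longrightarrow\quad
K^{t} \subseteq K^{t'} \quad \text{for all } t_1 \le t \le t'.
```

Under this assumption each $K^t$ is a union of solid simplices and is therefore closed under taking faces, so it is an abstract simplicial complex. Enlarging the index set of the outer union can only add simplices, which gives the inclusion.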

Figures (7)

Figure 1: These simple examples illustrate how the persistence diagram (PD) of the Multiscale Clustering Filtration (MCF) summarises the properties of multiscale sequences of partitions. (A) For a hierarchical dendrogram on 16 data points (visualised here as a Sankey diagram at scales $t=0, \ldots, 6$), the MCF is equivalent to a Vietoris-Rips filtration on the corresponding ultrametric space (see Corollary \ref{cor:hierarchical_cag_equivalent}); hence its PD has only zero-dimensional invariants (shown as red circles, annotated with the number of overlapping circles), which count the merges in the dendrogram (see Corollary \ref{cor:higher_dim_ph_0_when_hierarchical}). (B) For a non-hierarchical multiscale clustering (for which the Sankey diagram has non-trivial crossings), the MCF captures the emergence of conflicts between cluster assignments at scales $t=3$ and $t=4$ through the birth of one-dimensional invariants in the PD (blue points), and the resolution of these conflicts at $t=5$ through the death of the invariants (see Remark \ref{rem:k-conflicts} and Proposition \ref{prop:kconflicts-intersections}). (C) For a complex non-hierarchical multiscale clustering on 270 points, the PD of the MCF provides a concise description in terms of births and deaths of invariants of different dimensions. Our numerical experiments in Section \ref{sec:NumericalExperiments} show that the Wasserstein distance between PDs of structurally similar sequences of partitions is small (see Figure \ref{fig:model_comparison}); hence the PDs can be used as feature maps to characterise and classify multiscale clusterings.
Figure 2: MCF construction. Illustration of the MCF on a set of three points $X=\{x_1,x_2,x_3\}$ as per Example \ref{ex:MCF_illustration}. The top row shows the non-hierarchical sequence of partitions $\theta:[1,\infty)\rightarrow\Pi_X, t \mapsto \theta(t)$ (and a corresponding Sankey diagram), which is determined by its values at the critical values, $\theta(t_i)=\mathcal{P}^i$ for $t_i:=i$, $i=1,\ldots,5$. The bottom row shows the filtered simplicial complex $(K^t)_{1\le t\le 5}$. The first non-hierarchy in the sequence of partitions appears at filtration index $t=3$, when the number of clusters in $\theta(3)$ is for the first time larger than the number of connected components in $K^3$, leading to a so-called 0-conflict (see Example \ref{Ex:running_hierarchy}). At filtration index $t=4$, the three elements $x_1$, $x_2$ and $x_3$ are in a so-called 1-conflict, which arises from three different cluster assignments that produce a non-bounding 1-cycle $[x_1,x_2]+[x_2,x_3]+[x_3,x_1]$ (see Example \ref{ex:1_conflict}). Both kinds of conflict are resolved at index $t=5$, when the 2-simplex $[x_1,x_2,x_3]$ is added to $K^5$, making $\theta(5)$ a conflict-resolving partition (see Remark \ref{rem:heuristic_gaps}).
Figure 3: Sankey diagrams for multiscale clusterings of realisations of different SBMs. i)-iv) For each SBM model (ER, sSBM, mSBM, nh-mSBM), we present the adjacency matrix of a realisation and the corresponding (non-hierarchical) sequence of partitions $\theta_i:[-1.5,0.5]\rightarrow \Pi_V, t\mapsto\theta_i(t)$ obtained with Markov Stability (MS) and visualised using a Sankey diagram.
Figure 4: Pairwise comparison between models. Pairwise distances between all $i=1,\ldots,800$ model realisations of the four ensembles (ER, sSBM, mSBM and nh-mSBM): (A) Frobenius distance of adjacency matrices $\|A_i-A_j\|_F$; (B) 2-Wasserstein distance of the zero-dimensional PDs $d_{W,2}(\mathrm{Dgm}_0(\mathcal{M}_i), \mathrm{Dgm}_0(\mathcal{M}_j))$; (C) 2-Wasserstein distance of the one-dimensional PDs $d_{W,2}(\mathrm{Dgm}_1(\mathcal{M}_i), \mathrm{Dgm}_1(\mathcal{M}_j))$. Whereas the Frobenius distance (A) is not able to distinguish the models, the 2-Wasserstein distance of zero-dimensional PDs (B) distinguishes the models based on their hierarchical structure and the 2-Wasserstein distance of one-dimensional PDs (C) distinguishes the models based on their multiscale structure.
Figure 5: MCF persistence diagrams, persistent hierarchy and persistent conflict for different models. i)-iv) For each model (ER, sSBM, mSBM, nh-mSBM), we compute: the ensemble PD of all 200 samples from each model (top row); the average one- and two-dimensional Betti curves and persistent conflict $c(t)$ (Eq. \ref{eq:persistent_conflict_total}) with 95% confidence intervals (middle row); the average zero-dimensional Betti curve, number of clusters and persistent hierarchy $h(t)$ (Eq. \ref{eq:persistent_hierachy}) with 95% confidence intervals (bottom row). The gaps in the PDs of sSBM, mSBM and nh-mSBM, which are also linked with plateaux after dips in $c(t)$, indicate conflict-resolving partitions and correspond well with ground-truth planted partitions at different scales (shaded in pink) identified from the data (Figure \ref{S_fig:ensembles_nvi}). In contrast, no gaps or plateaux are present for the ER model, confirming its lack of robust partitions. The persistent hierarchy $h(t)$ is highest for mSBM and lowest for the ER model.
...and 2 more figures
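The 0-conflict criterion stated in the caption of Figure 2 (more clusters in $\theta(t)$ than connected components in $K^t$) can be checked with a short script. This is a sketch under stated assumptions: the five partitions below are a hypothetical reconstruction chosen to be consistent with the caption, not a copy of the paper's example, and the helper `count_components` is an illustrative name. Because every cluster is assumed to span a solid simplex, the connected components of $K^t$ are obtained by merging the points of every cluster seen up to scale $t$.

```python
def count_components(points, partitions_so_far):
    """Count connected components after merging the points of every cluster seen so far."""
    parent = {x: x for x in points}

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for partition in partitions_so_far:
        for cluster in partition:
            members = sorted(cluster)
            for other in members[1:]:
                parent[find(other)] = find(members[0])
    return len({find(x) for x in points})


points = ["x1", "x2", "x3"]
# Hypothetical partitions at scales t = 1, ..., 5 (assumed, consistent with the caption)
theta = {
    1: [{"x1"}, {"x2"}, {"x3"}],
    2: [{"x1", "x2"}, {"x3"}],
    3: [{"x1"}, {"x2", "x3"}],
    4: [{"x1", "x3"}, {"x2"}],
    5: [{"x1", "x2", "x3"}],
}

rows = []
for t in sorted(theta):
    clusters = len(theta[t])
    components = count_components(points, [theta[s] for s in theta if s <= t])
    # a 0-conflict is flagged when theta(t) has more clusters than K^t has components
    rows.append((t, clusters, components, clusters > components))

for row in rows:
    print(row)
```

For this assumed sequence the flag is first raised at $t=3$ and cleared at $t=5$, in line with the caption. Detecting the 1-conflict at $t=4$ would require computing one-dimensional homology, which this sketch does not attempt.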

Theorems & Definitions (55)

Definition 1: Multiscale Clustering Filtration
Proposition 2
proof
Remark 3
Remark 4
Remark 5
Example 6: Running example
Remark 7: Ordering of sequence of partitions
Remark 8
Remark 9: Stability of MCF
...and 45 more
