A distribution-guided Mapper algorithm

Yuyang Tao, Shufei Ge

TL;DR

This work addresses the limitation of fixed interval covers in Mapper by introducing D-Mapper, a density-guided variant that fits a mixture model to the projected data and generates flexible, overlapping intervals from the $1-\alpha$ confidence intervals of the mixture components. It couples this with a novel evaluation framework that combines clustering quality and topological fidelity through $SC$, $TSR$, and the adjusted metric $SC_{adj}$, with bottleneck bootstrap confidence sets used to separate topological signal from noise. Across synthetic examples and a SARS-CoV-2 RNA dataset, D-Mapper consistently improves the topology-aware clustering measure $SC_{adj}$ while maintaining or improving topology preservation, and it captures both vertical and horizontal evolutionary processes in the viral data. The approach demonstrates the value of integrating probabilistic density modeling with Mapper to reveal richer data shapes and informs future extensions such as nonparametric mixtures and information-criterion-based component selection.

Abstract

Motivation: The Mapper algorithm is an essential tool for exploring the shape of data in topological data analysis. Given a dataset as input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of a Reeb graph of the data. The classic Mapper algorithm uses fixed interval lengths and overlapping ratios, which might fail to reveal subtle features of the data, especially when the underlying structure is complex. Results: In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, which utilizes the properties of a probability model and the intrinsic characteristics of the data to generate density-guided covers and provide enhanced topological features. Our proposed algorithm is a probabilistic model-based approach, which could serve as an alternative to non-probabilistic ones. Moreover, we introduce a metric that accounts for both the quality of overlapping clusters and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that D-Mapper outperforms the classic Mapper algorithm in various scenarios. We also apply D-Mapper to a SARS-CoV-2 coronavirus RNA sequence dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both vertical and horizontal evolution processes of the viruses. Availability: Our package is available at https://github.com/ShufeiGe/D-Mapper.
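For readers unfamiliar with the fixed cover that the abstract refers to, the following is a minimal sketch of one common way to build it. The function name `uniform_cover` and the convention that adjacent intervals overlap by a fraction `p` of their length are assumptions made for illustration; the paper's exact parameterization of the interval number $n$ and overlap ratio $p$ may differ.

```python
def uniform_cover(lo, hi, n, p):
    """Return n equal-length intervals covering [lo, hi].

    Assumed convention (illustrative only): adjacent intervals overlap
    by a fraction p of the common interval length.
    """
    if n < 1 or not 0 <= p < 1:
        raise ValueError("need n >= 1 and 0 <= p < 1")
    # n intervals minus (n - 1) overlaps must span the full range.
    length = (hi - lo) / (n - (n - 1) * p)
    step = length * (1 - p)
    return [(lo + i * step, lo + i * step + length) for i in range(n)]


# Four intervals on [0, 1] with 50% overlap between neighbours.
cover = uniform_cover(0.0, 1.0, n=4, p=0.5)
```

Every interval has the same length regardless of how the projected data are distributed, which is the rigidity that a density-guided cover is meant to relax.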

Paper Structure

This paper contains 15 sections, 1 theorem, 6 equations, 9 figures, 4 tables, 4 algorithms.

Key Result

Theorem 1

Let $\mathcal{U} = (u_i)_{i \in I}$ be an open cover of a paracompact space $X$ such that the intersection of any sub-collection of the $u_i$'s is either empty or contractible. Then $X$ and the nerve $C(\mathcal{U})$ are homotopy equivalent.
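To make the nerve construction concrete, here is a small sketch that builds the nerve of a finite cover given as sets of point indices. This is a discrete analogue used for illustration, not the topological statement itself: the function `nerve`, the labels, and the six-point "circle" are all hypothetical.

```python
from itertools import combinations


def nerve(cover, max_dim=2):
    """Return the simplices of the nerve of a finite cover.

    cover: dict mapping a label to an iterable of point ids.
    A set of k labels forms a simplex when the corresponding
    cover elements share at least one point.
    """
    labels = sorted(cover)
    simplices = []
    for k in range(1, max_dim + 2):
        for combo in combinations(labels, k):
            sets = [set(cover[c]) for c in combo]
            if set.intersection(*sets):
                simplices.append(combo)
    return simplices


# Six points 0..5 arranged on a circle, covered by three overlapping arcs.
arcs = {"A": {0, 1, 2}, "B": {2, 3, 4}, "C": {4, 5, 0}}
simplices = nerve(arcs)
```

In this example the nerve has three vertices and three edges but no filled triangle, because the three arcs have no common point. The result is a hollow triangle, which has the same loop structure as the circle being covered. The graph output by Mapper corresponds to the vertices and edges of such a nerve.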

Figures (9)

  • Figure 1: An illustration of intervals produced by the D-Mapper algorithm. The dark blue lines represent the probability density function of each component in the GMM. The light blue dashed line represents the probability density function of the GMM. The orange lines are the intervals that arise naturally at a confidence level of $1-\alpha$. (a) When $\alpha = 0.01$, there are overlaps between adjacent intervals. (b) When $\alpha = 0.1$, there is a gap between the first and second intervals. $\alpha$ controls the overlap of intervals, and it should be chosen carefully.
  • Figure 2: An example in which the classic Mapper algorithm outputs different graphs on the same dataset. The dataset is shown in Figure \ref{cirs_results} (a); the classic Mapper is used with DBSCAN as the clustering algorithm, with a radius of $0.5$ and a minimum of $3$ samples. (a) The output graph when $n=12, p=0.01$, with $SC = 0.283$ and $SC_{adj}=0.521$. This graph has a higher $SC$ but a poor topological structure. (b) The output graph when $n=12, p=0.1$, with $SC = 0.246$ and $SC_{adj}=0.812$. This graph has a lower $SC$ but a good topological structure.
  • Figure 3: An example of the extended persistence diagram. There are $8$ points in total in the diagram; the gray area is computed by the bottleneck bootstrap, and points inside this area are regarded as noise. Thus, $6$ points are noise, $2$ points are signal, and the $TSR$ is $2/8 = 0.25$.
  • Figure 4: Results of the classic Mapper and D-Mapper on the two disjoint circles. (a) The output graph of the classic Mapper with the largest $SC_{adj}$ (the $1$st row of Table \ref{tab1}): $n=12, p = 0.02$. (b) The output graph of the D-Mapper with the largest $SC_{adj}$ (the $2$nd row of Table \ref{tab1}): $n=12, \alpha = 0.127$. (c) An example produced by the classic Mapper with larger $SC$ but lower $TSR$ (the $3$rd row of Table \ref{tab1}): $n=12, p = 0.005$. (d) An example produced by the D-Mapper with larger $SC$ but lower $TSR$ (the $4$th row of Table \ref{tab1}): $n=12, \alpha = 0.159$.
  • Figure 5: Results of the classic Mapper and D-Mapper on the two intersecting circles. (a) The output graph of the classic Mapper with the largest $SC_{adj}$ (the $1$st row of Table \ref{tab2}): $n=8, p = 0.02$. (b) The output graph of the D-Mapper with the largest $SC_{adj}$ (the $2$nd row of Table \ref{tab2}): $n=8, \alpha = 0.088$. (c) An example produced by the classic Mapper with larger $SC$ but lower $TSR$ (the $3$rd row of Table \ref{tab2}): $n=8, p = 0.02$. (d) An example produced by the D-Mapper with larger $SC$ but lower $TSR$ (the $4$th row of Table \ref{tab2}): $n=8, \alpha = 0.12$.
  • ...and 4 more figures
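
The effect of $\alpha$ described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each interval is the central two-sided $1-\alpha$ interval of one Gaussian component, $\mu \pm z_{1-\alpha/2}\,\sigma$, and the component means and standard deviations are made-up values standing in for a GMM that has already been fitted to the projected data. The helper names `component_intervals` and `gaps` are hypothetical.

```python
from statistics import NormalDist


def component_intervals(means, stds, alpha):
    """One interval per Gaussian component: mean +/- z * std.

    Assumption: z is the standard normal quantile at 1 - alpha / 2,
    so each interval holds 1 - alpha of its component's mass.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return [(m - z * s, m + z * s) for m, s in sorted(zip(means, stds))]


def gaps(intervals):
    """Indices i where interval i ends before interval i + 1 starts."""
    return [i for i in range(len(intervals) - 1)
            if intervals[i][1] < intervals[i + 1][0]]


# Hypothetical fitted components (illustrative values only).
means, stds = [0.0, 4.5, 8.0], [1.0, 1.0, 1.5]

wide = component_intervals(means, stds, alpha=0.01)    # adjacent intervals overlap
narrow = component_intervals(means, stds, alpha=0.1)   # first two intervals separate
```

With these values, `gaps(wide)` is empty while `gaps(narrow)` reports a gap between the first and second intervals, mirroring panels (a) and (b) of Figure 1. A smaller $\alpha$ widens every interval and so increases overlap, which is why the caption advises choosing $\alpha$ carefully.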

Theorems & Definitions (7)

  • Definition 1: Simplex
  • Definition 2: Geometric simplicial complex
  • Definition 3: Abstract simplicial complex
  • Definition 4: Open cover
  • Theorem 1: Nerve theorem
  • Definition 5: Topological signal rate
  • Definition 6: Adjusted silhouette coefficient
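
The arithmetic behind the $TSR$ value quoted in the Figure 3 caption can be sketched as below. Definition 5 is not reproduced in this summary, so the rule used here is an assumption: a point of the extended persistence diagram counts as noise when its persistence $|d - b|$ does not exceed a threshold, with the threshold standing in for the band produced by the bottleneck bootstrap. The function name, the threshold value, and the diagram points are all hypothetical.

```python
def topological_signal_rate(diagram, threshold):
    """Fraction of diagram points classified as signal.

    diagram: list of (birth, death) pairs.
    threshold: assumed width of the noise band; a point is treated as
    signal when |death - birth| exceeds it. The absolute value is used
    because extended persistence points may lie below the diagonal.
    """
    if not diagram:
        return 0.0
    signal = sum(1 for b, d in diagram if abs(d - b) > threshold)
    return signal / len(diagram)


# Eight made-up points: two far from the diagonal, six close to it.
diagram = [(0.0, 2.0), (0.5, 3.0), (0.1, 0.2), (0.3, 0.35),
           (1.0, 1.1), (1.5, 1.45), (2.0, 2.2), (0.8, 0.9)]
tsr = topological_signal_rate(diagram, threshold=0.5)  # 2 signal points out of 8
```

As in Figure 3, two signal points out of eight give a $TSR$ of $0.25$.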