A distribution-guided Mapper algorithm
Yuyang Tao, Shufei Ge
TL;DR
This work addresses the limitation of fixed interval covers in Mapper by introducing D-Mapper, a density-guided variant that fits a mixture model to the projected data and uses the $1-\alpha$ confidence intervals of the fitted components as flexible, overlapping cover intervals. It couples this with a novel evaluation framework that combines clustering quality and topological fidelity through $SC$, $TSR$, and the adjusted metric $SC_{adj}$, supported by bottleneck bootstrap confidence sets. Across synthetic examples and a SARS-CoV-2 RNA dataset, D-Mapper consistently improves the topology-aware clustering measure $SC_{adj}$ while maintaining or improving topology preservation, and it captures both vertical and horizontal evolutionary processes in the viral data. The approach demonstrates the value of integrating probabilistic density modeling with Mapper to reveal richer data shapes, and it points to future extensions such as nonparametric mixtures and information-criterion-based component selection.
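The interval construction summarized above can be illustrated with a minimal sketch. It assumes a Gaussian mixture has already been fitted to the one-dimensional projected data and that each cover interval is the central $1-\alpha$ interval of one component. The function names `density_guided_cover` and `adjacent_overlaps` and the component parameters are hypothetical and are not taken from the D-Mapper package.

```python
from statistics import NormalDist


def density_guided_cover(components, alpha=0.05):
    """Build one cover interval per mixture component.

    `components` is a list of (mean, sd) pairs, assumed to come from a
    Gaussian mixture fitted to the projected data (the fit is not shown).
    Each interval spans the alpha/2 and 1 - alpha/2 quantiles of its component.
    """
    intervals = []
    for mean, sd in components:
        dist = NormalDist(mu=mean, sigma=sd)
        lower = dist.inv_cdf(alpha / 2)
        upper = dist.inv_cdf(1 - alpha / 2)
        intervals.append((lower, upper))
    # Sort by lower endpoint so neighbouring intervals are adjacent in the list.
    return sorted(intervals)


def adjacent_overlaps(intervals):
    """Report, for each pair of neighbouring intervals, whether they overlap."""
    return [left[1] > right[0] for left, right in zip(intervals, intervals[1:])]


# Hypothetical fitted components (mean, sd); illustrative values only.
components = [(0.0, 1.0), (2.5, 0.8), (6.0, 1.5)]
cover = density_guided_cover(components, alpha=0.05)
overlap_flags = adjacent_overlaps(cover)
```

In this sketch the interval lengths follow the spread of each component rather than a fixed width, and a smaller `alpha` widens every interval and so increases the overlap between neighbours.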
Abstract
Motivation: The Mapper algorithm is an essential tool for exploring the shape of data in topological data analysis. Given a dataset as input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of the Reeb graph of the data. The classic Mapper algorithm uses fixed interval lengths and overlap ratios, which might fail to reveal subtle features of the data, especially when the underlying structure is complex. Results: In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, which uses the properties of a probability model and the intrinsic characteristics of the data to generate density-guided covers and provide enhanced topological features. Our proposed algorithm is a probabilistic model-based approach, which could serve as an alternative to non-probabilistic ones. Moreover, we introduce a metric that accounts for both the quality of overlapping clusters and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that D-Mapper outperforms the classical Mapper algorithm in various scenarios. We also apply D-Mapper to a SARS-CoV-2 coronavirus RNA sequence dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both vertical and horizontal evolution processes of the viruses. Availability: Our package is available at https://github.com/ShufeiGe/D-Mapper.
