A Mapper Algorithm with implicit intervals and its optimization

Yuyang Tao, Shufei Ge

TL;DR

This work addresses limitations of the Mapper algorithm related to fixed interval covers and parameter tuning by introducing a probabilistic Soft Mapper with implicit intervals defined by a hidden assignment matrix. It uses a Gaussian Mixture Model to derive a row-wise assignment probability $Q$ and samples a Mapper graph via a multinomial scheme, while also defining a Mapper graph mode for a robust point estimate. A persistence-informed topological loss is combined with negative log-likelihood and optimized with stochastic gradient descent, yielding graphs that better capture underlying topology in noisy data. The method demonstrates competitive or improved topological fidelity on synthetic datasets and identifies a distinct Alzheimer's-related subgroup in an MSBB RNA-expression dataset, highlighting practical utility in biomedical topology analysis.

Abstract

The Mapper algorithm is an essential tool for visualizing complex, high-dimensional data in topological data analysis (TDA) and has been widely used in biomedical research. It outputs a combinatorial graph whose structure reflects the shape of the data. However, the need for manual parameter tuning, together with fixed intervals and fixed overlap ratios, may impede the performance of the standard Mapper algorithm. Variants of the standard Mapper algorithm have been developed to address these limitations, yet most of them still require manual tuning of parameters. Additionally, many of these variants, like the standard version found in the literature, were built within a deterministic framework and overlook the uncertainty inherent in the data. To relax these limitations, we introduce a novel framework that implicitly represents intervals through a hidden assignment matrix, enabling automatic parameter optimization via stochastic gradient descent. Specifically, we develop a soft Mapper framework based on a Gaussian mixture model (GMM) for flexible and implicit interval construction. We further illustrate the robustness of the soft Mapper algorithm by introducing the Mapper graph mode as a point estimate of the output graph. Moreover, a stochastic gradient descent algorithm with a specific topological loss function is proposed for optimizing the model parameters. Both simulation and application studies demonstrate its effectiveness in capturing the underlying topological structures. In addition, the application to an RNA expression dataset obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB) successfully identifies a distinct subgroup of Alzheimer's Disease.
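To make the idea of implicit intervals concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-dimensional filter value per data point and a GMM whose parameters are already given; the function names, the toy parameters, and the number of draws `m` per point are all hypothetical choices made for illustration. Each row of `Q` holds the probabilities that a point belongs to each mixture component (playing the role of an interval), and a hidden assignment matrix `H` is sampled from those rows.

```python
import math
import random

def gmm_responsibilities(values, means, sds, weights):
    """Row-wise assignment probabilities Q for 1-D values under a K-component GMM."""
    Q = []
    for x in values:
        # Weighted Gaussian density of x under each component.
        dens = [
            w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
            for mu, sd, w in zip(means, sds, weights)
        ]
        total = sum(dens)
        Q.append([d / total for d in dens])  # normalize so each row sums to 1
    return Q

def sample_assignment(Q, m, rng):
    """Draw m component labels per point (an assumed scheme).

    H[i][k] is 1 if point i was assigned to component k at least once.
    """
    K = len(Q[0])
    H = []
    for row in Q:
        draws = rng.choices(range(K), weights=row, k=m)
        H.append([1 if k in draws else 0 for k in range(K)])
    return H

# Toy filter values and hypothetical GMM parameters (three components).
values = [0.0, 0.1, 0.5, 0.9, 1.0]
Q = gmm_responsibilities(
    values, means=[0.0, 0.5, 1.0], sds=[0.2, 0.2, 0.2], weights=[1 / 3, 1 / 3, 1 / 3]
)
H = sample_assignment(Q, m=2, rng=random.Random(0))
```

Points whose row of `H` contains more than one nonzero entry fall into several implicit intervals at once, which is the role overlap plays in the standard Mapper cover. In this sketch the GMM parameters are fixed; in the framework described above they are the quantities tuned by stochastic gradient descent.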

Paper Structure

This paper contains 16 sections, 24 equations, 18 figures, 3 tables, 1 algorithm.

Figures (18)

  • Figure 1: A demonstration of the Mapper algorithm applied to a dataset with a cross structure. (a) A visualization of the dataset. (b) The projected data and its overlapping intervals when $K=6$, $p=0.33$. (c) The output graph of the Mapper algorithm. The clustering algorithm used here is DBSCAN with $\epsilon = 0.6$ and $\mathrm{minPts} = 5$. The output Mapper graph has a cross shape, which is consistent with the shape of the dataset.
  • Figure 2: Comparison of the standard Mapper algorithm, the D-Mapper algorithm and our proposed algorithm on a two disjoint circles dataset and a two intersecting circles dataset. (a, f) A visualization of the datasets. Coloured lines represent the (implicit) intervals produced by each algorithm. (b, g) The output graphs of the standard Mapper algorithm with $K=6, p=0.33$ for the two disjoint circles and $K=5, p=0.2$ for the two intersecting circles. (c, h) The output graphs of the D-Mapper algorithm with $\alpha = 0.08$ for the disjoint circles and $\alpha = 0.31$ for the intersecting circles. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.
  • Figure 3: Comparison of the standard Mapper algorithm, the D-Mapper algorithm and our proposed algorithm on a two unequal-sized disjoint circles dataset and a two unequal-sized intersecting circles dataset. (a, f) A visualization of the datasets. Coloured lines represent the (implicit) intervals produced by each algorithm. (b, g) The output graphs of the standard Mapper algorithm with $K=6, p=0.3$ for the disjoint circles and $K=5, p=0.2$ for the intersecting circles. (c, h) The output graphs of the D-Mapper algorithm with $\alpha = 0.06$ for the disjoint circles and $\alpha = 0.05$ for the intersecting circles. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.
  • Figure 4: Comparison of the standard Mapper algorithm, the D-Mapper algorithm and our proposed algorithm on a two intersecting circles dataset with noise. (a, f) A visualization of the datasets. Coloured lines represent the (implicit) intervals produced by each algorithm. (b, g) The output graphs of the standard Mapper algorithm with $K=6, p=0.4$ for the small-noise dataset and $K=6, p=0.4$ for the large-noise dataset. (c, h) The output graphs of the D-Mapper algorithm with $\alpha = 0.025$ for the small-noise dataset and $\alpha = 0.07$ for the large-noise dataset. (d, i) The output Mapper graph mode of our proposed method without optimization. (e, j) The output Mapper graph mode of our proposed method with optimization.
  • Figure 5: Comparison of the standard Mapper algorithm, the D-Mapper algorithm and our proposed algorithm on a 3D human dataset. (a) A visualization of the 3D human dataset. (b) The projected data and the resulting (implicit) intervals of each algorithm. (c) The output graph of the standard Mapper algorithm with $8$ intervals and an overlap rate of $0.1$. (d) The output graph of the D-Mapper algorithm with $\alpha = 0.071$. (e) The Mapper graph mode without optimization. (f) The Mapper graph mode with optimization.
  • ...and 13 more figures
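
The captions above parameterize the standard Mapper cover by the number of intervals $K$ and the overlap ratio $p$. As a rough illustration of what those two numbers control, the sketch below builds $K$ equal-length intervals over a range so that consecutive intervals overlap by a fraction $p$ of their length. This is one common convention and is an assumption here; the paper's exact construction may differ, and the function names are hypothetical.

```python
def overlapping_intervals(lo, hi, K, p):
    """K equal-length intervals covering [lo, hi]; neighbours overlap by fraction p of their length."""
    # K intervals of length L with (K - 1) overlaps of size p * L must span hi - lo.
    length = (hi - lo) / (K - (K - 1) * p)
    step = length * (1 - p)
    intervals = [(lo + k * step, lo + k * step + length) for k in range(K)]
    # Pin the right end to hi to guard against floating-point drift.
    intervals[-1] = (intervals[-1][0], hi)
    return intervals

def interval_membership(values, intervals):
    """For each value, list the indices of the intervals that contain it."""
    return [[k for k, (a, b) in enumerate(intervals) if a <= x <= b] for x in values]

# Parameters taken from the Figure 1 caption; the data range [0, 1] is a toy assumption.
intervals = overlapping_intervals(0.0, 1.0, K=6, p=0.33)
members = interval_membership([0.0, 0.17, 0.5, 1.0], intervals)
```

Values that land in an overlap region belong to two intervals, and such shared points are what create edges between clusters in the Mapper graph. With these fixed intervals, $K$ and $p$ must be chosen by hand, which is the tuning burden the implicit-interval approach aims to remove.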

Theorems & Definitions (5)

  • Definition 1: Mapper function
  • Definition 2: Soft Mapper with a Bernoulli distribution
  • Definition 3: GMM Soft Mapper
  • Definition 4: Soft Mapper with a multinomial distribution
  • Definition 5: The mode of a soft Mapper with a multinomial distribution