G-Mapper: Learning a Cover in the Mapper Construction

Enrique Alvarado, Robin Belton, Emily Fischer, Kang-Ju Lee, Sourabh Palande, Sarah Percival, Emilie Purvine

TL;DR

G-Mapper addresses the challenge of tuning the Mapper cover by adaptively splitting cover elements using a Gaussian mixture model guided by the Anderson–Darling normality test, avoiding the need for an initialized open cover. The method uses the statistic $A_*^2$ to decide whether to split an interval. When it splits, it uses the means $m_1, m_2$ and standard deviations $\sigma_1, \sigma_2$, together with the overlap parameter g_overlap, to form two overlapping intervals, producing data-faithful Mapper graphs with improved runtimes. Empirical results on synthetic and real-world datasets show that G-Mapper captures essential data structure and often outperforms Multipass BIC, F-Mapper, and balanced covers in both quality metrics (e.g., Silhouette) and speed, while remaining applicable to high-dimensional settings. The work provides open-source code and demonstrates that the interval counts it produces can serve as informative inputs to other Mapper variants.

Abstract

The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, the Mapper algorithm requires tuning several parameters in order to generate a "nice" Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by splitting a cover repeatedly according to a statistical test for normality. Our algorithm is based on G-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to carefully choose the cover according to the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers so that the Mapper graphs retain the essence of the datasets, while also running significantly faster than a previous iterative method.
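
The split decision described above can be illustrated with a minimal sketch of a normality check on the lens values that fall in one interval. This is not the authors' code: the function names `corrected_ad_statistic` and `should_split` are hypothetical, and the small-sample correction factor $1 + 4/n - 25/n^2$ is assumed from the G-means setting in which the mean and variance are estimated from the data. The threshold of 10 matches the AD threshold quoted in one of the figure captions.

```python
import math
import random
import statistics


def corrected_ad_statistic(values):
    """Anderson-Darling statistic of `values` against a fitted normal,
    multiplied by the (assumed) small-sample correction 1 + 4/n - 25/n^2."""
    n = len(values)
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    # Standardize and sort, as the AD statistic is defined on order statistics.
    z = sorted((v - mu) / sd for v in values)
    # Standard normal CDF, clipped away from 0 and 1 so the logarithms stay finite.
    cdf = [
        min(max(0.5 * (1.0 + math.erf(t / math.sqrt(2.0))), 1e-12), 1.0 - 1e-12)
        for t in z
    ]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    return a2 * (1.0 + 4.0 / n - 25.0 / n**2)


def should_split(values, ad_threshold):
    """Hypothetical helper: split an interval when its data look non-Gaussian."""
    return corrected_ad_statistic(values) > ad_threshold


# Illustrative data (not from the paper): one Gaussian blob and a two-mode mixture.
rng = random.Random(0)
unimodal = [rng.gauss(0.0, 1.0) for _ in range(1000)]
bimodal = [rng.gauss(-3.0, 1.0) for _ in range(500)] + [
    rng.gauss(3.0, 1.0) for _ in range(500)
]

print(should_split(unimodal, ad_threshold=10))
print(should_split(bimodal, ad_threshold=10))
```

In this sketch a larger statistic means stronger evidence against normality, so an interval whose data look bimodal is a candidate for splitting, while one whose data look Gaussian is left alone.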

Paper Structure (27 sections, 6 equations, 17 figures, 2 tables)


Figures (17)

  • Figure 1: Mapper Construction. The dataset $X$ consists of points sampled from a circle of radius $1/2$ centered at $(1/2,1/2)$. Constructing a Mapper graph requires selecting a lens function $f$ (Figure \ref{fig:filter}) and a cover $\{U_i\}$ (Figure \ref{fig:cover}), and applying a clustering algorithm to the pre-images $\{f^{-1}(U_i)\}$ (Figure \ref{fig:clustering}). The parameters are specified in Example \ref{ex:mapper}, and the generated Mapper graph is a cycle graph with four vertices and four edges (Figure \ref{fig:Mapper}).
  • Figure 1: The point $\bigstar$ divides the segment between the two means, $m_1$ and $m_2$, in the ratio $\sigma_1:\sigma_2$, where $\sigma_1$ and $\sigma_2$ are the standard deviations. The two intervals are created by extending the distance between $m_1$ and $\bigstar$, and between $m_2$ and $\bigstar$, by g_overlap.
  • Figure 1: Two Circles Dataset. G-Mapper Parameters: AD threshold = 10, g_overlap = 0.1, clustering algorithm = DBSCAN with $\varepsilon=0.1$ and MinPts = 5, and search method = DFS. The cover was found in 7 iterations and consists of 8 intervals. Reference Mapper Parameters: number of intervals = 7, overlap = 0.2, and the same DBSCAN parameters.
  • Figure 2: Anderson-Darling statistics and a Gaussian mixture model. Figure \ref{fig:1d_data}: the histogram of a $1$-dimensional dataset whose AD statistic is $53.85$. Figure \ref{fig:Two_subsets}: the GMM is applied to the dataset, and the AD statistics of the left and right sides are much smaller than the AD statistic of the entire dataset.
  • Figure 2: G-Mapper. The initialization and first two iterations of the splitting procedure are shown in Figure \ref{fig:iter_0}, Figure \ref{fig:iter_1}, and Figure \ref{fig:iter_2}, respectively. In each panel, the cover is on the lower left, the pre-images of the cover elements are on the upper left, and the corresponding Mapper graph is on the right. The final Mapper graph is a cycle graph with $4$ vertices.
  • ...and 12 more figures
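
The caption describing the point $\bigstar$ can be turned into a small worked computation. The sketch below is one reading of that caption, not the paper's reference implementation: it assumes "extending by g_overlap" means scaling the distance from each mean to $\bigstar$ by $(1 + \text{g\_overlap})$, and it assumes the new endpoints are clamped to the original interval $(a, b)$. The function name `split_interval` and the sample numbers are hypothetical.

```python
def split_interval(a, b, m1, m2, s1, s2, g_overlap):
    """Split (a, b) into two overlapping intervals from two fitted Gaussian
    components with means m1, m2 and standard deviations s1, s2.

    Assumption: the overlap is obtained by stretching the distance from each
    mean to the dividing point by a factor (1 + g_overlap)."""
    if m1 > m2:
        # Order the components so that m1 is the left mean.
        m1, m2, s1, s2 = m2, m1, s2, s1
    # Point dividing the segment [m1, m2] in the ratio s1 : s2.
    star = (s2 * m1 + s1 * m2) / (s1 + s2)
    # Extend past the dividing point on both sides, staying inside (a, b).
    left_upper = min(m1 + (1.0 + g_overlap) * (star - m1), b)
    right_lower = max(m2 - (1.0 + g_overlap) * (m2 - star), a)
    return (a, left_upper), (right_lower, b)


# Illustrative numbers only: a narrow left component and a wider right one.
left, right = split_interval(0.0, 1.0, m1=0.2, m2=0.8, s1=0.1, s2=0.2, g_overlap=0.1)
print(left, right)
```

With these inputs the dividing point is 0.4, closer to the narrower component, so the left interval ends at about 0.42 and the right interval starts at about 0.36. The two intervals overlap around $\bigstar$.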

Theorems & Definitions (3)

  • Example 2.1: Mapper Graph
  • Example 2.2: AD Statistic and GMM
  • Example 3.1: The G-Mapper Algorithm
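
The Mapper construction on the circle dataset (lens, cover, clustering of pre-images, then the nerve) can be sketched in a few lines. The parameters of Example 2.1 are not reproduced on this page, so everything specific here is an assumption chosen for illustration: 100 evenly spaced points rather than a random sample, the $x$-coordinate as lens, a hand-picked cover of three overlapping intervals, and connected components of an $\varepsilon$-neighborhood graph standing in for the clustering algorithm.

```python
import math
from itertools import combinations


def components(indices, points, eps):
    """Cluster `indices` as connected components of the eps-neighborhood graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = {q for q in remaining if math.dist(points[p], points[q]) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, lens, cover, eps):
    """Vertices are clusters of pre-images; edges join clusters sharing a point."""
    vertices = []
    for a, b in cover:
        preimage = [i for i, p in enumerate(points) if a < lens(p) < b]
        vertices.extend(components(preimage, points, eps))
    edges = [
        (u, v)
        for u, v in combinations(range(len(vertices)), 2)
        if vertices[u] & vertices[v]
    ]
    return vertices, edges


# Circle of radius 1/2 centered at (1/2, 1/2), sampled at evenly spaced angles.
n = 100
circle = [
    (0.5 + 0.5 * math.cos(2 * math.pi * k / n), 0.5 + 0.5 * math.sin(2 * math.pi * k / n))
    for k in range(n)
]
# Assumed cover of the lens range by three overlapping open intervals.
cover = [(-0.05, 0.4), (0.3, 0.7), (0.6, 1.05)]
vertices, edges = mapper_graph(circle, lambda p: p[0], cover, eps=0.1)
print(len(vertices), len(edges))
```

Under these assumptions the middle interval's pre-image splits into a top arc and a bottom arc, while each outer interval gives a single arc, so the graph is a cycle with four vertices and four edges, matching the shape described in the Mapper Construction caption.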