Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas; Eduardo Paluzo-Hidalgo; Rocio Gonzalez-Diaz

TL;DR

We define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. It indicates whether a subset "represents well" the clusters of the larger dataset, and it yields bounds for the Hausdorff distance between the subset and the complete dataset.

Abstract

Data quality is crucial for the successful training, generalization and performance of machine learning models. We propose to measure the quality of a subset with respect to the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology, and we provide an algorithm to compute it using minimum spanning trees. The invariant indicates whether the subset "represents well" the clusters of the larger dataset, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why a chosen subset is likely to result in poor performance of a supervised learning model.
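The two basic quantities named in the abstract can be illustrated with a minimal sketch. It does not compute the persistence matching diagram itself; it only computes the Hausdorff distance between a subset and its dataset, and the finite death values of 0-dimensional persistent homology obtained from a minimum spanning tree (Kruskal's algorithm). The function names `hausdorff_subset` and `h0_deaths`, the toy point sets, and the convention that an edge enters the filtration at a value equal to its length are assumptions made for illustration.

```python
import math
from itertools import combinations


def hausdorff_subset(X, Z):
    """Hausdorff distance between Z and a subset X of Z.

    Since X is contained in Z, every point of X is at distance 0 from Z,
    so only the directed distance from Z to X contributes.
    """
    return max(min(math.dist(z, x) for x in X) for z in Z)


def h0_deaths(points):
    """Finite death values of the 0-dimensional barcode, via Kruskal's MST.

    Assumption: an edge appears in the filtration at r equal to its length.
    Each MST edge merges two components, so its length is a death value.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths


# Hypothetical toy data: two clusters in Z, only one of them sampled by X.
Z = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
X = [(0.0, 0.0), (1.0, 0.0)]

print(hausdorff_subset(X, Z))  # 5.0
print(h0_deaths(Z))            # [1.0, 1.0, 4.0]
print(h0_deaths(X))            # [1.0]
```

In this toy case the subset misses the second cluster entirely, which shows up both as a large Hausdorff distance and as a death value of $Z$ (the merge at 4.0) that has no counterpart in the barcode of $X$.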

Paper Structure (24 sections, 6 theorems, 28 equations, 7 figures, 3 tables)

Introduction
Block functions between barcodes induced by inclusion maps
Background
Finite metric spaces and the Hausdorff distance
The 1-skeleton of the Vietoris-Rips filtration, $\mathop{\mathrm{VR}}\nolimits(Z)$
The 0-dimensional homology group of $\mathop{\mathrm{VR}}\nolimits_r(Z)$
The 0-dimensional persistent homology of $\mathop{\mathrm{VR}}\nolimits(Z)$ and merge trees
Barcodes, persistence diagrams and multisets
From inclusion maps to block functions: the induced block function $\mathcal{M}_{ X}^{ Z}$
Matching diagrams for topological quality of subsets
Interpreting $D(X,Z)$ to study the relation between $X$ and $Z$
Hausdorff distance bounds from matching diagrams
Methodology and applications
ML preliminaries
Experiments
...and 9 more sections

Key Result

Proposition 3.2

For a fixed $r\geq 0$, there are $\sum_{r < b} m^{D}((\infty, b))$ components in $\mathop{\mathrm{VR}}\nolimits_r(Z)$ that contain no points from $X$. In particular, all components of $\mathop{\mathrm{VR}}\nolimits_r(Z)$ contain points from $X$ for all $r \geq \eta_f$. Moreover, if $\eta_f=0$ then $X=Z$.
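The quantity counted in the proposition can also be checked by brute force on small examples. The sketch below does not use the matching diagram $D$; it builds the components of the 1-skeleton at scale $r$ directly with a union-find structure and counts those containing no point of the subset. The function name `components_missing_subset`, the toy data, and the convention that two points are joined when their distance is at most $r$ are assumptions made for illustration.

```python
import math
from itertools import combinations


def components_missing_subset(Z, subset_idx, r):
    """Count connected components at scale r that contain no subset point.

    Z is a list of points, subset_idx the indices of the points of Z that
    belong to X. Assumption: points are joined when their distance is <= r.
    """
    n = len(Z)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if math.dist(Z[i], Z[j]) <= r:
            parent[find(i)] = find(j)

    all_roots = {find(i) for i in range(n)}
    roots_hit_by_subset = {find(i) for i in subset_idx}
    return len(all_roots - roots_hit_by_subset)


# Hypothetical toy data: X consists of the first two points of Z.
Z = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
X_idx = [0, 1]

for r in (0.0, 1.0, 4.0):
    print(r, components_missing_subset(Z, X_idx, r))
# 0.0 2
# 1.0 1
# 4.0 0
```

On this example the count drops to zero once the two clusters merge, in line with the statement that every component contains points from $X$ for sufficiently large $r$.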

Figures (7)

Figure 1: Vietoris-Rips filtration of a set $Z$ (top row) and a subset $X$ (bottom row) with connected components indexed by non-negative integers.
Figure 2: Merge tree representation for $X$ and $Z$. Triplets $(z_j,b_j,z_i)$ are plotted as a blue horizontal segment labeled by $j$ on the left and $i$ on the right, followed by a red vertical segment that connects to the blue horizontal segments labeled on the right by $i$. In addition, on top of the representation, we plot a horizontal blue line corresponding to the component $[z_0]$, which never dies. The blue horizontal intervals make up the barcodes of $\mathop{\mathrm{PH}}\nolimits_0(X)$ (merge tree representation on the left) and $\mathop{\mathrm{PH}}\nolimits_0(Z)$ (merge tree representation on the right).
Figure 3: Depiction of the set $S^{ D}$ associated to the matching diagram $D$ detailed in Example \ref{ex:diagram-matching}.
Figure 4: On the left (resp. on the right): Depiction of the matching diagram $D({\cal H})$ (resp. $D({\cal B})$) associated to the housing dataset $Z^{\cal H}$ (resp. $Z^{\cal B}$) and a random subset $X^{ {\cal H}}$ (resp. $X^{ {\cal B}}$). The axes are scaled differently for each dataset.
Figure 5: Housing dataset. Representation of the matching diagram $D({\cal H}(i))$ for Class $i$ for $i\in [\![3]\!]$. The axes are scaled differently for each class.
...and 2 more figures

Theorems & Definitions (22)

Example 2.1
Example 2.2
Example 2.3
Example 2.4
Remark 1
Example 2.5
Remark 2
Example 2.6
Example 3.1
Remark 3
...and 12 more
