Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz

TL;DR

We define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. It allows us to understand whether a subset "represents well" the clusters of the larger dataset, and to estimate bounds for the Hausdorff distance between the subset and the complete dataset.

Abstract

Data quality is crucial for the successful training, generalization, and performance of machine learning models. We propose to measure the quality of a subset with respect to the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology, and we provide an algorithm to compute it using minimum spanning trees. The invariant allows us to understand whether the subset "represents well" the clusters of the larger dataset, and we use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
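As a point of reference for the Hausdorff distance mentioned above, the following is a minimal brute-force sketch, not the paper's method. It assumes finite point sets in Euclidean space and relies on the fact that, when $X \subseteq Z$, the Hausdorff distance reduces to the largest distance from a point of $Z$ to its nearest point of $X$. The function name `hausdorff_subset` and the toy data are hypothetical.

```python
import math

def hausdorff_subset(X, Z):
    """Hausdorff distance between a finite set Z and a subset X of Z.

    Because X is contained in Z, every point of X is at distance 0 from Z,
    so only the direction from Z to X contributes.
    """
    return max(min(math.dist(z, x) for x in X) for z in Z)

# Toy data (an assumption for illustration): Z has two clusters, X covers one.
Z = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0)]
X = [(0.0, 0.0), (1.0, 0.0)]

d = hausdorff_subset(X, Z)
print(d)
```

An exact value computed this way can serve as a baseline against which bounds obtained from the persistence matching diagram are compared on small examples.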

Paper Structure

This paper contains 24 sections, 6 theorems, 28 equations, 7 figures, 3 tables.

Key Result

Proposition 3.2

For fixed $r \geq 0$, there are $\sum_{r < b} m^{D}((\infty, b))$ components in $\mathop{\mathrm{VR}}\nolimits_r(Z)$ that contain no points from $X$. In particular, every component of $\mathop{\mathrm{VR}}\nolimits_r(Z)$ contains points from $X$ for all $r \geq \eta_f$. Moreover, if $\eta_f = 0$, then $X = Z$.
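The quantity counted in the proposition can be checked directly on small examples. The sketch below does not compute the matching diagram $D$; it counts by brute force, with a union-find structure, the connected components of the neighbourhood graph on $Z$ that contain no point of $X$. It assumes that two points are joined in $\mathop{\mathrm{VR}}\nolimits_r(Z)$ when their Euclidean distance is at most $r$ (the paper's scale convention may differ), and the function name `components_without_subset` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def components_without_subset(Z, X_idx, r):
    """Count components of the r-neighbourhood graph on Z with no point of X.

    Z is a list of points; X_idx lists the indices of Z that belong to X.
    Assumption: an edge joins two points when their distance is <= r.
    """
    parent = list(range(len(Z)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points within distance r.
    for i, j in combinations(range(len(Z)), 2):
        if math.dist(Z[i], Z[j]) <= r:
            parent[find(i)] = find(j)

    all_roots = {find(i) for i in range(len(Z))}
    roots_hit_by_X = {find(i) for i in X_idx}
    return len(all_roots - roots_hit_by_X)

# Toy data (an assumption for illustration): X is the first two points of Z.
Z = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0)]
X_idx = [0, 1]

counts = {r: components_without_subset(Z, X_idx, r) for r in (0.5, 1.0, 4.5, 5.0)}
print(counts)
```

In this toy example the count decreases as $r$ grows and reaches zero once every component of the graph contains a point of $X$, which is the behaviour the proposition describes for $r \geq \eta_f$.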

Figures (7)

  • Figure 1: Vietoris-Rips filtration of a set $Z$ (top row) and a subset $X$ (bottom row) with connected components indexed by non-negative integers.
  • Figure 2: Merge tree representation for $X$ and $Z$. Triplets $(z_j,b_j,z_i)$ are plotted as a blue horizontal segment labeled by $j$ on the left and $i$ on the right, followed by a red vertical segment that connects to the blue horizontal segments labeled on the right by $i$. In addition, on top of the representation, we plot a horizontal blue line corresponding to the component $[z_0]$, which never dies. The blue horizontal intervals make up the barcodes of $\mathop{\mathrm{PH}}\nolimits_0(X)$ (merge tree representation on the left) and $\mathop{\mathrm{PH}}\nolimits_0(Z)$ (merge tree representation on the right).
  • Figure 3: Depiction of the set $S^{D}$ associated to the matching diagram $D$ detailed in Example \ref{ex:diagram-matching}.
  • Figure 4: On the left (resp. on the right): Depiction of the matching diagram $D({\cal H})$ (resp. $D({\cal B})$) associated to the housing dataset $Z^{\cal H}$ (resp. $Z^{\cal B}$) and a random subset $X^{{\cal H}}$ (resp. $X^{{\cal B}}$). The axes are scaled differently for each dataset.
  • Figure 5: Housing dataset. Representation of the matching diagram $D({\cal H}(i))$ for Class $i$ for $i\in [\![3]\!]$. The axes are scaled differently for each class.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Example 2.1
  • Example 2.2
  • Example 2.3
  • Example 2.4
  • Remark 1
  • Example 2.5
  • Remark 2
  • Example 2.6
  • Example 3.1
  • Remark 3
  • ...and 12 more