Too Many or Too Few? Sampling Bounds for Topological Descriptors

Brittany Terese Fasy, Maksym Makarchuk, Samuel Micka, David L. Millman

TL;DR

This work investigates how many directional topological descriptors are needed to faithfully represent geometric simplicial complexes via transforms such as the persistent homology transform ($PHT$) and the Euler characteristic transform ($ECT$). It develops lower bounds showing that some shapes inherently require a number of descriptors that is linear in the number of vertices, and it analyzes practical discretization strategies through experiments on synthetic and real-world data. The paper combines constructive proofs with empirical studies to illustrate the trade-offs between oversampling (too many directions) and undersampling (too few directions), including a formal loss scenario in which different shapes become indistinguishable under a fixed descriptor set. The findings indicate that, although theory calls for large direction sets to guarantee faithfulness, small, carefully chosen direction sets often suffice in practice, which bears on the efficiency and reliability of topological data analysis pipelines. $PHT$, $ECT$, $K$, $LLS$, $s$, $d$, and $\Delta$-based discretizations are central to the framework and results.

Abstract

Topological descriptors, such as the Euler characteristic function and the persistence diagram, have grown increasingly popular for representing complex data. Recent work showed that a carefully chosen set of these descriptors encodes all of the geometric and topological information about a shape in R^d. In practice, epsilon nets are often used to find samples in one of two extremes. On one hand, making strong geometric assumptions about the shape allows us to choose epsilon small enough (corresponding to a high enough density sample) in order to guarantee a faithful representation, resulting in oversampling. On the other hand, if we choose a larger epsilon in order to allow faster computations, this leads to an incomplete description of the shape and a discretized transform that lacks theoretical guarantees. In this work, we investigate how many directions are really needed to represent geometric simplicial complexes, exploring both synthetic and real-world datasets. We provide constructive proofs that help establish size bounds and an experimental investigation giving insights into the consequences of over- and undersampling.

Paper Structure

This paper contains 17 sections, 6 theorems, 13 equations, 8 figures.

Key Result

Corollary 7

Let $K$ be a geometric simplicial complex in $\mathbb{R}^{d+1}$. The diameter of the smallest $d$-dimensional stratum in the coarse stratification of $K$ is equal to the smallest angle between vectors with endpoints in the vertex set. Furthermore, if $\theta := \min\{ \angle u,v,w ~|~ uvw \in K_0\}$
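The quantity $\theta$ in the statement above can be computed by brute force for a small vertex set. The following is a minimal sketch, not the authors' implementation. It assumes that $\angle u,v,w$ denotes the angle at the middle vertex $v$ between the vectors $u - v$ and $w - v$, and that the vertices are distinct. The function names `angle_at` and `smallest_vertex_angle` are hypothetical and used only for illustration.

```python
import math
from itertools import combinations


def angle_at(v, u, w):
    """Angle (radians) at vertex v between the vectors u - v and w - v.

    Assumes u, v, w are distinct points given as coordinate tuples.
    """
    a = [ui - vi for ui, vi in zip(u, v)]
    b = [wi - vi for wi, vi in zip(w, v)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def smallest_vertex_angle(vertices):
    """Minimum of angle_at(v, u, w) over all vertex triples (assumed reading of theta)."""
    best = math.pi
    for i, v in enumerate(vertices):
        others = vertices[:i] + vertices[i + 1:]
        for u, w in combinations(others, 2):
            best = min(best, angle_at(v, u, w))
    return best


# Illustrative input: a right isosceles triangle in the plane.
example_vertices = [(0, 0), (1, 0), (0, 1)]
theta = smallest_vertex_angle(example_vertices)
```

The loop visits every triple of vertices, so the cost grows cubically with the number of vertices; this is adequate for illustration only.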

Figures (8)

  • Figure 1: Example of a simplicial complex with three vertices, three edges, and one triangle. In the $y$-direction, the corresponding filtration sees three distinct simplicial complexes: $\{v_1\} \subset \{v_1,v_2, [v_1,v_2]\} \subset \{v_1,v_2,v_3,[v_1,v_2],[v_1,v_3],[v_2,v_3],[v_1,v_2,v_3]\}$. However, the lower-star filtration in the $y$-direction only sees one topological change, when a new connected component is introduced at the height of $v_1$.
  • Figure 2: A simplicial complex with $15$ vertices. For any fundamental descriptor type, at least five topological descriptors are needed to uniquely represent this simplicial complex; in general, there exist configurations of $n$ simplices that require $\Omega (n_0)$ topological descriptors for a faithful representation.
  • Figure 3: Log-log plot of the smallest stratum size versus the number of vertices for datasets RANDPTS, EMNIST$_{.001}$, and MPEG7$_{.001}$.
  • Figure 4: Log-log plot of the smallest stratum size versus the number of vertices for dataset EMNIST$_{.005}$.
  • Figure 5: Plot of the ratio of hit strata to the total number of strata versus the number of vertices for RANDPTS, EMNIST$_{.001}$, and MPEG7$_{.001}$.
  • ...and 3 more figures
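The lower-star filtration described in the Figure 1 caption can be illustrated with a small sketch of the Euler characteristic function in one direction. This is a hedged example, not code from the paper. The vertex coordinates are assumptions chosen so that $v_1$, $v_2$, $v_3$ appear in that order in the $y$-direction, and the names `simplex_height` and `euler_characteristic_function` are hypothetical.

```python
def simplex_height(simplex, coords, direction):
    """Lower-star height of a simplex: the largest vertex height in the given direction."""
    return max(
        sum(c * d for c, d in zip(coords[v], direction)) for v in simplex
    )


def euler_characteristic_function(simplices, coords, direction):
    """Return (height, Euler characteristic) pairs at each height where simplices enter."""
    changes = {}
    for simplex in simplices:
        h = simplex_height(simplex, coords, direction)
        # A k-dimensional simplex (k + 1 vertices) contributes (-1)^k.
        changes[h] = changes.get(h, 0) + (-1) ** (len(simplex) - 1)
    chi = 0
    result = []
    for h in sorted(changes):
        chi += changes[h]
        result.append((h, chi))
    return result


# Assumed coordinates for the three vertices of the filled triangle.
coords = {"v1": (0, 0), "v2": (2, 1), "v3": (1, 2)}
simplices = [
    ("v1",), ("v2",), ("v3",),
    ("v1", "v2"), ("v1", "v3"), ("v2", "v3"),
    ("v1", "v2", "v3"),
]
ecf_y = euler_characteristic_function(simplices, coords, (0, 1))
```

Under these assumed coordinates, the Euler characteristic becomes 1 at the height of $v_1$ and stays 1 at the heights of $v_2$ and $v_3$, which matches the caption's observation that only one topological change is seen in the $y$-direction.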

Theorems & Definitions (15)

  • Definition 2: Discrete Topological Transform
  • Definition 3: Observable
  • Definition 4: $\theta$-Observable
  • Definition 5: Observing Region
  • Definition 6: Coarse Stratification
  • Corollary 7: Relation Between $K$ and $\theta$
  • Lemma 8: Sufficient Faithful Set
  • Theorem 9: Sufficient Comparison Set
  • proof
  • Lemma 10: Observing Regions Witness Local Max and Min
  • ...and 5 more