Too Many or Too Few? Sampling Bounds for Topological Descriptors
Brittany Terese Fasy, Maksym Makarchuk, Samuel Micka, David L. Millman
TL;DR
This work investigates how many directional topological descriptors are needed to faithfully represent geometric simplicial complexes via the persistent homology transform (PHT) and the Euler characteristic transform (ECT). It develops lower bounds showing that some shapes inherently require a number of descriptors that is linear in the number of vertices, and it analyzes practical discretization strategies through experiments on synthetic and real-world data. The paper combines constructive proofs with empirical studies to illustrate the trade-offs between oversampling (too many directions) and undersampling (too few directions), including a scenario in which distinct shapes become indistinguishable under a fixed set of descriptors. The findings highlight that, although theory calls for dense sampling to guarantee faithfulness, small, carefully chosen direction sets often suffice in practice, which bears on the efficiency and reliability of topological data analysis pipelines. $PHT$, $ECT$, $K$, $LLS$, $s$, $d$, and $\Delta$-based discretizations are central to the framework and results.
Abstract
Topological descriptors, such as the Euler characteristic function and the persistence diagram, have grown increasingly popular for representing complex data. Recent work showed that a carefully chosen set of these descriptors encodes all of the geometric and topological information about a shape in $\mathbb{R}^d$. In practice, $\varepsilon$-nets are often used to find samples in one of two extremes. On one hand, making strong geometric assumptions about the shape allows us to choose $\varepsilon$ small enough (corresponding to a high enough density sample) in order to guarantee a faithful representation, resulting in oversampling. On the other hand, if we choose a larger $\varepsilon$ in order to allow faster computations, this leads to an incomplete description of the shape and a discretized transform that lacks theoretical guarantees. In this work, we investigate how many directions are really needed to represent geometric simplicial complexes, exploring both synthetic and real-world datasets. We provide constructive proofs that help establish size bounds and an experimental investigation giving insights into the consequences of over- and undersampling.
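To make the notion of a discretized directional transform concrete, the following is a minimal sketch, not the authors' implementation. It assumes a lower-star filtration in which each simplex enters at the largest height of its vertices in a given direction, and it uses evenly spaced directions on the circle as a simple stand-in for an epsilon net. The names `directions`, `euler_curve`, `VERTS`, `FILLED`, and `HOLLOW` are hypothetical and chosen for illustration only.

```python
import math

def directions(k):
    """Return k evenly spaced unit vectors on the circle (assumed sampling scheme)."""
    return [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(k)]

def euler_curve(vertices, simplices, s, thresholds):
    """Euler characteristic of the sublevel complex in direction s at each threshold.

    vertices  : list of coordinate tuples
    simplices : list of tuples of vertex indices (every face listed explicitly)
    s         : direction vector
    thresholds: heights at which to evaluate the curve
    """
    def height(v):
        return sum(c * d for c, d in zip(vertices[v], s))

    # Lower-star rule: a simplex appears once its highest vertex has appeared.
    entry = [max(height(v) for v in sx) for sx in simplices]
    # A simplex on n vertices has dimension n - 1 and contributes (-1)^(n - 1).
    return [sum((-1) ** (len(sx) - 1) for sx, h in zip(simplices, entry) if h <= t)
            for t in thresholds]

# A triangle in the plane, once filled and once hollow.
VERTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
HOLLOW = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
FILLED = HOLLOW + [(0, 1, 2)]

if __name__ == "__main__":
    ts = [-1.0, 0.0, 1.0]
    for s in directions(4):
        print(s, euler_curve(VERTS, FILLED, s, ts), euler_curve(VERTS, HOLLOW, s, ts))
```

In this toy case the filled and hollow triangles share the same vertices and edges, so their curves agree until the 2-simplex enters. A threshold grid that stops short of that height would not tell them apart, which is a small instance of the undersampling concern described above.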
