Statistical learning on measures: an application to persistence diagrams
Olympio Hacquard, Gilles Blanchard, Clément Levrard
TL;DR
The paper develops a theory for learning where inputs are measures on a compact space, deriving complexity and generalization bounds that connect measure-based classifiers to base classifiers on $\mathcal{X}$ and covering/Rademacher notions. It treats both discrete ($\mathcal{M}_m(\mathcal{X})$) and generic measure inputs, providing MIL-inspired VC bounds and Lipschitz-loss contraction bounds, and proposes practical rectangle-based algorithms with a learnable aggregation. The primary application is persistence diagrams from Topological Data Analysis, where the authors prove rectangle-discriminability results and analyze the asymptotic convergence of rescaled diagrams to limiting measures, enabling consistent classification under density and sampling changes. Empirically, the method—especially with boosting—achieves competitive accuracy on PD tasks and other datasets (point clouds, graphs, flow cytometry, time series) while offering explainability through interpretable discriminative regions in diagrams. The framework favors vectorization-free learning on measures, offers sub-sampling strategies for scalability, and opens avenues for extensions to multi-class or unsupervised settings with solid theoretical underpinnings and practical interpretability.
Abstract
We consider a binary supervised learning classification problem where instead of having data in a finite-dimensional Euclidean space, we observe measures on a compact space $\mathcal{X}$. Formally, we observe data $D_N = (μ_1, Y_1), \ldots, (μ_N, Y_N)$ where $μ_i$ is a measure on $\mathcal{X}$ and $Y_i$ is a label in $\{0, 1\}$. Given a set $\mathcal{F}$ of base-classifiers on $\mathcal{X}$, we build corresponding classifiers in the space of measures. We provide upper and lower bounds on the Rademacher complexity of this new class of classifiers that can be expressed simply in terms of corresponding quantities for the class $\mathcal{F}$. If the measures $μ_i$ are uniform over a finite set, this classification task boils down to a multi-instance learning problem. However, our approach allows more flexibility and diversity in the input data we can deal with. While such a framework has many possible applications, this work strongly emphasizes on classifying data via topological descriptors called persistence diagrams. These objects are discrete measures on $\mathbb{R}^2$, where the coordinates of each point correspond to the range of scales at which a topological feature exists. We will present several classifiers on measures and show how they can heuristically and theoretically enable a good classification performance in various settings in the case of persistence diagrams.
