Statistical learning on measures: an application to persistence diagrams

Olympio Hacquard; Gilles Blanchard; Clément Levrard

TL;DR

The paper develops a theory of learning in which the inputs are measures on a compact space $\mathcal{X}$. It derives complexity and generalization bounds that relate classifiers on measures to the base classifiers on $\mathcal{X}$ through covering-number and Rademacher-complexity arguments. It treats both discrete inputs ($\mathcal{M}_m(\mathcal{X})$) and generic measures, providing VC bounds inspired by multi-instance learning (MIL) and contraction bounds for Lipschitz losses, and proposes practical rectangle-based algorithms with a learnable aggregation. The primary application is persistence diagrams from Topological Data Analysis, where the authors prove rectangle-discriminability results and analyze the asymptotic convergence of rescaled diagrams to limiting measures, enabling consistent classification under density and sampling changes. Empirically, the method, especially with boosting, achieves competitive accuracy on persistence-diagram tasks and other datasets (point clouds, graphs, flow cytometry, time series) while offering explainability through interpretable discriminative regions in diagrams. The framework favors vectorization-free learning on measures, offers sub-sampling strategies for scalability, and opens avenues for extensions to multi-class or unsupervised settings.

Abstract

We consider a binary supervised learning classification problem where instead of having data in a finite-dimensional Euclidean space, we observe measures on a compact space $\mathcal{X}$. Formally, we observe data $D_N = (\mu_1, Y_1), \ldots, (\mu_N, Y_N)$ where $\mu_i$ is a measure on $\mathcal{X}$ and $Y_i$ is a label in $\{0, 1\}$. Given a set $\mathcal{F}$ of base-classifiers on $\mathcal{X}$, we build corresponding classifiers in the space of measures. We provide upper and lower bounds on the Rademacher complexity of this new class of classifiers that can be expressed simply in terms of corresponding quantities for the class $\mathcal{F}$. If the measures $\mu_i$ are uniform over a finite set, this classification task boils down to a multi-instance learning problem. However, our approach allows more flexibility and diversity in the input data we can deal with. While such a framework has many possible applications, this work places strong emphasis on classifying data via topological descriptors called persistence diagrams. These objects are discrete measures on $\mathbb{R}^2$, where the coordinates of each point correspond to the range of scales at which a topological feature exists. We present several classifiers on measures and show how they can, heuristically and theoretically, achieve good classification performance in various settings in the case of persistence diagrams.
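As a rough illustration of how a base classifier on $\mathcal{X}$ can induce a classifier on measures, the sketch below takes the indicator of an axis-aligned rectangle as the base classifier and thresholds the mass that a discrete measure assigns to it. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `rectangle_mass` and `classify_measure`, the fixed threshold, the uniform weights, and the toy point sets are all hypothetical choices made for the example.

```python
import numpy as np


def rectangle_mass(points, weights, lower, upper):
    """Mass that the discrete measure sum_i weights[i] * delta_{points[i]}
    assigns to the closed axis-aligned rectangle [lower, upper]."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # A point is inside when every coordinate lies between the two corners.
    inside = np.all((points >= lower) & (points <= upper), axis=1)
    return float(weights[inside].sum())


def classify_measure(points, weights, lower, upper, threshold):
    """Hypothetical measure classifier: label 1 when the rectangle mass
    reaches the threshold, label 0 otherwise."""
    return int(rectangle_mass(points, weights, lower, upper) >= threshold)


# Two toy diagram-like point sets in R^2 (illustrative values only),
# each read as a uniform measure over its three points.
diagram_a = np.array([[0.1, 0.2], [0.2, 0.9], [0.3, 0.35]])
diagram_b = np.array([[0.1, 0.15], [0.4, 0.45], [0.5, 0.6]])
uniform = np.full(3, 1.0 / 3.0)

# Rectangle [0, 0.5] x [0.8, 1.0]; only one point of diagram_a falls in it.
lower = np.array([0.0, 0.8])
upper = np.array([0.5, 1.0])

label_a = classify_measure(diagram_a, uniform, lower, upper, threshold=0.25)
label_b = classify_measure(diagram_b, uniform, lower, upper, threshold=0.25)
print(label_a, label_b)
```

In this toy setting the rectangle and threshold are fixed by hand; in a learning problem they would be selected from the labelled sample $D_N$, for instance by minimizing the empirical classification error over a family of candidate rectangles.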

Paper Structure (24 sections, 8 theorems, 88 equations, 7 figures, 5 tables)

Introduction
Statistical learning on measures
Model
Theoretical complexity bounds
Discrete measures, 0-1 loss
Generic measures, Lipschitz loss
Algorithms, application to rectangle-based classification
A leading case study: classifying persistence diagrams
An introduction to persistence diagrams
Structural properties of persistence diagrams
Examples
Quantitative experiments
Persistence diagrams
Other datasets
Discussion
...and 9 more sections

Key Result

Proposition 2.1

Assume all the input measures belong to $\mathcal{M}_m(\mathcal{X})$. Assume $\psi$ is taken from a class $\mathcal{G}$ of permutation invariant functions and that the corresponding $\bar{\psi}$ is taken from a class $\bar{\mathcal{G}}$ of VC-dimension $d^\prime$. We further assume that the class [...]

Figures (7)

Figure 1: 0, 1 and 2-persistence diagrams for $n$ points uniformly sampled on a torus.
Figure 2: Data to classify. Yellow: torus, purple: sphere.
Figure 3: Best rectangle to classify points from a sphere or a torus.
Figure 4: Boosting for manifold classification.
Figure 5: Stability with respect to sampling noise.
...and 2 more figures

Theorems & Definitions (15)

Proposition 2.1
Lemma 2.2
Theorem 2.3
Theorem 2.4
Example
Proposition 2.5
Example
Definition 3.1
Definition 3.2
Definition 3.3
...and 5 more
