Scalable Learning from Probability Measures with Mean Measure Quantization

Erell Gachon, Elsa Cazelles, Jérémie Bigot

Abstract

We consider statistical learning problems in which data are observed as a set of probability measures. Optimal transport (OT) is a popular tool to compare and manipulate such objects, but its computational cost becomes prohibitive when the measures have large support. We study a quantization-based approach in which all input measures are approximated by $K$-point discrete measures sharing a common support. We establish consistency of the resulting quantized measures. We further derive convergence guarantees for several OT-based downstream tasks computed from the quantized measures. Numerical experiments on synthetic and real datasets demonstrate that the proposed approach achieves performance comparable to individual quantization while substantially reducing runtime.
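The common-support idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each input measure is given as a point cloud with uniform weights, that the shared $K$-point support is obtained by running Lloyd's (k-means) iterations on the mean measure, and that each measure's weights are the masses it places on the Voronoi cells of the shared centers. The function names `weighted_lloyd` and `quantize_on_common_support` are hypothetical.

```python
import numpy as np


def weighted_lloyd(points, mass, K, n_iter=50, seed=0):
    """Lloyd iterations on a weighted point cloud; returns K centers.

    Assumption: plain k-means is used as the quantizer of the mean measure.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every point to every center, shape (n, K).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            in_cell = labels == k
            if in_cell.any():  # an empty cell keeps its previous center
                centers[k] = np.average(
                    points[in_cell], axis=0, weights=mass[in_cell]
                )
    return centers


def quantize_on_common_support(samples, K, seed=0):
    """samples: list of (n_i, d) arrays, one empirical measure per array.

    Returns (centers, weights): centers has shape (K, d) and weights[i, k]
    is the mass that measure i puts on the Voronoi cell of center k.
    """
    N = len(samples)
    pooled = np.vstack(samples)
    # Each point of measure i carries mass 1 / (N * n_i) in the mean measure.
    mass = np.concatenate(
        [np.full(len(x), 1.0 / (N * len(x))) for x in samples]
    )
    centers = weighted_lloyd(pooled, mass, K, seed=seed)

    weights = np.zeros((N, K))
    for i, x in enumerate(samples):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Fraction of the points of measure i that fall in each cell.
        weights[i] = np.bincount(d2.argmin(axis=1), minlength=K) / len(x)
    return centers, weights


# Toy data (an assumption for illustration): three shifted Gaussian clouds.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, size=(200, 2)) for m in (-2.0, 0.0, 2.0)]
centers, weights = quantize_on_common_support(samples, K=5)
```

Under these assumptions, every measure is summarized by one row of `weights` over the same `centers`, so the ground cost matrix between support points needs to be computed only once and can be reused for every pair of measures in downstream OT computations.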

Paper Structure

This paper contains 37 sections, 15 theorems, 98 equations, 6 figures.

Key Result

Proposition 3.1

Let $(\mu^{(i)})_{1\leq i\leq N}$ be arbitrary probability measures with support included in a compact set $\mathcal{X}\subset\mathbb{R}^d$ and let $\overline{\mu} = \frac{1}{N}\sum_{i=1}^N \mu^{(i)}$ be the mean measure. Suppose that the cardinality of the support of $\overline{\mu}$ is larger than

Figures (6)

  • Figure 1: Convergence of downstream quantities for the Gaussian synthetic dataset. Results are averaged over 20 independent trials and reported with 95% confidence interval.
  • Figure 2: Rare-population synthetic dataset: comparison of mean-measure quantization and individual quantization with the same number of centers ($K=5$). The black circles represent the centers, and their radii are proportional to the corresponding weights.
  • Figure 3: Flow cytometry dataset. LDA classification accuracy and execution times against the number of clusters $K$.
  • Figure 4: Flow cytometry dataset. Projection of the $N=108$ measures on the first two components of PCA.
  • Figure 5: Earth image dataset. Examples of images sampled from the Airbus dataset.
  • ...and 1 more figure

Theorems & Definitions (32)

  • Proposition 3.1
  • Remark 3.2: On the compactness assumption of $\mathcal{X}$
  • Theorem 3.3
  • Remark 3.4
  • Proposition 3.5: Pairwise distances
  • Proposition 4.1: Wasserstein barycenter
  • Proposition 4.2: Statistical dispersion
  • Proposition 4.3
  • Proposition 4.4: Covariance operator
  • Remark 4.5
  • ...and 22 more