Efficient Distribution Learning with Error Bounds in Wasserstein Distance

Eduardo Figueiredo, Steven Adams, Luca Laurenti

TL;DR

This work addresses learning an unknown distribution $\mathbb{P}$ from finite samples by constructing a discrete surrogate $\widehat{\mathbb{P}}$ and providing non-asymptotic, data-driven bounds on the Wasserstein distance ${\mathbb{W}}_{\rho}(\mathbb{P}, {\widehat{\mathbb{P}}})$. It introduces a framework that combines optimal transport, quantization, and concentration inequalities to bound the distance via a tractable mixed-integer linear program whose complexity scales with the discrete support size $M$. A data-driven partition construction paired with a Lloyd-type clustering yields compact discrete approximations with high-confidence error guarantees, adapting to observed data. Empirically, the method yields substantially tighter Wasserstein bounds and smaller supports than state-of-the-art approaches across synthetic and real datasets (e.g., MiniBooNE and OCTMNIST), enabling efficient uncertainty propagation and distributionally robust optimization in practice.

Abstract

The Wasserstein distance has emerged as a key metric to quantify distances between probability distributions, with applications in various fields, including machine learning, control theory, decision theory, and biological systems. Consequently, learning an unknown distribution with non-asymptotic and easy-to-compute error bounds in Wasserstein distance has become a fundamental problem in many fields. In this paper, we devise a novel algorithmic and theoretical framework to approximate an unknown probability distribution $\mathbb{P}$ from a finite set of samples by an approximate discrete distribution $\widehat{\mathbb{P}}$ while bounding the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$. Our framework leverages optimal transport, nonlinear optimization, and concentration inequalities. In particular, we show that, even if $\mathbb{P}$ is unknown, the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$ can be efficiently bounded with high confidence by solving a tractable optimization problem (a mixed integer linear program) of a size that only depends on the size of the support of $\widehat{\mathbb{P}}$. This enables us to develop intelligent clustering algorithms to optimally find the support of $\widehat{\mathbb{P}}$ while minimizing the Wasserstein distance error. On a set of benchmarks, we demonstrate that our approach outperforms state-of-the-art comparable methods by generally returning approximating distributions with substantially smaller support and tighter error bounds.

Paper Structure

This paper contains 30 sections, 6 theorems, 43 equations, 8 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Given a confidence $\beta >0$ and a partition $\{ C_1,...,C_M \}$ with representative points $\{{{c}}_1,\hdots,{{c}}_M\}$, associate to each set $C_i$ parameters ${p_{\ell}^{(i)}},{p_{u}^{(i)}}$, where ${p_{\ell}^{(i)}}=0$ if $\sum_{n=1}^{N} {\mathbbm{1}}_{C_i}({{x}}_n)=0$, otherwise ${p_{\ell}^{(i)}}$ is … and ${p_{u}^{(i)}}=1$ if $\sum_{n=1}^{N} {\mathbbm{1}}_{C_i}({{x}}_n)=N$, otherwise ${p_{u}^{(i)}}$ is …
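The interval endpoints ${p_{\ell}^{(i)}}, {p_{u}^{(i)}}$ are described in the Figure 1 caption as Clopper-Pearson confidence intervals. The following is a minimal sketch of a standard two-sided Clopper-Pearson interval for one region, using only the Python standard library. The function names and the per-region miscoverage level `alpha` are hypothetical; how the paper derives that level from $\beta$ and $M$ is not reproduced here, and the bisection solver is a choice made for self-containment, not the authors' implementation.

```python
import math

def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p), with each term evaluated in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k + 1):
        log_coeff = math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
        total += math.exp(log_coeff + j * log_p + (n - j) * log_q)
    return min(total, 1.0)

def _bisect(f, target, increasing, iters=60):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid  # the root lies above mid
        else:
            hi = mid  # the root lies below mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha):
    """Two-sided Clopper-Pearson interval for k hits out of n samples.

    `alpha` is an assumed per-region miscoverage level (hypothetical name).
    """
    if k == 0:
        lower = 0.0
    else:
        # Smallest p with P[X >= k] = alpha / 2; this tail is increasing in p.
        lower = _bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2, True)
    if k == n:
        upper = 1.0
    else:
        # Largest p with P[X <= k] = alpha / 2; this tail is decreasing in p.
        upper = _bisect(lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper

# Example: three regions containing 0, 37, and 63 of N = 100 samples.
counts, n_samples = [0, 37, 63], 100
intervals = [clopper_pearson(k, n_samples, alpha=0.05) for k in counts]
```

Each entry of `intervals` brackets the empirical frequency `k / n_samples` of its region. The loop in `binom_cdf` costs $O(k)$ per evaluation, so a Beta-quantile routine from a numerical library would be the usual choice for large $N$.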

Figures (8)

  • Figure 1: Illustration of the data-driven construction of the discrete approximation $\widehat{\mathbb{P}}$ as defined in \ref{eq:clusterized-empirical} with $M=30$ for an unknown 2D uniform distribution $\mathbb{P}$ with true support $[-0.3, 0.3]^2$ and assumed support ${\mathcal{X}}=[-0.5, 0.5]^2$ (black box), using two independent datasets ${\mathcal{D}}_{N_{\mathrm{train}}}$ ($N_{\mathrm{train}}=5 \times 10^3$, green dots) and ${\mathcal{D}}_N$ ($N=10^4$, blue dots). (\ref{subfig:step1}) $(M-1)$-means clustering is applied to ${\mathcal{D}}_{N_{\mathrm{train}}}$ to obtain representative points $\{c_i\}_{i=1}^{M-1}$ (black dots). (\ref{subfig:step2}) Radii $\{r_i\}$ are computed based on ${\mathcal{D}}_{N_{\mathrm{train}}}$ as in Algorithm \ref{alg:approximate-construction}, lines 3-5, defining regions $\{C_i\}_{i=1}^{M-1}$ (colored areas). (\ref{subfig:step3}) The probability vector ${{\pi}}$ is computed by counting, for each region, the coverage of dataset ${\mathcal{D}}_N$ (blue dots). (\ref{subfig:step4}) Clopper-Pearson confidence intervals $[{p_{\ell}^{(i)}}, {p_{u}^{(i)}}]$ for the probability of each region $i$ are computed according to \ref{eq:clopper-pearson-lower-bound} and \ref{eq:clopper-pearson-upper-bound} (blue and red arrows). The regions are contained within $L_m$-balls centered at ${{c}}_i$ with radii $r_i$ (grey circles).
  • Figure 2: Illustration of the resulting representative points $\{{{c}}_i\}_{i=1}^M$ (black dots) and partition $\{C_i\}_{i=1}^M$ for different values of $M$ using $N_{\mathrm{train}}=5\times 10^3$ samples from an unknown bimodal 2D Gaussian mixture distribution with support truncated to ${\mathcal{X}}=[-0.5,0.5]^2$. The independent dataset ${\mathcal{D}}_N$ with $N=10^4$ (blue dots) is shown for validation of the resulting partition.
  • Figure 3: Computation times as the support size $M$ increases, for several sample sizes $N$, in the 2D and 100D Gaussian settings (for $\rho=2$).
  • Figure 4: Our bounds and the Fournier bounds (dashed line) for $\rho=2$ on 2D and 10D isotropic Gaussian distributions with decreasing per-dimension variance, for $N=10^4$.
  • Figure 5: The ratio between our bound and the Fournier bound for increasing dimension $d$ for a Uniform (for $\rho=1$) and an isotropic Gaussian (for $\rho=2$) distribution. In $2$D, the Uniform distribution has an $L_\infty$-ball support of diameter $0.2$, and the Gaussian distribution has standard deviation $0.2$ per dimension. For higher dimensions, the support diameter and variance are scaled to preserve a fixed probability mass ratio w.r.t. the support ${\mathcal{X}}$ across dimensions (see Section \ref{subsection:experimental-details}). The support size $M$ is selected from a grid ranging from $5$ to $1000$ so as to minimize the bound for each setting, as reported in Table \ref{table:uniform_gaussian_over_dims} in the Appendix.
  • ...and 3 more figures
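
The first and third steps in the Figure 1 caption (clustering a training set to obtain representative points, then counting a second, independent dataset per region to obtain ${{\pi}}$) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the paper's algorithm: regions are taken to be nearest-center cells, the radius computation of the paper's algorithm (lines 3-5) and the remaining $M$-th region are omitted, and the farthest-point initialization and all function names are hypothetical choices.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def farthest_point_init(points, k):
    """Deterministic initialization (an assumption, not taken from the paper)."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point whose distance to its closest current center is largest.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def lloyd(points, k, iters=50):
    """Plain Lloyd iterations: assign points to centers, then recompute means."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers

def region_weights(points, centers):
    """Fraction of `points` falling in each nearest-center region."""
    counts = [0] * len(centers)
    for p in points:
        counts[nearest(p, centers)] += 1
    return [c / len(points) for c in counts]

# Toy data: two well-separated square blobs in 2D.
rng = random.Random(1)

def blob(cx, cy, n):
    return [(cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1)) for _ in range(n)]

train = blob(-0.3, -0.3, 200) + blob(0.3, 0.3, 200)      # plays the role of D_{N_train}
held_out = blob(-0.3, -0.3, 300) + blob(0.3, 0.3, 100)   # plays the role of D_N

centers = lloyd(train, k=2)
pi = region_weights(held_out, centers)
```

Keeping the clustering data separate from the counting data mirrors the caption's use of two independent datasets: the regions are fixed before the second dataset is examined.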

Theorems & Definitions (15)

  • Theorem 3.1
  • Remark 1: Comparison with existing literature
  • Proposition 3.2
  • Remark 2: Finding the vertex $v$
  • Proposition 3.3
  • Remark 3
  • Proposition 7.1
  • proof
  • proof
  • proof
  • ...and 5 more